guipsamora
GitHub Repository: guipsamora/pandas_exercises
Path: blob/master/06_Stats/US_Baby_Names/Solutions.ipynb
Kernel: Python [default]

US - Baby Names

Introduction:

We are going to use a subset of US Baby Names from Kaggle. The file contains names from 2004 to 2014.

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
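
Steps 1 to 3 might look roughly like the sketch below. The dataset address is not reproduced here, so the example reads a small, made-up CSV from an in-memory string instead; the rows are hypothetical and only the column layout follows the info() output above.

```python
import io

import pandas as pd

# Hypothetical sample rows; only the column layout mirrors the real file.
# The empty first header cell is what pandas turns into 'Unnamed: 0'.
csv_text = """,Id,Name,Year,Gender,State,Count
0,1,Emma,2004,F,AK,10
1,2,Jacob,2004,M,AK,25
2,3,Emma,2005,F,CA,12
"""

# With the real dataset, the file path or URL would replace the StringIO object.
baby_names = pd.read_csv(io.StringIO(csv_text))

# Prints the column names, non-null counts and dtypes.
baby_names.info()
```
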

Step 4. See the first 10 entries
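
One way to do this is with `head`, sketched here on a hypothetical 12-row frame rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame with 12 rows, just to show what head(10) keeps.
baby_names = pd.DataFrame({
    "Name": [f"name_{i}" for i in range(12)],
    "Count": list(range(12)),
})

# head(n) returns the first n rows; the default is 5.
first_ten = baby_names.head(10)
print(first_ten)
```
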

Step 5. Delete the columns 'Unnamed: 0' and 'Id'
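
A minimal sketch using `DataFrame.drop`, assuming a toy frame with the same column names as the dataset (the rows are made up).

```python
import pandas as pd

# Hypothetical rows; the column names follow the dataset.
baby_names = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Id": [1, 2, 3],
    "Name": ["Emma", "Jacob", "Emma"],
    "Year": [2004, 2004, 2005],
    "Gender": ["F", "M", "F"],
    "State": ["AK", "AK", "CA"],
    "Count": [10, 25, 12],
})

# drop returns a new frame, so the result is assigned back.
baby_names = baby_names.drop(columns=["Unnamed: 0", "Id"])
print(baby_names.columns.tolist())
```

Using `del baby_names["Id"]` for each column would also work; `drop` removes both in one call.
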

Step 6. Are there more male or female names in the dataset?

F    558846
M    457549
Name: Gender, dtype: int64
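
The output above looks like the result of `value_counts` on the Gender column, which counts rows (one per name, year and state), not babies. The sketch below, on made-up rows, shows that approach next to an alternative that sums Count per gender; which one answers the question depends on how "more names" is read.

```python
import pandas as pd

# Hypothetical rows with the dataset's column names.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Ava", "Jacob"],
    "Gender": ["F", "M", "F", "F", "M"],
    "Count": [10, 25, 12, 5, 30],
})

# Number of rows per gender.
rows_per_gender = baby_names["Gender"].value_counts()

# Alternative reading: total number of babies per gender.
babies_per_gender = baby_names.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(babies_per_gender)
```
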

Step 7. Group the dataset by name and assign to names

(17632, 1)
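
The single column in the shape above suggests that only Count survives the aggregation. A sketch under that assumption, on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows with the dataset's column names.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Liam", "Jacob", "Ava"],
    "Year": [2004, 2004, 2005, 2005, 2006, 2006],
    "Gender": ["F", "M", "F", "M", "M", "F"],
    "State": ["AK", "AK", "CA", "CA", "TX", "TX"],
    "Count": [10, 25, 12, 5, 30, 5],
})

# Selecting [["Count"]] keeps the result a one-column DataFrame and
# avoids aggregating Year or the text columns.
names = baby_names.groupby("Name")[["Count"]].sum()

print(names.shape)
print(names.sort_values("Count", ascending=False))
```
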

Step 8. How many different names exist in the dataset?

17632
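
Two equivalent ways to count distinct names, sketched on made-up rows: the length of the grouped frame, or `nunique` on the original column.

```python
import pandas as pd

# Hypothetical rows with the dataset's column names.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Liam", "Jacob", "Ava"],
    "Count": [10, 25, 12, 5, 30, 5],
})
names = baby_names.groupby("Name")[["Count"]].sum()

# One row per distinct name in the grouped frame.
n_from_groups = len(names)

# Same answer without grouping.
n_from_column = baby_names["Name"].nunique()

print(n_from_groups, n_from_column)
```
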

Step 9. What is the name with most occurrences?

'Jacob'
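
One way to get this is `idxmax`, which returns the index label of the largest value. The `names` frame below is a hypothetical stand-in for the grouped data.

```python
import pandas as pd

# Hypothetical grouped totals, indexed by name.
names = pd.DataFrame(
    {"Count": [22, 55, 5, 5]},
    index=pd.Index(["Emma", "Jacob", "Liam", "Ava"], name="Name"),
)

# idxmax gives the label (here, the name) of the row with the highest Count.
most_common = names["Count"].idxmax()
print(most_common)
```
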

Step 10. How many different names have the least occurrences?

2578
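
A sketch of one possible approach: compare every total against the minimum and count the matches. The `names` frame is hypothetical.

```python
import pandas as pd

# Hypothetical grouped totals, indexed by name.
names = pd.DataFrame(
    {"Count": [22, 55, 5, 5]},
    index=pd.Index(["Emma", "Jacob", "Liam", "Ava"], name="Name"),
)

# Boolean mask of names whose total equals the smallest total.
is_least = names["Count"] == names["Count"].min()

# Summing a boolean Series counts the True values.
n_least = int(is_least.sum())
print(n_least)
```
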

Step 11. What is the median name occurrence?
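
The notebook shows no output for this step, so the following is only a sketch of one reading: compute the median of the per-name totals, then optionally list the names that sit exactly at that value. The `names` frame is hypothetical.

```python
import pandas as pd

# Hypothetical grouped totals, indexed by name.
names = pd.DataFrame(
    {"Count": [22, 55, 5, 5, 8]},
    index=pd.Index(["Emma", "Jacob", "Liam", "Ava", "Noah"], name="Name"),
)

# Median of the per-name totals.
median_count = names["Count"].median()

# Names whose total equals the median (may be empty for an even-sized sample).
at_median = names[names["Count"] == median_count]

print(median_count)
print(at_median)
```
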

Step 12. What is the standard deviation of name occurrences?

11006.069467891111
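
A minimal sketch on a hypothetical `names` frame. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, which differs from NumPy's population default.

```python
import pandas as pd

# Hypothetical grouped totals, indexed by name.
names = pd.DataFrame(
    {"Count": [22, 55, 5, 5]},
    index=pd.Index(["Emma", "Jacob", "Liam", "Ava"], name="Name"),
)

# Sample standard deviation (divides by n - 1).
sample_std = names["Count"].std()

# Population standard deviation (divides by n), for comparison.
population_std = names["Count"].std(ddof=0)

print(sample_std, population_std)
```
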

Step 13. Get a summary with the mean, min, max, std and quartiles.
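
`describe` returns all of these statistics in one call. A sketch on a hypothetical `names` frame:

```python
import pandas as pd

# Hypothetical grouped totals, indexed by name.
names = pd.DataFrame(
    {"Count": [22, 55, 5, 5]},
    index=pd.Index(["Emma", "Jacob", "Liam", "Ava"], name="Name"),
)

# For numeric columns, describe reports count, mean, std, min,
# the 25%, 50% and 75% quantiles, and max.
summary = names.describe()
print(summary)
```
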