GitHub Repository: guipsamora/pandas_exercises
Path: blob/master/06_Stats/US_Baby_Names/Solutions.ipynb
Kernel: Python 3

US - Baby Names

Introduction:

We are going to use a subset of the US Baby Names dataset from Kaggle. The file contains names from 2004 through 2014.

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called baby_names.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype
---  ------      --------------    -----
 0   Unnamed: 0  1016395 non-null  int64
 1   Id          1016395 non-null  int64
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
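Steps 1 to 3 can be sketched as follows. The dataset address is not reproduced here, so this sketch reads a small, made-up CSV from an in-memory string instead; `csv_text` and its rows are hypothetical stand-ins, and only the column layout mirrors the summary above. With the real file, the URL would presumably be passed to `pd.read_csv` in place of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for the real CSV file.
# The empty first header field is what pandas labels 'Unnamed: 0'.
csv_text = """,Id,Name,Year,Gender,State,Count
0,1,Emma,2004,F,AK,62
1,2,Madison,2004,F,AK,48
2,3,Jacob,2004,M,AK,50
"""

# Step 3: assign the loaded data to a variable called baby_names.
baby_names = pd.read_csv(io.StringIO(csv_text))

# Print column names, non-null counts, and dtypes.
baby_names.info()
```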

Step 4. See the first 10 entries
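A minimal sketch of this step, assuming a hypothetical 12-row frame in place of the real data: `DataFrame.head(10)` returns the first ten rows.

```python
import pandas as pd

# Hypothetical stand-in frame with 12 rows (not the real dataset).
baby_names = pd.DataFrame({
    "Name": [f"Name{i}" for i in range(12)],
    "Count": list(range(12)),
})

# head(n) returns the first n rows; the default is 5, so 10 is passed explicitly.
first_ten = baby_names.head(10)
print(first_ten)
```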

Step 5. Delete the columns 'Unnamed: 0' and 'Id'
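One way to do this is `DataFrame.drop` with the `columns` argument. The sketch below uses a hypothetical two-row frame that only mirrors the real column names.

```python
import pandas as pd

# Hypothetical stand-in frame with the same columns as the real dataset.
baby_names = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Id": [1, 2],
    "Name": ["Emma", "Jacob"],
    "Year": [2004, 2004],
    "Gender": ["F", "M"],
    "State": ["AK", "AK"],
    "Count": [62, 50],
})

# drop returns a new frame, so the result is assigned back to baby_names.
baby_names = baby_names.drop(columns=["Unnamed: 0", "Id"])
print(baby_names.columns.tolist())
```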

Step 6. What year has the highest number of baby names in the dataset?

Count    2007
dtype: int64
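A result like the one above can be obtained by summing `Count` per year and taking the label of the maximum. This is a sketch on hypothetical data, not necessarily the notebook's exact expression; it returns a single year rather than a one-element Series.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real dataset).
baby_names = pd.DataFrame({
    "Name": ["Emma", "Jacob", "Emma", "Jacob", "Liam"],
    "Year": [2004, 2004, 2005, 2005, 2005],
    "Count": [10, 20, 12, 25, 5],
})

# Total Count per year, then the year whose total is largest.
count_per_year = baby_names.groupby("Year")["Count"].sum()
top_year = count_per_year.idxmax()
print(top_year)
```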

Step 7. Are there more male or female names in the dataset?

F    558846
M    457549
Name: Gender, dtype: int64
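Counts of this kind come from `Series.value_counts`, which tallies rows per category. Note that this counts name records per gender, not babies. The frame below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real dataset).
baby_names = pd.DataFrame({
    "Name": ["Emma", "Ava", "Mia", "Jacob", "Liam"],
    "Gender": ["F", "F", "F", "M", "M"],
})

# Number of rows for each gender, sorted from most to least frequent.
gender_counts = baby_names["Gender"].value_counts()
print(gender_counts)
```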

Step 8. Group the dataset by name and assign the result to names

(17632, 1)
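A shape of (number of names, 1) suggests that only the `Count` column is summed per name. A minimal sketch under that assumption, on hypothetical data; selecting `[["Count"]]` keeps the result a one-column DataFrame and avoids summing the text columns.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real dataset).
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Liam"],
    "Gender": ["F", "F", "M", "M", "M"],
    "State": ["AK", "CA", "AK", "CA", "CA"],
    "Count": [10, 12, 20, 25, 5],
})

# One row per name, holding the total Count for that name.
names = baby_names.groupby("Name")[["Count"]].sum()
print(names.shape)
```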

Step 9. How many different names exist in the dataset?

17632
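The number of distinct names equals the number of groups in the previous step; it can also be read directly with `Series.nunique`. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in rows (not the real dataset).
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Liam"],
    "Count": [10, 12, 20, 25, 5],
})

# Count distinct values in the Name column.
n_names = baby_names["Name"].nunique()

# Equivalent: the number of rows after grouping by name.
n_groups = len(baby_names.groupby("Name")["Count"].sum())
print(n_names, n_groups)
```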

Step 10. What is the name with the most occurrences?

'Jacob'
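With per-name totals indexed by name, `idxmax` returns the name with the largest total. The `names` frame here is a hypothetical stand-in shaped like the grouped result from Step 8.

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name (not the real data).
names = pd.DataFrame(
    {"Count": [45, 22, 5, 5, 9]},
    index=pd.Index(["Jacob", "Emma", "Liam", "Ava", "Noah"], name="Name"),
)

# idxmax returns the index label (the name) of the largest Count.
top_name = names["Count"].idxmax()
print(top_name)
```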

Step 11. How many different names have the least occurrences?

2578
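One reading of this question is: find the minimum total, then count how many names share it. A sketch under that assumption, using a hypothetical `names` frame shaped like the grouped result from Step 8:

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name (not the real data).
names = pd.DataFrame(
    {"Count": [45, 22, 5, 5, 9]},
    index=pd.Index(["Jacob", "Emma", "Liam", "Ava", "Noah"], name="Name"),
)

# Smallest total, then the number of names whose total equals it.
least = names["Count"].min()
n_least = int((names["Count"] == least).sum())
print(least, n_least)
```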

Step 12. What is the median name occurrence?
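No output is shown for this step. A minimal sketch, assuming the question asks for the median of the per-name totals and, optionally, the names whose total equals it; the `names` frame is a hypothetical stand-in shaped like the grouped result from Step 8.

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name (not the real data).
names = pd.DataFrame(
    {"Count": [45, 22, 5, 5, 9]},
    index=pd.Index(["Jacob", "Emma", "Liam", "Ava", "Noah"], name="Name"),
)

# Median of the per-name totals.
median_count = names["Count"].median()

# Names whose total is exactly the median (may be empty on other data).
names_at_median = names[names["Count"] == median_count]
print(median_count)
print(names_at_median)
```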

Step 13. What is the standard deviation of name occurrences?

11006.069467891111
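A single float like this is what `Series.std` returns. pandas uses the sample standard deviation (`ddof=1`) by default. The `names` frame below is a hypothetical stand-in shaped like the grouped result from Step 8.

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name (not the real data).
names = pd.DataFrame(
    {"Count": [45, 22, 5, 5, 9]},
    index=pd.Index(["Jacob", "Emma", "Liam", "Ava", "Noah"], name="Name"),
)

# Sample standard deviation (ddof=1) of the per-name totals.
count_std = names["Count"].std()
print(count_std)
```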

Step 14. Get a summary with the mean, min, max, std and quartiles.
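`DataFrame.describe` produces exactly these statistics for numeric columns: count, mean, std, min, the 25%, 50% and 75% quartiles, and max. A sketch on a hypothetical `names` frame shaped like the grouped result from Step 8:

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name (not the real data).
names = pd.DataFrame(
    {"Count": [45, 22, 5, 5, 9]},
    index=pd.Index(["Jacob", "Emma", "Liam", "Ava", "Noah"], name="Name"),
)

# Summary statistics for the numeric Count column.
summary = names.describe()
print(summary)
```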