US - Baby Names
Introduction:
We are going to use a subset of the US Baby Names dataset from Kaggle. The file contains names from 2004 through 2014.
Step 1. Import the necessary libraries
In [1]:
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called baby_names.
In [2]:
Out[2]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
Unnamed: 0    1016395 non-null int64
Id            1016395 non-null int64
Name          1016395 non-null object
Year          1016395 non-null int64
Gender        1016395 non-null object
State         1016395 non-null object
Count         1016395 non-null int64
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
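The loading code is not preserved in this copy, and neither is the dataset address. A minimal sketch of Steps 1 to 3, assuming the file is a CSV whose first header cell is empty (which is how pandas ends up with an 'Unnamed: 0' column); the three rows in csv_text are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file. The empty first header
# cell is what makes pandas name that column 'Unnamed: 0'.
csv_text = """,Id,Name,Year,Gender,State,Count
0,1,Emma,2004,F,AK,62
1,2,Madison,2004,F,AK,48
2,3,Jacob,2004,M,AK,50
"""

# With the real data, pass the dataset's URL or file path instead of the buffer.
baby_names = pd.read_csv(io.StringIO(csv_text))

# info() prints the index range, column dtypes and non-null counts.
baby_names.info()
```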
Step 4. See the first 10 entries
In [3]:
Out[3]:
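One way to show the first 10 entries is DataFrame.head(10). The sketch below uses a made-up 12-row frame (placeholder names, not real data) so the truncation is visible:

```python
import pandas as pd

# Hypothetical frame with 12 rows so that head(10) visibly truncates.
baby_names = pd.DataFrame({
    "Name": [f"Name{i}" for i in range(12)],
    "Count": list(range(12)),
})

# head(10) keeps the rows at positions 0 through 9.
first_ten = baby_names.head(10)
print(first_ten)
```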
Step 5. Delete the columns 'Unnamed: 0' and 'Id'
In [4]:
Out[4]:
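A possible way to delete both columns in one call is DataFrame.drop with the columns argument. The two-row frame is an illustrative stand-in for the real data:

```python
import pandas as pd

# Hypothetical stand-in that carries the two columns to be removed.
baby_names = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Id": [1, 2],
    "Name": ["Emma", "Jacob"],
    "Count": [62, 50],
})

# drop returns a new frame, so reassign to keep the result.
baby_names = baby_names.drop(columns=["Unnamed: 0", "Id"])
print(baby_names.columns.tolist())
```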
Step 6. Are there more male or female names in the dataset?
In [5]:
Out[5]:
F 558846
M 457549
Name: Gender, dtype: int64
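The two figures above add up to 1016395, the number of rows in the dataset, so they count records (one per name, year, state and gender), not babies. The sketch below, on made-up data, contrasts that row count with the sum of the Count column; the variable names are illustrative:

```python
import pandas as pd

# Hypothetical miniature dataset.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Ava", "Mia", "Jacob", "Liam"],
    "Gender": ["F", "F", "F", "M", "M"],
    "Count": [10, 5, 7, 40, 30],
})

# Number of rows per gender (the kind of figure shown in the output above).
rows_per_gender = baby_names["Gender"].value_counts()

# Number of babies per gender, which can rank the genders differently.
births_per_gender = baby_names.groupby("Gender")["Count"].sum()

print(rows_per_gender)
print(births_per_gender)
```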
Step 7. Group the dataset by name and assign to names
In [6]:
Out[6]:
(17632, 1)
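The shape (17632, 1) suggests that only one column, presumably Count, survives the aggregation. A minimal sketch under that assumption selects Count explicitly before summing, which also avoids aggregating Year or the text columns; the data is made up:

```python
import pandas as pd

# Hypothetical miniature dataset.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Liam", "Ava"],
    "Year": [2004, 2005, 2004, 2005, 2004, 2004],
    "Gender": ["F", "F", "M", "M", "M", "F"],
    "State": ["AK", "AK", "AK", "TX", "TX", "TX"],
    "Count": [10, 12, 20, 25, 5, 5],
})

# Double brackets keep the result a one-column DataFrame indexed by Name.
names = baby_names.groupby("Name")[["Count"]].sum()
print(names.shape)
print(names.sort_values("Count", ascending=False))
```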
Step 8. How many different names exist in the dataset?
In [7]:
Out[7]:
17632
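The answer matches the first entry of the shape in Step 7, because the grouped frame has one row per name. Two equivalent ways to get the figure, sketched on made-up data:

```python
import pandas as pd

# Hypothetical miniature dataset with repeated names.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Liam", "Liam", "Liam"],
    "State": ["AK", "TX", "AK", "AK", "TX", "CA"],
    "Count": [10, 12, 20, 5, 6, 7],
})

# Option 1: count the distinct values directly.
distinct_names = baby_names["Name"].nunique()

# Option 2: count the rows of the per-name aggregation.
names = baby_names.groupby("Name")[["Count"]].sum()

print(distinct_names, len(names))
```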
Step 9. Which name has the most occurrences?
In [8]:
Out[8]:
'Jacob'
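One way to get this label is Series.idxmax on the per-name totals, which returns the index label of the largest value (the first one if several tie). A sketch on made-up totals:

```python
import pandas as pd

# Hypothetical miniature dataset.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Jacob", "Liam"],
    "Count": [10, 12, 20, 25, 5],
})

names = baby_names.groupby("Name")[["Count"]].sum()

# idxmax returns the index label (here, the name) of the largest total.
top_name = names["Count"].idxmax()
print(top_name)
```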
Step 10. How many different names have the least occurrences?
In [9]:
Out[9]:
2578
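Here "least occurrences" is read as: names whose total equals the smallest total in the grouped frame. A sketch of that count on made-up data, where two names share the minimum:

```python
import pandas as pd

# Hypothetical miniature dataset; Liam and Ava tie for the smallest total.
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Liam", "Ava"],
    "Count": [10, 12, 20, 5, 5],
})

names = baby_names.groupby("Name")[["Count"]].sum()

# Boolean mask of names at the minimum total; summing it counts the True values.
at_minimum = names["Count"] == names["Count"].min()
n_least = int(at_minimum.sum())
print(n_least)
```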
Step 11. What is the median name occurrence?
In [10]:
Out[10]:
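The output of this cell is not preserved here. A possible approach, sketched on made-up data, computes the median of the per-name totals and then lists the names whose total equals it. With an even number of names the median can fall between two totals, in which case the filter returns no rows:

```python
import pandas as pd

# Hypothetical miniature dataset with five names (an odd number).
baby_names = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob", "Liam", "Ava", "Noah"],
    "Count": [10, 12, 45, 5, 5, 8],
})

names = baby_names.groupby("Name")[["Count"]].sum()

# Totals are 5, 5, 8, 22, 45, so the median is the middle value.
median_count = names["Count"].median()

# Names whose total is exactly the median.
at_median = names[names["Count"] == median_count]
print(median_count)
print(at_median)
```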
Step 12. What is the standard deviation of name occurrences?
In [11]:
Out[11]:
11006.069467891111
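Note that pandas computes the sample standard deviation by default (ddof=1), which divides by n - 1 and not by n. A small sketch on made-up totals shows the difference from the population value:

```python
import pandas as pd

# Hypothetical per-name totals: mean 5, squared deviations sum to 32.
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="Count")

# Default: sample standard deviation, sqrt(32 / 7).
sample_std = counts.std()

# Population standard deviation, sqrt(32 / 8).
population_std = counts.std(ddof=0)

print(sample_std, population_std)
```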
Step 13. Get a summary with the mean, min, max, std and quartiles.
In [12]:
Out[12]:
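The summary table is not preserved in this copy. DataFrame.describe returns exactly these statistics for numeric columns; a sketch on made-up per-name totals:

```python
import pandas as pd

# Hypothetical per-name totals, indexed by name.
names = pd.DataFrame(
    {"Count": [5, 5, 8, 22, 45]},
    index=pd.Index(["Ava", "Liam", "Noah", "Emma", "Jacob"], name="Name"),
)

# describe reports count, mean, std, min, the three quartiles and max.
summary = names.describe()
print(summary)
```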