GitHub Repository: ycchen00/Introduction-to-Data-Science-in-Python
Path: blob/main/assignments/assignment2/assignment2.ipynb

Kernel: Python 3

Assignment 2

For this assignment you'll be looking at 2017 data on immunizations from the CDC. Your data file for this assignment is in assets/NISPUF17.csv. A data user's guide, which you'll need to map the variables in the data to the questions being asked, is available at assets/NIS-PUF17-DUG.pdf. Note: you may have to go to your Jupyter tree (click on the Coursera image) and navigate to the assignment 2 assets folder to see this PDF file.

Question 1

Write a function called proportion_of_education which returns the proportion of children in the dataset whose mother's education level was less than high school (<12), high school (12), more than high school but not a college graduate (>12), or college degree.

This function should return a dictionary in the form of (use the correct numbers, do not round numbers):

    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}

test code
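Before working on the real file, the calculation Question 1 asks for can be sketched on a small made-up Series. The toy values and the `labels` mapping are assumptions for illustration only; the mapping of codes 1 to 4 follows the one used in the cells of this notebook, and the data user's guide remains the authority on the coding.

```python
import pandas as pd

# Hypothetical toy data standing in for the EDUC1 column (codes 1-4)
educ = pd.Series([1, 2, 2, 3, 4, 4, 4, 4, 3, 2])

# Assumed mapping from code to the dictionary keys the question asks for
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# value_counts(normalize=True) gives each code's share of all rows
shares = educ.value_counts(normalize=True)
proportions = {label: float(shares.get(code, 0.0)) for code, label in labels.items()}
print(proportions)
```

Using `shares.get(code, 0.0)` keeps all four keys in the result even if a code never occurs in the data.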

In [1]:

import pandas as pd
df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
df.head()

Out[1]:

In [2]:

df.columns

Out[2]:

Index(['SEQNUMC', 'SEQNUMHH', 'PDAT', 'PROVWT_D', 'RDDWT_D', 'STRATUM', 'YEAR',
       'AGECPOXR', 'HAD_CPOX', 'AGEGRP',
       ...
       'XVRCTY2', 'XVRCTY3', 'XVRCTY4', 'XVRCTY5', 'XVRCTY6', 'XVRCTY7',
       'XVRCTY8', 'XVRCTY9', 'INS_STAT2_I', 'INS_BREAK_I'],
      dtype='object', length=453)

In [3]:

EDUS=df['EDUC1']
EDUS

Out[3]:

      4
      3
      3
      4
      1
      ..
      3
      2
      3
      2
      4
Name: EDUC1, Length: 28465, dtype: int64

In [4]:

import numpy as np
edus=np.sort(EDUS.values)
edus

Out[4]:

array([1, 1, 1, ..., 4, 4, 4], dtype=int64)

In [5]:

poe={"less than high school":0,
    "high school":0,
    "more than high school but not college":0,
    "college":0}

In [6]:

n=len(edus)
poe["less than high school"]=np.sum(edus==1)/n
poe["high school"]=np.sum(edus==2)/n
poe["more than high school but not college"]=np.sum(edus==3)/n
poe["college"]=np.sum(edus==4)/n

In [7]:

poe

Out[7]:

{'college': 0.47974705779026877,
 'high school': 0.172352011241876,
 'less than high school': 0.10202002459160373,
 'more than high school but not college': 0.24588090637625154}

answer

In [8]:

def proportion_of_education():
    import pandas as pd
    import numpy as np
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    # EDUC1 holds the mother's education level, coded 1-4 in the order of the keys below
    edus=df['EDUC1'].values
    n=len(edus)
    # share of rows carrying each code; sorting is not needed for counting
    poe={"less than high school":np.sum(edus==1)/n,
        "high school":np.sum(edus==2)/n,
        "more than high school but not college":np.sum(edus==3)/n,
        "college":np.sum(edus==4)/n}
    return poe
proportion_of_education()

Out[8]:

{'college': 0.47974705779026877,
 'high school': 0.172352011241876,
 'less than high school': 0.10202002459160373,
 'more than high school but not college': 0.24588090637625154}

In [9]:

assert type(proportion_of_education())==type({}), "You must return a dictionary."
assert len(proportion_of_education()) == 4, "You have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."

Question 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

This function should return a tuple in the form (use the correct numbers):

(2.5, 0.1)
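The grouping logic behind Question 2 can be sketched on a small made-up frame. The toy values are assumptions; the codes follow the ones used in the cells of this notebook (CBF_01 equal to 1 for yes and 2 for no, with any other code treated as unknown), and rows without a recorded dose count are dropped rather than counted as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 99 stands in for an "unknown" answer code
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 1, 2, 2, 99, 1],
    "P_NUMFLU": [2.0, 3.0, np.nan, 1.0, 0.0, 4.0, 1.0],
})

# Keep only children with a known yes/no answer and a recorded dose count
known = toy[toy["CBF_01"].isin([1, 2])].dropna(subset=["P_NUMFLU"])

# Mean number of doses per group, returned as (yes, no)
means = known.groupby("CBF_01")["P_NUMFLU"].mean()
result = (float(means.loc[1]), float(means.loc[2]))
print(result)  # (2.0, 0.5)
```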

test code

In [10]:

import pandas as pd
df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
df.head()

Out[10]:

In [11]:

cbf_flu=df.loc[:,['CBF_01','P_NUMFLU']]
cbf_flu

Out[11]:

In [12]:

cbf_flu
cbf_flu1=cbf_flu[cbf_flu['CBF_01'] ==1].dropna()
cbf_flu2=cbf_flu[cbf_flu['CBF_01'] ==2].dropna()

In [13]:

flu2=cbf_flu2['P_NUMFLU'].values.copy()
flu2[np.isnan(flu2)] = 0
f2=np.sum(flu2)/len(flu2)

In [14]:

flu1=cbf_flu1['P_NUMFLU'].values.copy()
flu1[np.isnan(flu1)] = 0
f1=np.sum(flu1)/len(flu1)

In [15]:

cbf_flu2['P_NUMFLU'].values

Out[15]:

array([3., 0., 3., ..., 1., 1., 2.])

In [16]:

cbf_flu1

Out[16]:

In [17]:

# cbf=df['CBF_01'].values
# (np.sum(cbf)-(np.sum(cbf==2)*2+np.sum(cbf==1)+np.sum(cbf==99)*99))/(len(cbf)-(np.sum(cbf==2)+np.sum(cbf==1)+np.sum(cbf==99)))

In [18]:

aid =(f1,f2)
aid

Out[18]:

(1.8799187420058687, 1.5963945918878317)

answer

In [19]:

def average_influenza_doses():
    import pandas as pd
    import numpy as np
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)

    cbf_flu=df.loc[:,['CBF_01','P_NUMFLU']]

    # CBF_01: 1 = received breastmilk, 2 = did not
    # dropna() removes the rows that have no recorded dose count
    cbf_flu1=cbf_flu[cbf_flu['CBF_01'] ==1].dropna()
    cbf_flu2=cbf_flu[cbf_flu['CBF_01'] ==2].dropna()

    # no NaN remains after dropna(), so sum/length is the mean
    flu1=cbf_flu1['P_NUMFLU'].values
    f1=np.sum(flu1)/len(flu1)

    flu2=cbf_flu2['P_NUMFLU'].values
    f2=np.sum(flu2)/len(flu2)

    aid =(f1,f2)
    return aid
average_influenza_doses()

Out[19]:

(1.8799187420058687, 1.5963945918878317)

In [20]:

assert len(average_influenza_doses())==2, "Return two values in a tuple, the first for yes and the second for no."

Question 3

It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chicken pox. Return results by sex.

This function should return a dictionary in the form of (use the correct numbers):

    {"male":0.2,
    "female":0.4}

Note: To aid in verification, the chickenpox_by_sex()['female'] value the autograder is looking for starts with the digits 0.0077.
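The ratio described in Question 3 can be sketched with a cross-tabulation on made-up data. The toy values are assumptions; the codes follow the ones used in the cells of this notebook (SEX 1 for male and 2 for female, HAD_CPOX 1 for yes and 2 for no).

```python
import pandas as pd

# Hypothetical toy data; the last row is unvaccinated and must be excluded
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2, 1],
    "HAD_CPOX": [1, 2, 2, 2, 1, 2, 2, 2, 2, 2],
    "P_NUMVRC": [1, 1, 2, 1, 2, 1, 1, 2, 1, 0],
})

# Keep children with at least one varicella dose
vaccinated = toy[toy["P_NUMVRC"] > 0]

# Rows are SEX, columns are HAD_CPOX; each cell is a count of children
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Ratio of "had chickenpox" to "did not have chickenpox", per sex
ratio = counts[1] / counts[2]
result = {"male": float(ratio.loc[1]), "female": float(ratio.loc[2])}
print(result)
```

Note that this is a ratio of the two counts, not the share of vaccinated children who had chickenpox; the two differ in their denominators.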

test code

In [21]:

import pandas as pd
df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
df.head()

Out[21]:

In [22]:

cpo_vrc_sex=df.loc[:,['HAD_CPOX','P_NUMVRC','SEX']]
cpo_sex=cpo_vrc_sex[cpo_vrc_sex['P_NUMVRC'].gt(0)].loc[:,['HAD_CPOX','SEX']]

In [23]:

cpo_sex

Out[23]:

In [24]:

cpo_sex=df[df['P_NUMVRC'].gt(0) & df['HAD_CPOX'].lt(3)].loc[:,['HAD_CPOX','SEX']].dropna()

In [25]:

np.min(cpo_sex.values[:,0])

Out[25]:

1

In [26]:

#Male 1 Female 2
cpo1_sex1=len(cpo_sex[(cpo_sex['HAD_CPOX']==1) & (cpo_sex['SEX']==1)])
cpo1_sex2=len(cpo_sex[(cpo_sex['HAD_CPOX']==1) & (cpo_sex['SEX']==2)])
cpo2_sex1=len(cpo_sex[(cpo_sex['HAD_CPOX']==2) & (cpo_sex['SEX']==1)])
cpo2_sex2=len(cpo_sex[(cpo_sex['HAD_CPOX']==2) & (cpo_sex['SEX']==2)])

In [27]:

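# exploratory: share of vaccinated children who had chickenpox, i.e. had/(had + did not);
# the question asks for the ratio had/did not, which the answer function computes
cbs={"male":0,
    "female":0}
cbs['male']=cpo1_sex1/(cpo1_sex1+cpo2_sex1)
cbs['female']=cpo1_sex2/(cpo1_sex2+cpo2_sex2)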

In [28]:

cbs

Out[28]:

{'female': 0.007731582786287381, 'male': 0.009582863585118376}

answer

In [29]:

def chickenpox_by_sex():
    import pandas as pd
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)

    # keep vaccinated children (at least one varicella dose) with a known
    # chickenpox status (HAD_CPOX: 1 = yes, 2 = no)
    cpo_sex=df[df['P_NUMVRC'].gt(0) & df['HAD_CPOX'].lt(3)].loc[:,['HAD_CPOX','SEX']]
    # SEX: 1 = male, 2 = female
    cpo1_sex1=len(cpo_sex[(cpo_sex['HAD_CPOX']==1) & (cpo_sex['SEX']==1)])
    cpo1_sex2=len(cpo_sex[(cpo_sex['HAD_CPOX']==1) & (cpo_sex['SEX']==2)])
    cpo2_sex1=len(cpo_sex[(cpo_sex['HAD_CPOX']==2) & (cpo_sex['SEX']==1)])
    cpo2_sex2=len(cpo_sex[(cpo_sex['HAD_CPOX']==2) & (cpo_sex['SEX']==2)])

    # ratio of vaccinated children who had chickenpox to those who did not
    cbs={"male":cpo1_sex1/cpo2_sex1,
        "female":cpo1_sex2/cpo2_sex2}
    return cbs

In [30]:

chickenpox_by_sex()

Out[30]:

{'female': 0.0077918259335489565, 'male': 0.009675583380762664}

In [31]:

assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."

Question 4

A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had the chicken pox and the number of chickenpox vaccine doses given (varicella).

Some notes on interpreting the answer. The had_chickenpox_column is either 1 (for yes) or 2 (for no), and the num_chickenpox_vaccine_column is the number of doses a child has been given of the varicella vaccine. A positive correlation (i.e., corr > 0) means that higher values of had_chickenpox_column (which means more no’s) go together with higher values of num_chickenpox_vaccine_column (which means more doses of vaccine). A negative correlation (i.e., corr < 0) indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, pval is the probability of observing, by chance alone, a correlation between had_chickenpox_column and num_chickenpox_vaccine_column at least as strong as the one found. A small pval means that the observed correlation is highly unlikely to have occurred by chance. In this case, pval should be very small (it will end in e-18, indicating a very small number).

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?
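The sign convention described above can be checked on a tiny made-up frame. The values are assumptions for illustration. The sketch uses pandas' `Series.corr`, which computes the Pearson coefficient by default; `scipy.stats.pearsonr`, as used in this notebook, returns the same coefficient together with a p-value.

```python
import pandas as pd

# Hypothetical toy data: 1 = had chickenpox, 2 = did not
toy = pd.DataFrame({
    "had_chickenpox_column":         [1, 1, 2, 2, 2, 2],
    "num_chickenpox_vaccine_column": [0, 1, 1, 2, 2, 3],
})

# Children coded 2 ("no") tend to have more doses here, so corr is positive
corr = toy["had_chickenpox_column"].corr(toy["num_chickenpox_vaccine_column"])

# Recoding so that a larger value means "yes" flips the sign, not the strength
had_yes_is_high = 3 - toy["had_chickenpox_column"]
corr_flipped = had_yes_is_high.corr(toy["num_chickenpox_vaccine_column"])
print(corr, corr_flipped)
```

Because the sign depends entirely on how yes and no are coded, it is worth confirming the coding before reading a direction into the coefficient.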

test code

In [32]:

# !pip install scipy

In [33]:

import scipy.stats as stats
import numpy as np
import pandas as pd

# this is just an example dataframe
# df=pd.DataFrame({"had_chickenpox_column":np.random.randint(1,3,size=(100)),
#                "num_chickenpox_vaccine_column":np.random.randint(0,6,size=(100))})
df = pd.read_csv("assets/NISPUF17.csv", index_col=0)

df=df[df['HAD_CPOX'].lt(3)].loc[:,['HAD_CPOX','P_NUMVRC']].dropna()
df.columns=['had_chickenpox_column','num_chickenpox_vaccine_column']
# here is some stub code to actually run the correlation
corr, pval=stats.pearsonr(df["had_chickenpox_column"],df["num_chickenpox_vaccine_column"])

In [34]:

df

Out[34]:

In [35]:

# df['P_NUMVRC'].unique()
# df['HAD_CPOX'].unique()

In [36]:

corr, pval

Out[36]:

(0.07044873460148118, 2.778026318286582e-18)

answer

In [37]:

def corr_chickenpox():
    import scipy.stats as stats
    import pandas as pd

    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)

    # keep rows with a known chickenpox status (1 = yes, 2 = no) and a recorded dose count
    df=df[df['HAD_CPOX'].lt(3)].loc[:,['HAD_CPOX','P_NUMVRC']].dropna()
    df.columns=['had_chickenpox_column','num_chickenpox_vaccine_column']
    # Pearson correlation coefficient and its p-value
    corr, pval=stats.pearsonr(df["had_chickenpox_column"],df["num_chickenpox_vaccine_column"])

    # just return the correlation
    return corr
corr_chickenpox()

Out[37]:

0.07044873460148118

In [38]:

assert -1<=corr_chickenpox()<=1, "You must return a float number between -1.0 and 1.0."
