GitHub Repository: ycchen00/Introduction-to-Data-Science-in-Python
Path: blob/main/quiz/quiz3.ipynb

Kernel: Python 3

Q1

Consider the two DataFrames shown below, both of which have Name as the index. Which of the following expressions can be used to get the data of all students (from student_df) including their roles as staff, where nan denotes no role? MergingDataFrame_ed

In [1]:

import numpy as np
import pandas as pd

In [2]:

# First we create two DataFrames, staff and students.
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
# And let's index these staff by name
staff_df = staff_df.set_index('Name')
# Now we'll create a student dataframe
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
# And we'll index this by name too
student_df = student_df.set_index('Name')

In [3]:

staff_df

Out[3]:

In [4]:

student_df

Out[4]:

In [5]:

pd.merge(student_df, staff_df, how='right', left_index=True, right_index=True)

Out[5]:

In [6]:

# Correct
pd.merge(student_df, staff_df, how='left', left_index=True, right_index=True)

Out[6]:

In [7]:

pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Out[7]:

In [8]:

# pd.merge(staff_df, student_df, how='right', left_index=False, right_index=True)
print('Wrong! : Must pass left_on or left_index=True')

Out[8]:

Wrong! : Must pass left_on or left_index=True
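As a rough check on the options above, the sketch below rebuilds the two frames from cell 2 and compares the left and right merges. The variable names `left` and `right` are introduced here for illustration only.

```python
import pandas as pd

staff_df = pd.DataFrame(
    [{'Name': 'Kelly', 'Role': 'Director of HR'},
     {'Name': 'Sally', 'Role': 'Course liasion'},
     {'Name': 'James', 'Role': 'Grader'}]).set_index('Name')
student_df = pd.DataFrame(
    [{'Name': 'James', 'School': 'Business'},
     {'Name': 'Mike', 'School': 'Law'},
     {'Name': 'Sally', 'School': 'Engineering'}]).set_index('Name')

# how='left' keeps every row of the left frame (the students).
left = pd.merge(student_df, staff_df, how='left',
                left_index=True, right_index=True)
# how='right' keeps every row of the right frame (the staff) instead.
right = pd.merge(student_df, staff_df, how='right',
                 left_index=True, right_index=True)

print(left)
print(right)
```

Mike is a student with no staff role, so he appears only in the left merge, with a missing `Role`. Kelly is staff but not a student, so she appears only in the right merge.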

Q2

Consider a DataFrame named df with columns named P2010, P2011, P2012, P2013, P2014 and P2015 containing float values. We want to use the apply method to get a new DataFrame named result_df with a new column AVG. The AVG column should average the float values across P2010 to P2015. The apply method should also remove the 6 original columns (P2010 to P2015). For that, what should be the value of x and y in the given code? PandasIdioms_ed

In [9]:

# Load the census data, shorten the population-estimate column names,
# drop rows with missing values, and keep only the six year columns.
df = (pd.read_csv('../resources/week-3/datasets/census.csv')
        .rename(columns={
            'POPESTIMATE2010': 'P2010',
            'POPESTIMATE2011': 'P2011',
            'POPESTIMATE2012': 'P2012',
            'POPESTIMATE2013': 'P2013',
            'POPESTIMATE2014': 'P2014',
            'POPESTIMATE2015': 'P2015'})
        .dropna()
        [['P2010', 'P2011', 'P2012', 'P2013', 'P2014', 'P2015']])

df.head()

Out[9]:

In [10]:

# axis=1 is equivalent to axis='columns'
x=1
y=1

frames = ['P2010', 'P2011', 'P2012', 'P2013','P2014', 'P2015']
df['AVG'] = df[frames].apply(lambda z: np.mean(z), axis=x)
result_df = df.drop(frames,axis=y)

result_df.head()

Out[10]:
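The census CSV is not reproduced here, so the sketch below uses a small made-up frame (an assumption, not the real data) to show why both `x` and `y` are 1. The name `per_column` is introduced only for the contrast with `axis=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the census data: two rows, six year columns.
frames = ['P2010', 'P2011', 'P2012', 'P2013', 'P2014', 'P2015']
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]], columns=frames)

# axis=1: the function receives one row at a time, so the mean runs across columns.
df['AVG'] = df[frames].apply(np.mean, axis=1)
# axis=1 again: the labels in `frames` are looked up among the columns.
result_df = df.drop(frames, axis=1)

# With axis=0 the same apply would yield one mean per column, not per row.
per_column = df[frames].apply(np.mean, axis=0)

print(result_df)
```

The result has one `AVG` value per row, and the six year columns are gone.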

Q3

Consider the DataFrame df below, instantiated with a list of grades, ordered from best grade to worst. Which of the following options can be used to substitute X in the code given below, if we want to get all the grades between 'A' and 'B' where 'A' is better than 'B'? Scales

In [11]:

import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'], index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'], columns = ['Grades'])
df

Out[11]:

In [12]:

# Correct
my_categories= pd.CategoricalDtype(categories=['D','D+','C-','C','C+','B-','B','B+','A-','A','A+'], ordered=True)

In [13]:

# my_categories= pd.CategoricalDtype(categories=['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'])
print('TypeError: Unordered Categoricals can only compare equality or not')

Out[13]:

TypeError: Unordered Categoricals can only compare equality or not

In [14]:

# my_categories= pd.CategoricalDtype(categories=['D','D+','C-','C','C+','B-','B','B+','A-','A','A+'])
print('TypeError: Unordered Categoricals can only compare equality or not')

Out[14]:

TypeError: Unordered Categoricals can only compare equality or not

In [15]:

# my_categories= (['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],ordered=True)
print('SyntaxError: invalid syntax')

Out[15]:

SyntaxError: invalid syntax

In [16]:

grades = df['Grades'].astype(my_categories)
result = grades[(grades>'B') & (grades<'A')]
result

Out[16]:

excellent    A-
good         B+
Name: Grades, dtype: category
Categories (11, object): ['D' < 'D+' < 'C-' < 'C' ... 'B+' < 'A-' < 'A' < 'A+']
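The difference between the ordered and unordered options can be reproduced on a shorter grade list. This is a minimal sketch: the five-grade series and the names `ordered`, `unordered`, `between` and `raised` are illustrative, not part of the quiz.

```python
import pandas as pd

grades = pd.Series(['A+', 'A', 'A-', 'B+', 'B'])
order = ['B', 'B+', 'A-', 'A', 'A+']  # listed from worst to best

ordered = grades.astype(pd.CategoricalDtype(categories=order, ordered=True))
unordered = grades.astype(pd.CategoricalDtype(categories=order))

# Ordered categories compare by position in `order`, not alphabetically.
between = ordered[(ordered > 'B') & (ordered < 'A')]
print(between.tolist())

# Without ordered=True, the same comparison raises TypeError.
try:
    unordered > 'B'
    raised = False
except TypeError as exc:
    raised = True
    print(exc)
```

The categories must be listed from worst to best for `>` to mean "better than", which is why the reversed list would not work even with `ordered=True`.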

Q4

Consider the DataFrame df shown in the image below. Which of the following can return the head of the pivot table as shown in the image below df? PivotTable_ed

In [17]:

df = pd.read_csv('../resources/week-3/datasets/cwurData.csv')#[['world_rank','institution','country']]
def create_category(ranking):
    # The rank is just an integer, so a chain of range checks is enough
    if 1 <= ranking <= 100:
        return "First Tier Top Unversity"
    elif 101 <= ranking <= 200:
        return "Second Tier Top Unversity"
    elif 201 <= ranking <= 300:
        return "Third Tier Top Unversity"
    return "Other Top Unversity"

# Now we can apply this to a single column of data to create a new series
df['Rank_Level'] = df['world_rank'].apply(create_category)
df.head()

Out[17]:

In [18]:

df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.median]).head()

Out[18]:

In [19]:

df.pivot_table(values='score', index='Rank_Level', columns='country', aggfunc=[np.median]).head()

Out[19]:

In [20]:

df.pivot_table(values='score', index='Rank_Level', columns='country', aggfunc=[np.median], margins=True).head()

Out[20]:

In [21]:

# Correct
df.pivot_table(values='score', index='country', columns='Rank_Level', aggfunc=[np.median], margins=True).head()

Out[21]:
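To see what `index`, `columns` and `margins=True` each contribute, here is a sketch on a tiny invented table (the rows and tier labels are assumptions, not the CWUR data). It passes the string `'median'` rather than `[np.median]`; a list would add an extra level to the column labels.

```python
import pandas as pd

# Hypothetical miniature of the rankings data.
df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'UK', 'UK'],
    'Rank_Level': ['First', 'First', 'Second', 'First', 'Second'],
    'score': [90.0, 80.0, 60.0, 70.0, 50.0],
})

# index= becomes the rows, columns= becomes the column headers,
# and margins=True appends an 'All' row and an 'All' column.
table = df.pivot_table(values='score', index='country',
                       columns='Rank_Level', aggfunc='median', margins=True)
print(table)
```

Swapping `index` and `columns` transposes the layout, and dropping `margins=True` removes the `All` row and column, which is how the four options differ.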

Q5

Assume that the date '11/29/2019' in MM/DD/YYYY format is the 4th day of the week, what will be the result of the following? DateFunctionality_ed

In [22]:

import pandas as pd
(pd.Timestamp('11/29/2019') + pd.offsets.MonthEnd()).weekday()

Out[22]:

5
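The result can be traced step by step. In this sketch the names `ts`, `month_end` and `rolled` are illustrative; the last step is an extra case, not part of the question.

```python
import pandas as pd

ts = pd.Timestamp('11/29/2019')
# weekday() counts from Monday=0, so a Friday is 4.
print(ts.weekday())

# MonthEnd() moves forward to the next month-end date.
month_end = ts + pd.offsets.MonthEnd()
print(month_end, month_end.weekday())

# A date already on a month end rolls to the following month's end.
rolled = pd.Timestamp('11/30/2019') + pd.offsets.MonthEnd()
print(rolled)
```

November 29 moves forward by one day to November 30, so the weekday goes from 4 to 5.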

Q6

Consider a DataFrame df. We want to create groups based on the column group_key in the DataFrame and fill the nan values with group means using:

filling_mean = lambda g: g.fillna(g.mean())

Which of the following is correct for performing this task? GroupBy_ed

In [23]:

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, None, 2017],
            'Points': [876, 789, 863, None, 741, 812, None, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
df

Out[23]:

In [24]:

filling_mean = lambda g: g.fillna(g.mean())
group_key='Team'

In [25]:

# df.groupby(group_key).aggregate(filling_mean)
print('ValueError: Shape of passed values is (4, 5), indices imply (3, 5)')

Out[25]:

ValueError: Shape of passed values is (4, 5), indices imply (3, 5)

In [26]:

# df.groupby(group_key).filling_mean()
print("AttributeError: 'DataFrameGroupBy' object has no attribute 'filling_mean'")

Out[26]:

AttributeError: 'DataFrameGroupBy' object has no attribute 'filling_mean'

In [27]:

df.groupby(group_key).transform(filling_mean)

Out[27]:

In [28]:

# Correct
df.groupby(group_key).apply(filling_mean)

Out[28]:
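The group-mean fill itself can be checked on a small invented frame. The sketch below is a variation on the cells above, not the quiz's answer. It selects the numeric `Points` column before grouping and uses `transform`, because on recent pandas versions `g.mean()` may fail when a group still contains a string column. The data and the name `filled` are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the team data, with two missing scores.
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Devils'],
    'Points': [800.0, None, 700.0, 600.0, None],
})

filling_mean = lambda g: g.fillna(g.mean())

# Selecting the numeric column first keeps the string key out of g.mean().
filled = df.groupby('Team')['Points'].transform(filling_mean)
print(df.assign(Points=filled))
```

Each missing value is replaced by the mean of its own team: 800 for Riders and 650 for Devils.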

Q7

Consider the DataFrames below, both of which have a standard integer-based index. Which of the following can be used to get the data of all students (from student_df) and merge it with their staff roles, where nan denotes no role? MergingDataFrame_ed

In [29]:

staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 
                          'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 
                          'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 
                          'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 
                            'School': 'Engineering'}])

In [30]:

student_df

Out[30]:

In [31]:

staff_df

Out[31]:

In [32]:

pd.merge(staff_df, student_df, how='outer', on=['First Name','Last Name'])

Out[32]:

In [33]:

pd.merge(student_df, staff_df, how='inner', on=['First Name','Last Name'])

Out[33]:

In [34]:

# Correct
pd.merge(staff_df, student_df, how='right', on=['First Name','Last Name'])

Out[34]:

In [35]:

pd.merge(student_df, staff_df, how='right', on=['First Name','Last Name'])

Out[35]:
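With two key columns, a row matches only when both the first and last name agree. The sketch below rebuilds the frames from cell 29 and checks this. The names `keys`, `right_of_staff`, `left_of_student` and `roles` are introduced here for illustration.

```python
import pandas as pd

staff_df = pd.DataFrame([
    {'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
    {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
    {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([
    {'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
    {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
    {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
keys = ['First Name', 'Last Name']

# how='right' keeps every row of the right frame, here the students.
right_of_staff = pd.merge(staff_df, student_df, how='right', on=keys)
# Swapping the arguments and using how='left' keeps the same rows.
left_of_student = pd.merge(student_df, staff_df, how='left', on=keys)

roles = right_of_staff.set_index(keys)['Role']
print(roles)
```

James Hammond (student) and James Wilde (staff) share only a first name, so they do not match and the student's `Role` is missing. Sally Brooks matches on both keys.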

Q8

Consider a DataFrame df with columns name, reviews_per_month, and review_scores_value. This DataFrame also consists of several missing values. Which of the following can be used to: i) calculate the number of entries in the name column, and ii) calculate the mean and standard deviation of the reviews_per_month, grouping by different review_scores_value? GroupBy_ed

In [36]:

df=pd.read_csv("../resources/week-3/datasets/listings.csv")[['name', 'reviews_per_month', 'review_scores_value']]
df.head()

Out[36]:

In [37]:

df.agg({'name':len,'reviews_per_month':(np.mean,np.std)})

Out[37]:

In [38]:

df.agg({'name':len,'reviews_per_month':(np.nanmean,np.nanstd)})

Out[38]:

In [39]:

df.groupby('review_scores_value').agg({'name':len,'reviews_per_month':(np.nanmean,np.nanstd)})

Out[39]:

In [40]:

df.groupby('review_scores_value').agg({'name':len,'reviews_per_month':(np.mean,np.std)})

Out[40]:
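How missing values affect each statistic is easier to see on a small invented frame. This sketch swaps in pandas named aggregation with string function names instead of the dict form used above. Here `'size'` counts rows the way `len` does, while `'count'` skips missing values. The data and the output column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings data, with missing values.
df = pd.DataFrame({
    'name': ['a', None, 'c', 'd', 'e'],
    'reviews_per_month': [1.0, 3.0, 2.0, np.nan, 4.0],
    'review_scores_value': [9.0, 9.0, 10.0, 10.0, 10.0],
})

summary = df.groupby('review_scores_value').agg(
    n_rows=('name', 'size'),                 # rows per group, missing names included
    n_names=('name', 'count'),               # non-missing names only
    mean_rpm=('reviews_per_month', 'mean'),  # NaN skipped by default
    std_rpm=('reviews_per_month', 'std'),    # sample standard deviation (ddof=1)
)
print(summary)
```

Without the `groupby`, the same aggregation would return one summary for the whole frame rather than one row per `review_scores_value`.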

Q9

What will be the result of the following code?: DateFunctionality_ed

In [41]:

import pandas as pd
pd.Period('01/12/2019', 'M') + 5

Out[41]:

Period('2019-06', 'M')
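Adding an integer to a `Period` shifts it by that many units of its frequency. The sketch below contrasts monthly and daily frequency for the same date string; the names `p` and `d` are illustrative.

```python
import pandas as pd

# The string is parsed month-first (January 12, 2019), then truncated to the month.
p = pd.Period('01/12/2019', 'M')
print(p, p + 5)

# With daily frequency the same +5 means five days.
d = pd.Period('01/12/2019', 'D')
print(d, d + 5)
```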

Q10

Which of the following is not a valid expression to create a Pandas GroupBy object from the DataFrame shown below? GroupBy_ed

In [42]:

df = pd.DataFrame([{'class': 'fruit', 'avg calories per unit': '95'},
                   {'class': 'fruit', 'avg calories per unit': '202'},
                   {'class': 'vegetable', 'avg calories per unit': '164'},
                   {'class': 'vegetable', 'avg calories per unit': None},
                   {'class': 'vegetable', 'avg calories per unit': '207'}],
                  index=['apple', 'mango', 'potato', 'onion', 'broccoli'])
df

Out[42]:

In [43]:

grouped = df.groupby(['class','avg calories per unit'])
# print(grouped)
# grouped.head()
for group, frame in grouped:
    print(group)

Out[43]:

('fruit', '202')
('fruit', '95')
('vegetable', '164')
('vegetable', '207')

In [44]:

grouped = df.groupby('class')
# grouped.head()
for group, frame in grouped:
    print(group)

Out[44]:

fruit
vegetable

In [45]:

grouped = df.groupby('class',axis=0)
# grouped.head()
for group, frame in grouped:
    print(group)

Out[45]:

fruit
vegetable

In [46]:

df.groupby('vegetable')

Out[46]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-a901e4cdd01b> in <module>()
----> 1 df.groupby('vegetable')

c:\users\syy\appdata\local\programs\python\python36-32\lib\site-packages\pandas\core\frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   6523             squeeze=squeeze,
   6524             observed=observed,
-> 6525             dropna=dropna,
   6526         )
   6527 
c:\users\syy\appdata\local\programs\python\python36-32\lib\site-packages\pandas\core\groupby\groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    531                 observed=observed,
    532                 mutated=self.mutated,
--> 533                 dropna=self.dropna,
    534             )
    535 
c:\users\syy\appdata\local\programs\python\python36-32\lib\site-packages\pandas\core\groupby\grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    784                 in_axis, name, level, gpr = False, None, gpr, None
    785             else:
--> 786                 raise KeyError(gpr)
    787         elif isinstance(gpr, Grouper) and gpr.key is not None:
    788             # Add key to exclusions
KeyError: 'vegetable'
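The `KeyError` arises because `groupby` expects a column label (or index level), not a value stored in a column. A minimal sketch on a three-row version of the frame follows. The names `raised`, `vegetables` and `sizes` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'class': ['fruit', 'fruit', 'vegetable'],
                   'avg calories per unit': ['95', '202', '164']},
                  index=['apple', 'mango', 'potato'])

# 'vegetable' is a value in the 'class' column, not a column label or index level.
try:
    df.groupby('vegetable')
    raised = False
except KeyError as exc:
    raised = True
    print('KeyError:', exc)

# To restrict to one class, filter rows; to split by class, group on the column.
vegetables = df[df['class'] == 'vegetable']
sizes = df.groupby('class').size()
print(sizes)
```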
