GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/lectures/lec10/lec10.ipynb
³⁰⁵⁸ views

Kernel: Python 3 (ipykernel)

In [ ]:

# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
%reload_ext pandas_tutor
%set_pandas_tutor_options {'projectorMode': True}
set_matplotlib_formats("svg")
plt.style.use('ggplot')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame

def merging_animation():
    src="https://docs.google.com/presentation/d/e/2PACX-1vSk2FfJ4K_An_CQwcN_Yu5unpJckOZjVQDFqZ78ZTTMmowUsCQKKVnum0_m6TaiGquQ44E3FiS9g2Y4/embed?start=false&loop=false&delayms=60000"
    width=825
    height=500
    display(IFrame(src, width, height))

Lecture 10 – Grouping with Subgroups, Merging

DSC 10, Fall 2022

Announcements

Lab 3 is due tomorrow at 11:59PM.
Homework 3 is due on Tuesday 10/18 at 11:59PM.
The Midterm Project will be released Wednesday!
- Partners are not required, but strongly encouraged.
- Your partner doesn't have to be from your lecture section.
- Before or after discussion on Monday, we'll host a mixer to help you find a partner! More details soon.
- You must use the pair programming model when working with a partner.
If you have a conflict with your assigned discussion, email TA Dasha ([email protected]) to request to attend another.
EdStem posts:
- If it includes code or your solution, post privately.
- Otherwise, post in the designated thread for the corresponding HW/Lab question.

Agenda

Grouping with subgroups.
Merging.

Grouping with subgroups

DSC 10 student data

In [ ]:

roster = bpd.read_csv('data/roster-anon.csv')
roster

Recall, last class, we extracted the first name of each student in the class.

In [ ]:

def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

In [ ]:

with_first = roster.assign(
    first=roster.get('name').apply(first_name)
)
with_first

How many students named `'Ethan'` are in each section?

We discovered that 'Ethan' is the most popular first name overall.

In [ ]:

first_counts = (with_first.groupby('first').count()
                .sort_values('name', ascending=False)
                .get(['name']))
first_counts

To find the number of 'Ethan's in each lecture section, we can query for only the rows corresponding to 'Ethan's, and then group by 'section'.

In [ ]:

with_first[with_first.get('first') == 'Ethan'].groupby('section').count()

But what if we want to know the number of 'Emily's and 'Yuen's per section, too?

In [ ]:

with_first[with_first.get('first') == 'Emily'].groupby('section').count()

In [ ]:

with_first[with_first.get('first') == 'Yuen'].groupby('section').count()

Is there a way to do this for all first names and sections all at once?

How many students with each first name does each lecture section have?

It seems like grouping would be helpful here, but currently we only know how to group by a single column.
- Right now, we can count the number of students with each first name.
- Separately, we can count the number of students in each lecture section.

Here, we want to somehow group by multiple columns.
- Specifically, we want the number of students with each first name in each lecture section.
- e.g. the number of 'Ethan's in the 1PM section, the number of 'Emily's in the 10AM section.

We can!

In [ ]:

with_first

In [ ]:

with_first.groupby(['section', 'first']).count()

The above DataFrame is telling us, for instance, that there is 1 student with the first name 'Adrian' in the 10AM section.

It is not saying that there is only 1 'Adrian' in the course overall – in fact, there are 2!

In [ ]:

with_first[with_first.get('first') == 'Adrian']

`.groupby` with subgroups

To make subgroups – that is, groups within groups – pass a list of column names to .groupby:

df.groupby([col_1, col_2, ..., col_k])

Group by col_1 first.
Within each group, group by col_2, and so on.
The resulting DataFrame has one row per unique combination of entries in the specified columns.

Notice the index... 🤔

This is called a "MultiIndex".
- The DataFrame is indexed by 'section' and 'first'.
We won't worry about the details of MultiIndexes.
We can use .reset_index() to "flatten" our DataFrame back to normal.

In [ ]:

with_first.groupby(['section', 'first']).count().reset_index()

Does order matter?

In [ ]:

with_first.groupby(['section', 'first']).count().reset_index()

In [ ]:

with_first.groupby(['first', 'section']).count().reset_index()

Answer: Kind of. The order of the rows and columns will be different, but the content will be the same.

Activity

Using counts, find the lecture section with the most 'Ryan's.

In [ ]:

counts = with_first.groupby(['section', 'first']).count().reset_index()
counts

In [ ]:

...

Activity

Using counts, find the longest first name in the class that is shared by at least two students in the same section.

Note: This was an activity in the last class. There, we had to use our shared_first_and_section function; that's not needed here!

In [ ]:

counts

In [ ]:

...

New dataset: Sea temperatures 🌊

This dataset contains the sea surface temperature in La Jolla, on many days ranging from August 22, 1916 to December 31, 2020.

In [ ]:

sea_temp = bpd.read_csv('data/sea_temp.csv')
sea_temp

Concept Check ✅ – Answer at cc.dsc10.com

We want to find the single month (e.g. November 1998) with the highest average 'SURFACE_TEMP'.

Which of the following would help us achieve this goal?

A. sea_temp.groupby('SURFACE_TEMP').mean()

B. sea_temp.groupby('MONTH').mean()

C. sea_temp.groupby(['YEAR', 'MONTH']).mean()

D. sea_temp.groupby(['MONTH', 'DAY']).mean()

E. sea_temp.groupby(['MONTH', 'SURFACE_TEMP']).mean()

In [ ]:

...

Plots of monthly and yearly average surface temperature 📈

In [ ]:

(sea_temp
 .groupby('MONTH') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);

In [ ]:

(sea_temp
 .groupby('YEAR') 
 .mean() 
 .plot(kind='line', y='SURFACE_TEMP')
);

Summary: `.groupby` with subgroups

Pass a list of columns to .groupby to make subgroups.
Use .reset_index() after grouping with subgroups to move the MultiIndex back to the columns.

Merging 🚗

In [ ]:

phones = bpd.DataFrame().assign(
    Model=['iPhone 13', 'iPhone 13 Pro Max', 'Samsung Galaxy Z Flip', 'Pixel 5a'],
    Price=[799, 1099, 999, 449],
    Screen=[6.1, 6.7, 6.7, 6.3]
)

inventory = bpd.DataFrame().assign(
    Handset=['iPhone 13 Pro Max', 'iPhone 13', 'Pixel 5a', 'iPhone 13'],
    Units=[50, 40, 10, 100],
    Store=['Westfield UTC', 'Westfield UTC', 'Fashion Valley', 'Downtown']
)

In [ ]:

# Phones on the market right now
phones

In [ ]:

# Which phones my stores have in stock in the area
inventory

Question: If I sell all of the phones in my inventory, how much will I make in revenue?

If I sell all of the phones in my inventory, how much will I make in revenue?

In [ ]:

phones.merge(inventory, left_on='Model', right_on='Handset')

What just happened!? 🤯

In [ ]:

# Click through the presentation that appears
merging_animation()

`.merge`

Pick a "left" and "right" DataFrame.
Choose a column from each to "merge on".

left_df.merge(
    right_df, 
    left_on=left_column_name,
    right_on=right_column_name
)

left_on and right_on should be column names (they don't have to be the same).
The resulting DataFrame contains a single row for every match between the two columns.
Rows in either DataFrame without a match disappear!

If I sell all of the phones in my inventory, how much will I make in revenue?

In [ ]:

%%pt

# Notice there's no Samsung Galaxy Z Flip in phones_merged
phones_merged = phones.merge(inventory, left_on='Model', right_on='Handset')

In [ ]:

(phones_merged.get('Price') * phones_merged.get('Units')).sum()

Shortcut if column names are the same: `on`

In [ ]:

inventory_relabeled = inventory.assign(Model=inventory.get('Handset')).drop(columns=['Handset'])
inventory_relabeled

In [ ]:

phones.merge(inventory_relabeled, on='Model')

Notice: There's only one column containing phone names now.

Does order matter? 🤔

In [ ]:

%%pt

inventory.merge(phones, left_on='Handset', right_on='Model')

Answer: The order of the rows and columns will be different, but the content will be the same.

What if we want to "merge on" an index?

Instead of using left_on or right_on, use left_index=True or right_index=True.

In [ ]:

phones

In [ ]:

inventory_by_handset = inventory.set_index('Handset')
inventory_by_handset

In [ ]:

phones.merge(inventory_by_handset, left_on='Model', right_index=True)

Activity setup

In [ ]:

nice_weather_cities = bpd.DataFrame().assign(
    city=['La Jolla', 'San Diego', 'Austin', 'Los Angeles'],
    state=['California', 'California', 'Texas', 'California'],
    today_high_temp=['79', '83', '87', '87']
    
)

schools = bpd.DataFrame().assign(
    name=['UCSD', 'University of Chicago', 'University of San Diego','Johns Hopkins University', 'UT Austin', 'SDSU', 'UCLA'], 
    city=['La Jolla', 'Chicago', 'San Diego', 'Baltimore', 'Austin', 'San Diego', 'Los Angeles'],
    state=['California', 'Illinois', 'California', 'Maryland', 'Texas', 'California', 'California'],
    graduation_rate=[0.87, 0.94, 0.78, 0.92, 0.81, 0.83, 0.91 ]
)

Concept Check ✅ – Answer at cc.dsc10.com

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='city')?

A. 4    B. 5    C. 6    D. 7    E. 8

In [ ]:

nice_weather_cities

In [ ]:

schools

In [ ]:

Followup activity

Without writing code, how many rows are in nice_weather_cities.merge(schools, on='state')?

Hint: It's more than you might guess at first!

In [ ]:

%%pt

nice_weather_cities.merge(schools, on='state')

In [ ]:

nice_weather_cities.merge(schools, on='state').shape[0]

Summary, next time

Summary

To create groups within a group, pass a list to .groupby.
- The result has one row for every unique combination of elements in the specified columns.
To combine information from multiple DataFrames, use .merge.
- When using .merge, Python searches for a match between a specified column in each DataFrame and combines the rows with a match.
- If there are no matches, the row disappears!

Next time

If-statements, to execute code only when certain conditions are met.
For-loops, to repeat code many times.
Both are foundational programming tools. 🛠

Lecture 10 – Grouping with Subgroups, Merging

DSC 10, Fall 2022

Announcements

Agenda

Grouping with subgroups

DSC 10 student data

How many students named `'Ethan'` are in each section?

How many students with each first name does each lecture section have?

`.groupby` with subgroups

Notice the index... 🤔

Does order matter?

Activity

Activity

New dataset: Sea temperatures 🌊

Concept Check ✅ – Answer at cc.dsc10.com

Plots of monthly and yearly average surface temperature 📈

Summary: `.groupby` with subgroups

Merging 🚗

If I sell all of the phones in my inventory, how much will I make in revenue?

What just happened!? 🤯

`.merge`

If I sell all of the phones in my inventory, how much will I make in revenue?

Shortcut if column names are the same: `on`

Does order matter? 🤔

What if we want to "merge on" an index?

Activity setup

Concept Check ✅ – Answer at cc.dsc10.com

Followup activity

Summary, next time

Summary

Next time

Product

Resources

Company

Lecture 10 – Grouping with Subgroups, Merging

DSC 10, Fall 2022

Announcements

Agenda

Grouping with subgroups

DSC 10 student data

How many students named 'Ethan' are in each section?

How many students with each first name does each lecture section have?

.groupby with subgroups

Notice the index... 🤔

Does order matter?

Activity

Activity

New dataset: Sea temperatures 🌊

Concept Check ✅ – Answer at cc.dsc10.com

Plots of monthly and yearly average surface temperature 📈

Summary: .groupby with subgroups

Merging 🚗

If I sell all of the phones in my inventory, how much will I make in revenue?

What just happened!? 🤯

.merge

If I sell all of the phones in my inventory, how much will I make in revenue?

Shortcut if column names are the same: on

Does order matter? 🤔

What if we want to "merge on" an index?

Activity setup

Concept Check ✅ – Answer at cc.dsc10.com

Followup activity

Summary, next time

Summary

Next time

How many students named `'Ethan'` are in each section?

`.groupby` with subgroups

Summary: `.groupby` with subgroups

`.merge`

Shortcut if column names are the same: `on`