Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
dsc-courses
GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/lectures/lec06/lec06.ipynb
3058 views
Kernel: Python 3 (ipykernel)
# Set up packages for lecture. Don't worry about understanding this code, but # make sure to run it if you're following along. import numpy as np import babypandas as bpd %reload_ext pandas_tutor %set_pandas_tutor_options {'projectorMode': True} import matplotlib.pyplot as plt from matplotlib_inline.backend_inline import set_matplotlib_formats set_matplotlib_formats("svg") plt.style.use('ggplot')

Lecture 6 – More Queries and GroupBy

DSC 10, Fall 2022

Announcements

  • Lab 2 is due on Saturday 10/8 at 11:59pm.

  • Homework 2 is due on Tuesday 10/11 at 11:59pm.

  • Avoid submission errors. You are responsible for submitting your work in the right format to Gradescope by the deadline.

  • Discussion solutions are posted on Monday evenings to practice.dsc10.com.

Agenda

  • Recap: queries.

  • Queries with multiple conditions.

  • GroupBy.

  • Extra practice, including challenge problems.

Resources:

About the Data: Get It Done service requests πŸ‘·

The requests DataFrame contains a summary of all service requests so far in 2022, broken down by neighborhood and service.

requests = bpd.read_csv('data/get-it-done-requests.csv') requests = requests.assign(total=requests.get('closed') + requests.get('open')) requests

Recap: queries

What is a query? πŸ€”

  • A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.

  • We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

How do we query a DataFrame?

To select only certain rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.

  2. Then pass it into requests[sequence_goes_here].

Element-wise comparisons

There are several types of comparisons we can make.

symbolmeaning
==equal to
!=not equal to
<less than
<=less than or equal to
>greater than
>=greater than or equal to

Example 6: Which neighborhood has the most 'Weed Cleanup' requests?

Key concept: Selecting rows (via Boolean indexing).

Strategy

  1. Query to extract a DataFrame of just the 'Weed Cleanup' requests.

  2. Sort by 'total' in descending order.

  3. Extract the first element from the 'neighborhood' column.

weed_cleanup_only = requests[requests.get('service') == 'Weed Cleanup'] weed_cleanup_only
weed_cleanup_sorted = weed_cleanup_only.sort_values(by='total', ascending=False) weed_cleanup_sorted
weed_cleanup_sorted.get('neighborhood').iloc[0]

What if the condition isn't satisfied?

requests[requests.get('service') == 'Lime Cleanup']

Concept Check βœ… – Answer at cc.dsc10.com

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

...

Activity 🚘

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

...

Example 7: How many service requests were for 'Pothole' or 'Pavement Maintenance'?

Key concept: Queries with multiple conditions.

Multiple conditions

  • To write a query with multiple conditions, use & for "and" and | for "or".

  • You must use (parentheses) around each condition!

  • ⚠️ Don't use the Python keywords and and or here! They do not behave as you'd want.

requests[(requests.get('service') == 'Pothole') | (requests.get('service') == 'Pavement Maintenance')]
# You can add line breaks within brackets or parentheses requests[(requests.get('service') == 'Pothole') | (requests.get('service') == 'Pavement Maintenance')]

The & and | operators work element-wise

(requests.get('service') == 'Pothole')
(requests.get('service') == 'Pavement Maintenance')
(requests.get('service') == 'Pothole') | (requests.get('service') == 'Pavement Maintenance')

Original Question: How many service requests were for 'Pothole' or 'Pavement Maintenance'?

requests[(requests.get('service') == 'Pothole') | (requests.get('service') == 'Pavement Maintenance')].get('total').sum()

Concept Check βœ… – Answer at cc.dsc10.com

Each of the following questions can be answered by querying the requests DataFrame.

  1. Which neighborhood had the most 'Street Flooded' requests?

  2. In the 'Kearny Mesa' neighborhood, how many different types of services have open requests?

  3. How many requests have been closed in the 'La Jolla' neighborhood?

How many of the questions above require the query to have multiple conditions?

A. 0 Β Β Β Β Β Β Β Β Β Β Β Β  B. 1 Β Β Β Β Β Β Β Β Β Β Β Β  C. 2 Β Β Β Β Β Β Β Β Β Β Β Β  D. 3

Bonus: Try to write the code to answer each question.

...

Example 8: Which neighborhood had the most Get It Done requests?

Key concept: Grouping by one column.

Organizing requests by neighborhood

  • We can find the total number of Get It Done requests for any one neighborhood.

    • For example, requests[requests.get('neighborhood') == 'Carmel Valley'].get('total').sum().

  • But how can we find the total requests for every neighborhood at the same time?

requests[requests.get('neighborhood') == 'Carmel Valley'].get('total').sum()
requests[requests.get('neighborhood') == 'Torrey Hills'].get('total').sum()

It seems like there has to be a better way. And there is!

GroupBy: Split, aggregate, and combine

Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.

requests.groupby('neighborhood').sum()

Note that the 'total' counts for Carmel Valley and Torrey Hills are the same as we saw on the previous slide. What just happened? 🀯

An illustrative example: Pets 🐱 🐢🐹

Consider the DataFrame pets containing pet species, colors, and weights.

pets = bpd.DataFrame().assign( Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'], Color=['black', 'golden', 'black', 'white', 'golden', 'golden'], Weight=[40, 15, 20, 80, 25, 1], Age=[5, 8, 9, 2, 0.5, 3] ) pets

Visualizing pets.groupby('Species').mean()

  1. Split the rows of pets into "groups" according to their values in the 'Species' column.

  2. Aggregate the rows with the same value of 'Species' by taking the mean of all numerical columns.

  3. Combine these means into a new DataFrame that is indexed by 'Species' and sorted by 'Species' in ascending order.

Note that the result contains just one row for cats, one row for dogs, and one row for hamsters!

%%pt pets.groupby('Species').mean()

Pandas Tutor

  • In the last cell, we saw not just the output of the code, but a visualization of the inner workings of the code.

  • This is thanks to Pandas Tutor, a new tool developed by Sam Lau, who taught this course over the summer.

  • Pandas Tutor draws diagrams to explain pandas (and babypandas) code.

  • Add %%pt to the top of a code cell to explain the last line of babypandas code.

    • This requires Pandas Tutor to be imported, which we already did in this notebook.

  • You can also use Pandas Tutor through its website, pandastutor.com.

# Without Pandas Tutor pets.groupby('Species').mean()
%%pt # With Pandas Tutor pets.groupby('Species').mean()

Back to Get It Done service requests πŸ‘·

requests
requests.groupby('neighborhood').sum()

Using .groupby in general

In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.

  1. Choose a column to group by.

    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).

    • On the previous slide, we grouped by 'neighborhood'.

    • In the resulting DataFrame, there was one row for every unique value of 'neighborhood'.

  2. Choose an aggregation method.

    • The aggregation method will be applied within each group.

    • On the previous slide, we applied the .sum() method to every 'neighborhood'.

    • The aggregation method is applied individually to each column (e.g. the sums were computed separately for 'closed', 'open', and 'total').

      • If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.

    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observation #1

  • The index has changed to neighborhood names.

  • In general, the new row labels are the group labels (i.e., the unique values in the column that we grouped on).

requests
requests.groupby('neighborhood').sum()

Observation #2

The 'service' column has disappeared. Why?

requests
requests.groupby('neighborhood').sum()

Disappearing columns βœ¨πŸ‡πŸŽ©

  • The aggregation method – .sum(), in this case – is applied to each column.

  • If it doesn't make sense to apply it to a particular column, that column will disappear.

  • For instance, we can't sum strings, like in the 'service' column.

  • However, we can compute the max of several strings. How?

# Can you guess how the max position is determined? requests.groupby('neighborhood').max()

Observation #3

  • The aggregation method is applied to each column separately.

  • The rows of the resulting DataFrame need to be interpreted with care.

requests.groupby('neighborhood').max()
  • Have there been 1703 'Weed Cleanup' requests in Balboa Park so far this year?

  • Why isn't the 'total' column equal to the sum of the 'closed' and 'open' columns, as it originally was?

Activity

Write a line of code that evaluates to the service that has been requested in the fewest number of neighborhoods.

Strategy:

  1. Choose a column to group by.

  2. Choose an aggregation method. Some common ones are .count(), .sum(), .mean(), .median(), .max(), and .min().

  3. Sort the resulting DataFrame and extract the relevant piece of information (the service name).

...

Note: On an assignment, in questions like this with multiple correct answers (a tie for the fewest), you can put any one of them. We'll accept any correct answer.

Observation #4

  • The column names of the output of .groupby don't make sense when using the .count() aggregation method.

  • Consider dropping unneeded columns and renaming columns as follows:

    1. Use .assign to create a new column containing the same values as the old column(s).

    2. Use .drop(columns=list_of_column_labels) to drop the old column(s).

num_neighborhoods = requests.groupby('service').count() num_neighborhoods
num_neighborhoods = num_neighborhoods.assign( neighborhoods_requesting=num_neighborhoods.get('total') ).drop(columns=['neighborhood', 'closed', 'open', 'total']) num_neighborhoods

More practice: IMDb dataset 🎞️

imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating') imdb

Question: How many movies appear from each decade?

imdb.groupby('Decade').count()
# We'll learn how to make plots like this in the next lecture! imdb.groupby('Decade').count().plot(y='Year');

Question: What was the highest rated movie of the 1990s?

Let's try to do this two different ways.

Without grouping

%%pt imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.

With grouping

%%pt imdb.reset_index().groupby('Decade').max()
  • It turns out that this method does not yield the correct answer.

  • When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.

  • While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.

  • Taking the max is not helpful here.

Challenge problems

We won't cover these problems in class, but they're here for you to practice with some harder examples. To access the solutions, you'll need to watch this solution walkthrough video (start at 10:00).

Before watching the video, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams!

Question: How many years have more than 3 movies rated above 8.5?

Aside: Using .sum() on a Boolean array/Series

  • Summing a Boolean array/Series gives a count of the number of True elements. This is because Python treats True as 1 and False as 0.

  • Can you use that fact here?

Question: Out of the years with more than 3 movies, which had the highest average rating?

Question: Which year had the longest movie titles, on average?

Hint: Use .str.len() on the column or index that contains the names of the movies.

Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?

Summary, next time

Summary

  • We can write queries that involve multiple conditions, as long as we:

    • Put parentheses around all conditions.

    • Separate conditions using & if you require all to be true, or | if you require at least one to be true.

  • The method call df.groupby(column_name).agg_method() aggregates all rows with the same value for column_name into a single row in the resulting DataFrame, using agg_method() to combine values.

    • Aggregation methods to know: .count(), .sum(), .mean(), .median(), .max(), and .min().

Next time

A picture is worth a 1000 words – it's time to visualize!