Lecture 6 β More Queries and GroupBy
DSC 10, Fall 2022
Announcements
Lab 2 is due on Saturday 10/8 at 11:59pm.
Homework 2 is due on Tuesday 10/11 at 11:59pm.
Avoid submission errors. You are responsible for submitting your work in the right format to Gradescope by the deadline.
If you're unable to submit on time, take a look at our Deadlines and Slip Days policy.
Discussion solutions are posted on Monday evenings to practice.dsc10.com.
Agenda
Recap: queries.
Queries with multiple conditions.
GroupBy.
Extra practice, including challenge problems.
Resources:
About the Data: Get It Done service requests π·
The requests DataFrame contains a summary of all service requests so far in 2022, broken down by neighborhood and service.
Recap: queries
What is a query? π€
A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.
How do we query a DataFrame?
To select only certain rows of requests:
Make a sequence (list/array/Series) of
Trues (keep) andFalses (toss), usually by making a comparison.Then pass it into
requests[sequence_goes_here].
Element-wise comparisons
There are several types of comparisons we can make.
| symbol | meaning |
|---|---|
== | equal to |
!= | not equal to |
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
Example 6: Which neighborhood has the most 'Weed Cleanup' requests?
Key concept: Selecting rows (via Boolean indexing).
Strategy
Query to extract a DataFrame of just the
'Weed Cleanup'requests.Sort by
'total'in descending order.Extract the first element from the
'neighborhood'column.
What if the condition isn't satisfied?
Concept Check β β Answer at cc.dsc10.com
Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?
A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()
B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']
C. requests['Downtown'].get('total').sum()
D. More than one of the above.
Activity π
Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?
Write one line of code that evaluates to the answer.
Example 7: How many service requests were for 'Pothole' or 'Pavement Maintenance'?
Key concept: Queries with multiple conditions.
Multiple conditions
To write a query with multiple conditions, use
&for "and" and|for "or".You must use
(parentheses)around each condition!β οΈ Don't use the Python keywords
andandorhere! They do not behave as you'd want.See BPD 10.3 for an explanation.
The & and | operators work element-wise
Original Question: How many service requests were for 'Pothole' or 'Pavement Maintenance'?
Concept Check β β Answer at cc.dsc10.com
Each of the following questions can be answered by querying the requests DataFrame.
Which neighborhood had the most
'Street Flooded'requests?In the
'Kearny Mesa'neighborhood, how many different types of services have open requests?How many requests have been closed in the
'La Jolla'neighborhood?
How many of the questions above require the query to have multiple conditions?
A. 0 Β Β Β Β Β Β Β Β Β Β Β Β B. 1 Β Β Β Β Β Β Β Β Β Β Β Β C. 2 Β Β Β Β Β Β Β Β Β Β Β Β D. 3
Bonus: Try to write the code to answer each question.
Example 8: Which neighborhood had the most Get It Done requests?
Key concept: Grouping by one column.
Organizing requests by neighborhood
We can find the total number of Get It Done requests for any one neighborhood.
For example,
requests[requests.get('neighborhood') == 'Carmel Valley'].get('total').sum().
But how can we find the total requests for every neighborhood at the same time?
It seems like there has to be a better way. And there is!
GroupBy: Split, aggregate, and combine
Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.
Note that the 'total' counts for Carmel Valley and Torrey Hills are the same as we saw on the previous slide. What just happened? π€―
An illustrative example: Pets π± πΆπΉ
Consider the DataFrame pets containing pet species, colors, and weights.
Visualizing pets.groupby('Species').mean()
Split the rows of
petsinto "groups" according to their values in the'Species'column.Aggregate the rows with the same value of
'Species'by taking themeanof all numerical columns.Combine these means into a new DataFrame that is indexed by
'Species'and sorted by'Species'in ascending order.
Note that the result contains just one row for cats, one row for dogs, and one row for hamsters!
Pandas Tutor
In the last cell, we saw not just the output of the code, but a visualization of the inner workings of the code.
This is thanks to Pandas Tutor, a new tool developed by Sam Lau, who taught this course over the summer.
Pandas Tutor draws diagrams to explain
pandas(andbabypandas) code.Add
%%ptto the top of a code cell to explain the last line ofbabypandascode.This requires Pandas Tutor to be imported, which we already did in this notebook.
You can also use Pandas Tutor through its website, pandastutor.com.
Back to Get It Done service requests π·
Using .groupby in general
In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.
Choose a column to group by.
.groupby(column_name)will gather rows which have the same value in the specified column (column_name).On the previous slide, we grouped by
'neighborhood'.In the resulting DataFrame, there was one row for every unique value of
'neighborhood'.
Choose an aggregation method.
The aggregation method will be applied within each group.
On the previous slide, we applied the
.sum()method to every'neighborhood'.The aggregation method is applied individually to each column (e.g. the sums were computed separately for
'closed','open', and'total').If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output β we'll look at this in more detail shortly.
Common aggregation methods include
.count(),.sum(),.mean(),.median(),.max(), and.min().
Observation #1
The index has changed to neighborhood names.
In general, the new row labels are the group labels (i.e., the unique values in the column that we grouped on).
Observation #2
The 'service' column has disappeared. Why?
Disappearing columns β¨ππ©
The aggregation method β
.sum(), in this case β is applied to each column.If it doesn't make sense to apply it to a particular column, that column will disappear.
For instance, we can't sum strings, like in the
'service'column.However, we can compute the max of several strings. How?
Observation #3
The aggregation method is applied to each column separately.
The rows of the resulting DataFrame need to be interpreted with care.
Have there been 1703
'Weed Cleanup'requests in Balboa Park so far this year?Why isn't the
'total'column equal to the sum of the'closed'and'open'columns, as it originally was?
Activity
Write a line of code that evaluates to the service that has been requested in the fewest number of neighborhoods.
Strategy:
Choose a column to group by.
Choose an aggregation method. Some common ones are
.count(),.sum(),.mean(),.median(),.max(), and.min().Sort the resulting DataFrame and extract the relevant piece of information (the service name).
Note: On an assignment, in questions like this with multiple correct answers (a tie for the fewest), you can put any one of them. We'll accept any correct answer.
Observation #4
The column names of the output of
.groupbydon't make sense when using the.count()aggregation method.Consider dropping unneeded columns and renaming columns as follows:
Use
.assignto create a new column containing the same values as the old column(s).Use
.drop(columns=list_of_column_labels)to drop the old column(s).
More practice: IMDb dataset ποΈ
Question: How many movies appear from each decade?
Question: What was the highest rated movie of the 1990s?
Let's try to do this two different ways.
Without grouping
Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.
With grouping
It turns out that this method does not yield the correct answer.
When we use an aggregation method (e.g.
.max()), aggregation is done to each column individually.While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven β instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
Taking the
maxis not helpful here.
Challenge problems
We won't cover these problems in class, but they're here for you to practice with some harder examples. To access the solutions, you'll need to watch this solution walkthrough video (start at 10:00).
Before watching the video, make sure to try these problems on your own β they're great prep for homeworks, projects, and exams!
Question: How many years have more than 3 movies rated above 8.5?
Aside: Using .sum() on a Boolean array/Series
Summing a Boolean array/Series gives a count of the number of
Trueelements. This is because Python treatsTrueas 1 andFalseas 0.Can you use that fact here?
Question: Out of the years with more than 3 movies, which had the highest average rating?
Question: Which year had the longest movie titles, on average?
Hint: Use .str.len() on the column or index that contains the names of the movies.
Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?
Summary, next time
Summary
We can write queries that involve multiple conditions, as long as we:
Put parentheses around all conditions.
Separate conditions using
&if you require all to be true, or|if you require at least one to be true.
The method call
df.groupby(column_name).agg_method()aggregates all rows with the same value forcolumn_nameinto a single row in the resulting DataFrame, usingagg_method()to combine values.Aggregation methods to know:
.count(),.sum(),.mean(),.median(),.max(), and.min().
Next time
A picture is worth a 1000 words β it's time to visualize!