Using groupby to summarize information
Import some data from the Seaborn sample data sets, and do some exploration.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
Let's work with only cars made in the USA.
Look at what possible values are in the
origin
column.Create a new
DataFrame
namedusa
with only those rows inmpg
for which the value of theorigin
column is usa.Verify there is only 1 value in the
origin
column in `usa.
Now let's specify that we want to create some groupings based on the model_year
attribute and see what is contained in the groups
property of the DataFrameGroupBy
generated by calling groupby
.
Let's see what the values associated with the key 70 in grouped.groups
are.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
5 | 15.0 | 8 | 429.0 | 198.0 | 4341 | 10.0 | 70 | usa | ford galaxie 500 |
6 | 14.0 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | usa | chevrolet impala |
7 | 14.0 | 8 | 440.0 | 215.0 | 4312 | 8.5 | 70 | usa | plymouth fury iii |
8 | 14.0 | 8 | 455.0 | 225.0 | 4425 | 10.0 | 70 | usa | pontiac catalina |
9 | 15.0 | 8 | 390.0 | 190.0 | 3850 | 8.5 | 70 | usa | amc ambassador dpl |
10 | 15.0 | 8 | 383.0 | 170.0 | 3563 | 10.0 | 70 | usa | dodge challenger se |
11 | 14.0 | 8 | 340.0 | 160.0 | 3609 | 8.0 | 70 | usa | plymouth 'cuda 340 |
12 | 15.0 | 8 | 400.0 | 150.0 | 3761 | 9.5 | 70 | usa | chevrolet monte carlo |
13 | 14.0 | 8 | 455.0 | 225.0 | 3086 | 10.0 | 70 | usa | buick estate wagon (sw) |
15 | 22.0 | 6 | 198.0 | 95.0 | 2833 | 15.5 | 70 | usa | plymouth duster |
16 | 18.0 | 6 | 199.0 | 97.0 | 2774 | 15.5 | 70 | usa | amc hornet |
17 | 21.0 | 6 | 200.0 | 85.0 | 2587 | 16.0 | 70 | usa | ford maverick |
24 | 21.0 | 6 | 199.0 | 90.0 | 2648 | 15.0 | 70 | usa | amc gremlin |
25 | 10.0 | 8 | 360.0 | 215.0 | 4615 | 14.0 | 70 | usa | ford f250 |
26 | 10.0 | 8 | 307.0 | 200.0 | 4376 | 15.0 | 70 | usa | chevy c20 |
Remember loc
uses index values, not list positions. Let's prove that a minute. First we'll change the index values.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
100 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
101 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
102 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
103 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
104 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
Now re-compute the groups based on the model_year
attribute.
This hopefully proves that the values in the groups
dictionary represent index values.
Now let's look at grouping by multiple attributes; in particular, we'll group by model_year
followed by cylinders
.
Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
15 | 22.0 | 6 | 198.0 | 95.0 | 2833 | 15.5 | 70 | usa | plymouth duster |
16 | 18.0 | 6 | 199.0 | 97.0 | 2774 | 15.5 | 70 | usa | amc hornet |
17 | 21.0 | 6 | 200.0 | 85.0 | 2587 | 16.0 | 70 | usa | ford maverick |
24 | 21.0 | 6 | 199.0 | 90.0 | 2648 | 15.0 | 70 | usa | amc gremlin |
And we can get the average mpg for that specific set of vehicles by finding them using loc
, and then using the Series
method mean
to compute the average
We can get the average mpg
for every (model_year, cylinders)
combination using the DataFrameGroupBy
object's mean
method.
mpg | displacement | horsepower | weight | acceleration | ||
---|---|---|---|---|---|---|
model_year | cylinders | |||||
70 | 6 | 20.500000 | 199.000000 | 91.750000 | 2710.500000 | 15.500000 |
8 | 14.111111 | 367.555556 | 183.666667 | 3940.055556 | 11.194444 | |
71 | 4 | 24.800000 | 118.200000 | 79.500000 | 2178.600000 | 17.600000 |
6 | 18.000000 | 243.375000 | 98.875000 | 3171.875000 | 14.750000 | |
8 | 13.428571 | 371.714286 | 166.857143 | 4537.714286 | 12.214286 |
The index of the DateFrame
that comes back from DataFrameGroupBy.mean
is a multi-level index.
Here's an example from our book that illustrates an important idea when passing a Series
object as the field to group by. Consider the following DataFrame
named df
.
key1 | key2 | data1 | data2 | |
---|---|---|---|---|
0 | a | one | 1.628402 | 0.630337 |
1 | a | two | -1.107394 | 0.022734 |
2 | b | one | 0.475426 | -0.792190 |
3 | b | two | -1.547331 | -1.789632 |
4 | a | one | 1.078556 | -1.490519 |
The book uses the following code to compute the average of the data1
column, grouped by the values in the key1
column.
Another way to write this is as follows:
Look carefully at the first example again, particularly focusing on what's being grouped:
Does this raise any questions in your head?
Consider the code in the following 2 cells.
What gives? The reason the original code works is because the index values are the same.
In the version above:
there's no match in key1
from the index values in df['data1']
, so we don't get any groups.
Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our mpg
dataset by using the name
column values as the index.
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | |
---|---|---|---|---|---|---|---|---|
name | ||||||||
chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa |
buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa |
plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa |
amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa |
ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa |
Now we can group by just the vehicle's make by creating a function that extracts the make from each index value.
We see some problems here:
chevroelt should be chevrolet
chevy is an abbreviation for chevrolet.
A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.
Let's pass the get_make
function as the argument to groupby
and look at the resulting groups.
Now find the average mpg
for each make based on the keys returned by the get_make
function.
Get the average miles per gallon for the 3 makes that appear the most often in our data set.
To get started, sort the result of calling the size
aggregate function on grouped_by_make
in descending order, storing the result in a variable named most_frequently_occurring
.
Now use the index values from most_frequently_occurring
to select elements from mean_mpg_by_make
.
We can also create a new DataFrame
from the Series
objects mean_mpg_by_make
and size_by_make
, sort it using sort_values
, and get the first 3 elements using iloc
.
Finally, look at summarizing by Make
and Year
. Get started by creating a multi-level index on those 2 columns.
Now we can group by a combination of the car's make and year
And find the average miles per gallon for all vehicles manufactured by Ford in 1971.