Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download
Project: CSCI 195
Views: 5930
Image: ubuntu2004
Kernel: Python 3 (system-wide)

Using groupby to summarize information

Import some data from the Seaborn sample data sets, and do some exploration.

import pandas as pd import numpy as np import seaborn as sns mpg = sns.load_dataset('mpg') mpg.head()
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino

Let's work with only cars made in the USA.

  • Look at what possible values are in the origin column.

  • Create a new DataFrame named usa with only those rows in mpg for which the value of the origin column is usa.

  • Verify there is only 1 value in the origin column in `usa.

mpg['origin'].unique() usa = mpg[mpg.loc[:, 'origin'] == 'usa'] print(usa['origin'].unique()) print(usa.head())
['usa'] mpg cylinders displacement horsepower weight acceleration \ 0 18.0 8 307.0 130.0 3504 12.0 1 15.0 8 350.0 165.0 3693 11.5 2 18.0 8 318.0 150.0 3436 11.0 3 16.0 8 304.0 150.0 3433 12.0 4 17.0 8 302.0 140.0 3449 10.5 model_year origin name 0 70 usa chevrolet chevelle malibu 1 70 usa buick skylark 320 2 70 usa plymouth satellite 3 70 usa amc rebel sst 4 70 usa ford torino

Now let's specify that we want to create some groupings based on the model_year attribute and see what is contained in the groups property of the DataFrameGroupBy generated by calling groupby.

grouped = usa.groupby('model_year') grouped.groups
{70: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 24, 25, 26, 27, 28], 71: [30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 56], 72: [58, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 80, 83], 73: [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 109, 112, 113, 115, 116, 121, 124], 74: [125, 126, 127, 128, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 146], 75: [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 168, 169, 170, 174, 176], 76: [184, 185, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 201, 202, 206, 208, 212, 213, 214, 215], 77: [217, 219, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 234, 236, 237, 238], 78: [245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 269, 271, 272], 79: [280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 295, 296, 298, 300, 301, 302, 305, 306, 307, 308], 80: [311, 313, 314, 315, 316, 323, 336], 81: [338, 339, 340, 341, 342, 344, 350, 351, 352, 363, 364, 365, 366], 82: [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 386, 387, 388, 389, 391, 392, 393, 395, 396, 397]}

Let's see what the values associated with the key 70 in grouped.groups are.

usa.loc[grouped.groups[70]].head(20)
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
5 15.0 8 429.0 198.0 4341 10.0 70 usa ford galaxie 500
6 14.0 8 454.0 220.0 4354 9.0 70 usa chevrolet impala
7 14.0 8 440.0 215.0 4312 8.5 70 usa plymouth fury iii
8 14.0 8 455.0 225.0 4425 10.0 70 usa pontiac catalina
9 15.0 8 390.0 190.0 3850 8.5 70 usa amc ambassador dpl
10 15.0 8 383.0 170.0 3563 10.0 70 usa dodge challenger se
11 14.0 8 340.0 160.0 3609 8.0 70 usa plymouth 'cuda 340
12 15.0 8 400.0 150.0 3761 9.5 70 usa chevrolet monte carlo
13 14.0 8 455.0 225.0 3086 10.0 70 usa buick estate wagon (sw)
15 22.0 6 198.0 95.0 2833 15.5 70 usa plymouth duster
16 18.0 6 199.0 97.0 2774 15.5 70 usa amc hornet
17 21.0 6 200.0 85.0 2587 16.0 70 usa ford maverick
24 21.0 6 199.0 90.0 2648 15.0 70 usa amc gremlin
25 10.0 8 360.0 215.0 4615 14.0 70 usa ford f250
26 10.0 8 307.0 200.0 4376 15.0 70 usa chevy c20

Remember loc uses index values, not list positions. Let's prove that a minute. First we'll change the index values.

new_index = pd.RangeIndex(100, len(usa)+100) usa_with_new_index = usa.set_index(new_index) usa_with_new_index.head() #print(len(usa))
mpg cylinders displacement horsepower weight acceleration model_year origin name
100 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
101 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
102 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
103 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
104 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino

Now re-compute the groups based on the model_year attribute.

new_index_grouped = usa_with_new_index.groupby('model_year') new_index_grouped.groups
{70: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121], 71: [122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 72: [142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159], 73: [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188], 74: [189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203], 75: [204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 76: [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 77: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263], 78: [264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285], 79: [286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 80: [309, 310, 311, 312, 313, 314, 315], 81: [316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328], 82: [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348]}

This hopefully proves that the values in the groups dictionary represent index values.

Now let's look at grouping by multiple attributes; in particular, we'll group by model_year followed by cylinders.

grouped = usa.groupby(['model_year', 'cylinders']) grouped.groups
{(70, 6): [15, 16, 17, 24], (70, 8): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25, 26, 27, 28], (71, 4): [30, 32, 46, 49, 56], (71, 6): [33, 34, 35, 36, 37, 45, 47, 48], (71, 8): [38, 39, 40, 41, 42, 43, 44], (72, 4): [58, 60, 61, 80, 83], (72, 8): [62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75], (73, 4): [109, 112], (73, 6): [97, 98, 99, 100, 101, 107, 113], (73, 8): [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 103, 104, 105, 106, 115, 116, 121, 124], (74, 4): [130, 132, 146], (74, 6): [125, 126, 127, 128, 133, 134, 135], (74, 8): [136, 137, 138, 139, 140], (75, 4): [168, 170], (75, 6): [152, 153, 154, 155, 160, 161, 162, 163, 164, 169, 174, 176], (75, 8): [156, 157, 158, 159, 165, 166], (76, 4): [184, 185, 195, 196, 206], (76, 6): [191, 192, 193, 194, 199, 200, 201, 202], (76, 8): [187, 188, 189, 190, 208, 212, 213, 214, 215], (77, 4): [217, 219, 234, 236, 237, 238], (77, 6): [225, 226, 227, 228], (77, 8): [221, 222, 223, 224, 229, 230, 231, 232], (78, 4): [245, 255, 266, 269, 271, 272], (78, 6): [252, 253, 254, 256, 257, 258, 259, 260, 261, 263], (78, 8): [249, 250, 251, 262, 264, 265], (79, 4): [282, 295, 296, 301, 302, 305, 308], (79, 6): [280, 281, 283, 284, 306, 307], (79, 8): [285, 286, 287, 288, 289, 290, 291, 292, 298, 300], (80, 4): [311, 313, 314, 315, 323, 336], (80, 6): [316], (81, 4): [338, 339, 340, 342, 344, 350, 351, 352], (81, 6): [341, 363, 365, 366], (81, 8): [364], (82, 4): [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 388, 391, 392, 393, 395, 396, 397], (82, 6): [386, 387, 389]}

Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this

usa.loc[grouped.groups[(70,6)]]
mpg cylinders displacement horsepower weight acceleration model_year origin name
15 22.0 6 198.0 95.0 2833 15.5 70 usa plymouth duster
16 18.0 6 199.0 97.0 2774 15.5 70 usa amc hornet
17 21.0 6 200.0 85.0 2587 16.0 70 usa ford maverick
24 21.0 6 199.0 90.0 2648 15.0 70 usa amc gremlin

And we can get the average mpg for that specific set of vehicles by finding them using loc, and then using the Series method mean to compute the average

usa.loc[grouped.groups[(70,6)],["mpg", "horsepower"]].mean()
mpg 20.50 horsepower 91.75 dtype: float64

We can get the average mpg for every (model_year, cylinders) combination using the DataFrameGroupBy object's mean method.

means = grouped.mean() means.head()
mpg displacement horsepower weight acceleration
model_year cylinders
70 6 20.500000 199.000000 91.750000 2710.500000 15.500000
8 14.111111 367.555556 183.666667 3940.055556 11.194444
71 4 24.800000 118.200000 79.500000 2178.600000 17.600000
6 18.000000 243.375000 98.875000 3171.875000 14.750000
8 13.428571 371.714286 166.857143 4537.714286 12.214286

The index of the DateFrame that comes back from DataFrameGroupBy.mean is a multi-level index.

means.index
MultiIndex([(70, 6), (70, 8), (71, 4), (71, 6), (71, 8), (72, 4), (72, 8), (73, 4), (73, 6), (73, 8), (74, 4), (74, 6), (74, 8), (75, 4), (75, 6), (75, 8), (76, 4), (76, 6), (76, 8), (77, 4), (77, 6), (77, 8), (78, 4), (78, 6), (78, 8), (79, 4), (79, 6), (79, 8), (80, 4), (80, 6), (81, 4), (81, 6), (81, 8), (82, 4), (82, 6)], names=['model_year', 'cylinders'])

Here's an example from our book that illustrates an important idea when passing a Series object as the field to group by. Consider the following DataFrame named df.

df = pd.DataFrame({ 'key1': ['a','a', 'b', 'b', 'a'], 'key2': ['one', 'two', 'one', 'two', 'one'], 'data1': np.random.randn(5), 'data2': np.random.randn(5) }); df
key1 key2 data1 data2
0 a one 1.628402 0.630337
1 a two -1.107394 0.022734
2 b one 0.475426 -0.792190
3 b two -1.547331 -1.789632
4 a one 1.078556 -1.490519

The book uses the following code to compute the average of the data1 column, grouped by the values in the key1 column.

df['data1'].groupby(df['key1']).mean()
key1 a 0.533188 b -0.535952 Name: data1, dtype: float64
print((df['data1'])) print(df['key1'])
0 1.628402 1 -1.107394 2 0.475426 3 -1.547331 4 1.078556 Name: data1, dtype: float64 0 a 1 a 2 b 3 b 4 a Name: key1, dtype: object

Another way to write this is as follows:

df.groupby(df['key1']).mean()['data1']
key1 a 0.533188 b -0.535952 Name: data1, dtype: float64

Look carefully at the first example again, particularly focusing on what's being grouped:

df['data1'].groupby(df['key1']).mean()

Does this raise any questions in your head?

Consider the code in the following 2 cells.

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9]) print(df.head()) print() print(key1)
key1 key2 data1 data2 0 a one 1.628402 0.630337 1 a two -1.107394 0.022734 2 b one 0.475426 -0.792190 3 b two -1.547331 -1.789632 4 a one 1.078556 -1.490519 5 a 0 b 7 b 8 a 9 a dtype: object
grouped_by_key1 = df['data1'].groupby(key1) grouped_by_key1.groups
{'b': [0]}

What gives? The reason the original code works is because the index values are the same.

df['data1'].groupby(df['key1']).mean()

In the version above:

key1 = pd.Series(data=df['data1'].values, index=[5, 6, 7, 8, 9]) grouped_by_key1 = df['data1'].groupby(key1)

there's no match in key1 from the index values in df['data1'], so we don't get any groups.

Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our mpg dataset by using the name column values as the index.

usa_model_index = usa.set_index('name') usa_model_index.head()
mpg cylinders displacement horsepower weight acceleration model_year origin
name
chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 70 usa
buick skylark 320 15.0 8 350.0 165.0 3693 11.5 70 usa
plymouth satellite 18.0 8 318.0 150.0 3436 11.0 70 usa
amc rebel sst 16.0 8 304.0 150.0 3433 12.0 70 usa
ford torino 17.0 8 302.0 140.0 3449 10.5 70 usa

Now we can group by just the vehicle's make by creating a function that extracts the make from each index value.

def get_make(name): stripper = name.strip() parts = stripper.split() return parts[0] grouped_by_make = usa_model_index.groupby(lambda name: name.strip().split()[0]) grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevroelt', 'chevrolet', 'chevy', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

We see some problems here:

  • chevroelt should be chevrolet

  • chevy is an abbreviation for chevrolet.

A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.

def get_make(name): stripper = name.strip() parts = stripper.split() if parts[0] in ["chevy", "chevrolet", "chevroelt"]: return "chevrolet" else: return parts[0]

Let's pass the get_make function as the argument to groupby and look at the resulting groups.

grouped_by_make = usa_model_index.groupby(get_make) grouped_by_make.groups.keys()
dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevrolet', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

Now find the average mpg for each make based on the keys returned by the get_make function.

print(grouped_by_make.mean()['mpg']) print(grouped_by_make['mpg'].mean())
name amc 18.246429 buick 19.182353 cadillac 19.750000 capri 25.000000 chevrolet 20.219149 chrysler 17.266667 dodge 22.060714 ford 19.694118 hi 9.000000 mercury 19.118182 oldsmobile 21.100000 plymouth 21.703226 pontiac 20.012500 Name: mpg, dtype: float64 name amc 18.246429 buick 19.182353 cadillac 19.750000 capri 25.000000 chevrolet 20.219149 chrysler 17.266667 dodge 22.060714 ford 19.694118 hi 9.000000 mercury 19.118182 oldsmobile 21.100000 plymouth 21.703226 pontiac 20.012500 Name: mpg, dtype: float64

Get the average miles per gallon for the 3 makes that appear the most often in our data set.

To get started, sort the result of calling the size aggregate function on grouped_by_make in descending order, storing the result in a variable named most_frequently_occurring.

size_by_make = grouped_by_make.size() most_frequently_occurring = size_by_make.sort_values(ascending=False) makes_of_most_frequently_occurring = most_frequently_occurring.index[0:3] makes_of_most_frequently_occurring
Index(['ford', 'chevrolet', 'plymouth'], dtype='object', name='name')

Now use the index values from most_frequently_occurring to select elements from mean_mpg_by_make.

mean_mpg_by_make = grouped_by_make.mean()['mpg'] mean_mpg_by_make.loc[makes_of_most_frequently_occurring]
name ford 19.694118 chevrolet 20.219149 plymouth 21.703226 Name: mpg, dtype: float64

We can also create a new DataFrame from the Series objects mean_mpg_by_make and size_by_make, sort it using sort_values, and get the first 3 elements using iloc.

summary = pd.DataFrame( data = { "Avg. MPG": mean_mpg_by_make, "Num. Vehicles": size_by_make } ) summary.sort_values('Num. Vehicles', ascending=False).iloc[0:3]

Finally, look at summarizing by Make and Year. Get started by creating a multi-level index on those 2 columns.

usa_model_year_idx = usa.set_index(['name', 'model_year']) usa_model_year_idx.head()

Now we can group by a combination of the car's make and year

And find the average miles per gallon for all vehicles manufactured by Ford in 1971.