Project: CSCI 195

Path: Class Samples / datascience / group_by_lecture.ipynb

Views: ⁵⁹³⁰
Image: ubuntu2004

Kernel: Python 3 (system-wide)

Using groupby to summarize information

Import some data from the Seaborn sample data sets, and do some exploration.

In [1]:

import pandas as pd
import numpy as np

import seaborn as sns
mpg = sns.load_dataset('mpg')
mpg.head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	usa	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320
2	18.0	8	318.0	150.0	3436	11.0	70	usa	plymouth satellite
3	16.0	8	304.0	150.0	3433	12.0	70	usa	amc rebel sst
4	17.0	8	302.0	140.0	3449	10.5	70	usa	ford torino

Let's work with only cars made in the USA.

Look at what possible values are in the origin column.
Create a new DataFrame named usa with only those rows in mpg for which the value of the origin column is usa.
Verify there is only 1 value in the origin column in `usa.

In [2]:

mpg['origin'].unique()
usa = mpg[mpg.loc[:, 'origin'] == 'usa']
print(usa['origin'].unique())
print(usa.head())

['usa']
    mpg  cylinders  displacement  horsepower  weight  acceleration  \
18.0          8         307.0       130.0    3504          12.0   
15.0          8         350.0       165.0    3693          11.5   
18.0          8         318.0       150.0    3436          11.0   
16.0          8         304.0       150.0    3433          12.0   
17.0          8         302.0       140.0    3449          10.5   

   model_year origin                       name  
        70    usa  chevrolet chevelle malibu  
        70    usa          buick skylark 320  
        70    usa         plymouth satellite  
        70    usa              amc rebel sst  
        70    usa                ford torino  

Now let's specify that we want to create some groupings based on the model_year attribute and see what is contained in the groups property of the DataFrameGroupBy generated by calling groupby.

In [3]:

grouped = usa.groupby('model_year')
grouped.groups

{70: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 24, 25, 26, 27, 28], 71: [30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 56], 72: [58, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 80, 83], 73: [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 109, 112, 113, 115, 116, 121, 124], 74: [125, 126, 127, 128, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 146], 75: [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 168, 169, 170, 174, 176], 76: [184, 185, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 201, 202, 206, 208, 212, 213, 214, 215], 77: [217, 219, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 234, 236, 237, 238], 78: [245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 269, 271, 272], 79: [280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 295, 296, 298, 300, 301, 302, 305, 306, 307, 308], 80: [311, 313, 314, 315, 316, 323, 336], 81: [338, 339, 340, 341, 342, 344, 350, 351, 352, 363, 364, 365, 366], 82: [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 386, 387, 388, 389, 391, 392, 393, 395, 396, 397]}

Let's see what the values associated with the key 70 in grouped.groups are.

In [4]:

usa.loc[grouped.groups[70]].head(20)

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	usa	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320
2	18.0	8	318.0	150.0	3436	11.0	70	usa	plymouth satellite
3	16.0	8	304.0	150.0	3433	12.0	70	usa	amc rebel sst
4	17.0	8	302.0	140.0	3449	10.5	70	usa	ford torino
5	15.0	8	429.0	198.0	4341	10.0	70	usa	ford galaxie 500
6	14.0	8	454.0	220.0	4354	9.0	70	usa	chevrolet impala
7	14.0	8	440.0	215.0	4312	8.5	70	usa	plymouth fury iii
8	14.0	8	455.0	225.0	4425	10.0	70	usa	pontiac catalina
9	15.0	8	390.0	190.0	3850	8.5	70	usa	amc ambassador dpl
10	15.0	8	383.0	170.0	3563	10.0	70	usa	dodge challenger se
11	14.0	8	340.0	160.0	3609	8.0	70	usa	plymouth 'cuda 340
12	15.0	8	400.0	150.0	3761	9.5	70	usa	chevrolet monte carlo
13	14.0	8	455.0	225.0	3086	10.0	70	usa	buick estate wagon (sw)
15	22.0	6	198.0	95.0	2833	15.5	70	usa	plymouth duster
16	18.0	6	199.0	97.0	2774	15.5	70	usa	amc hornet
17	21.0	6	200.0	85.0	2587	16.0	70	usa	ford maverick
24	21.0	6	199.0	90.0	2648	15.0	70	usa	amc gremlin
25	10.0	8	360.0	215.0	4615	14.0	70	usa	ford f250
26	10.0	8	307.0	200.0	4376	15.0	70	usa	chevy c20

Remember loc uses index values, not list positions. Let's prove that a minute. First we'll change the index values.

In [5]:

new_index = pd.RangeIndex(100, len(usa)+100)
usa_with_new_index = usa.set_index(new_index)
usa_with_new_index.head()
#print(len(usa))

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
100	18.0	8	307.0	130.0	3504	12.0	70	usa	chevrolet chevelle malibu
101	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320
102	18.0	8	318.0	150.0	3436	11.0	70	usa	plymouth satellite
103	16.0	8	304.0	150.0	3433	12.0	70	usa	amc rebel sst
104	17.0	8	302.0	140.0	3449	10.5	70	usa	ford torino

Now re-compute the groups based on the model_year attribute.

In [6]:

new_index_grouped = usa_with_new_index.groupby('model_year')
new_index_grouped.groups

{70: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121], 71: [122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 72: [142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159], 73: [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188], 74: [189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203], 75: [204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 76: [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 77: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263], 78: [264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285], 79: [286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 80: [309, 310, 311, 312, 313, 314, 315], 81: [316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328], 82: [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348]}

This hopefully proves that the values in the groups dictionary represent index values.

Now let's look at grouping by multiple attributes; in particular, we'll group by model_year followed by cylinders.

In [7]:

grouped = usa.groupby(['model_year', 'cylinders'])
grouped.groups

{(70, 6): [15, 16, 17, 24], (70, 8): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25, 26, 27, 28], (71, 4): [30, 32, 46, 49, 56], (71, 6): [33, 34, 35, 36, 37, 45, 47, 48], (71, 8): [38, 39, 40, 41, 42, 43, 44], (72, 4): [58, 60, 61, 80, 83], (72, 8): [62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75], (73, 4): [109, 112], (73, 6): [97, 98, 99, 100, 101, 107, 113], (73, 8): [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 103, 104, 105, 106, 115, 116, 121, 124], (74, 4): [130, 132, 146], (74, 6): [125, 126, 127, 128, 133, 134, 135], (74, 8): [136, 137, 138, 139, 140], (75, 4): [168, 170], (75, 6): [152, 153, 154, 155, 160, 161, 162, 163, 164, 169, 174, 176], (75, 8): [156, 157, 158, 159, 165, 166], (76, 4): [184, 185, 195, 196, 206], (76, 6): [191, 192, 193, 194, 199, 200, 201, 202], (76, 8): [187, 188, 189, 190, 208, 212, 213, 214, 215], (77, 4): [217, 219, 234, 236, 237, 238], (77, 6): [225, 226, 227, 228], (77, 8): [221, 222, 223, 224, 229, 230, 231, 232], (78, 4): [245, 255, 266, 269, 271, 272], (78, 6): [252, 253, 254, 256, 257, 258, 259, 260, 261, 263], (78, 8): [249, 250, 251, 262, 264, 265], (79, 4): [282, 295, 296, 301, 302, 305, 308], (79, 6): [280, 281, 283, 284, 306, 307], (79, 8): [285, 286, 287, 288, 289, 290, 291, 292, 298, 300], (80, 4): [311, 313, 314, 315, 323, 336], (80, 6): [316], (81, 4): [338, 339, 340, 342, 344, 350, 351, 352], (81, 6): [341, 363, 365, 366], (81, 8): [364], (82, 4): [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 388, 391, 392, 393, 395, 396, 397], (82, 6): [386, 387, 389]}

Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this

In [8]:

usa.loc[grouped.groups[(70,6)]]

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
15	22.0	6	198.0	95.0	2833	15.5	70	usa	plymouth duster
16	18.0	6	199.0	97.0	2774	15.5	70	usa	amc hornet
17	21.0	6	200.0	85.0	2587	16.0	70	usa	ford maverick
24	21.0	6	199.0	90.0	2648	15.0	70	usa	amc gremlin

And we can get the average mpg for that specific set of vehicles by finding them using loc, and then using the Series method mean to compute the average

In [9]:

usa.loc[grouped.groups[(70,6)],["mpg", "horsepower"]].mean()

mpg           20.50
horsepower    91.75
dtype: float64

We can get the average mpg for every (model_year, cylinders) combination using the DataFrameGroupBy object's mean method.

In [10]:

means = grouped.mean()
means.head()

		mpg	displacement	horsepower	weight	acceleration
model_year	cylinders
70	6	20.500000	199.000000	91.750000	2710.500000	15.500000
70	8	14.111111	367.555556	183.666667	3940.055556	11.194444
71	4	24.800000	118.200000	79.500000	2178.600000	17.600000
	6	18.000000	243.375000	98.875000	3171.875000	14.750000
	8	13.428571	371.714286	166.857143	4537.714286	12.214286

The index of the DateFrame that comes back from DataFrameGroupBy.mean is a multi-level index.

In [11]:

means.index

MultiIndex([(70, 6),
            (70, 8),
            (71, 4),
            (71, 6),
            (71, 8),
            (72, 4),
            (72, 8),
            (73, 4),
            (73, 6),
            (73, 8),
            (74, 4),
            (74, 6),
            (74, 8),
            (75, 4),
            (75, 6),
            (75, 8),
            (76, 4),
            (76, 6),
            (76, 8),
            (77, 4),
            (77, 6),
            (77, 8),
            (78, 4),
            (78, 6),
            (78, 8),
            (79, 4),
            (79, 6),
            (79, 8),
            (80, 4),
            (80, 6),
            (81, 4),
            (81, 6),
            (81, 8),
            (82, 4),
            (82, 6)],
           names=['model_year', 'cylinders'])

Here's an example from our book that illustrates an important idea when passing a Series object as the field to group by. Consider the following DataFrame named df.

In [12]:

df = pd.DataFrame({
        'key1': ['a','a', 'b', 'b', 'a'],
        'key2': ['one', 'two', 'one', 'two', 'one'],
        'data1': np.random.randn(5),
        'data2': np.random.randn(5)
});

df

	key1	key2	data1	data2
0	a	one	1.628402	0.630337
1	a	two	-1.107394	0.022734
2	b	one	0.475426	-0.792190
3	b	two	-1.547331	-1.789632
4	a	one	1.078556	-1.490519

The book uses the following code to compute the average of the data1 column, grouped by the values in the key1 column.

In [13]:

df['data1'].groupby(df['key1']).mean()

key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64

In [14]:

print((df['data1']))
print(df['key1'])

  1.628402
 -1.107394
  0.475426
 -1.547331
  1.078556
Name: data1, dtype: float64
  a
  a
  b
  b
  a
Name: key1, dtype: object

Another way to write this is as follows:

In [15]:

df.groupby(df['key1']).mean()['data1']

key1
a    0.533188
b   -0.535952
Name: data1, dtype: float64

Look carefully at the first example again, particularly focusing on what's being grouped:

df['data1'].groupby(df['key1']).mean()

Does this raise any questions in your head?

Consider the code in the following 2 cells.

In [16]:

key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])
print(df.head())
print()
print(key1)

  key1 key2     data1     data2
  a  one  1.628402  0.630337
  a  two -1.107394  0.022734
  b  one  0.475426 -0.792190
  b  two -1.547331 -1.789632
  a  one  1.078556 -1.490519

  a
  b
  b
  a
  a
dtype: object

In [17]:

grouped_by_key1 = df['data1'].groupby(key1)
grouped_by_key1.groups

{'b': [0]}

What gives? The reason the original code works is because the index values are the same.

df['data1'].groupby(df['key1']).mean()

In the version above:

key1 = pd.Series(data=df['data1'].values, index=[5, 6, 7, 8, 9])
grouped_by_key1 = df['data1'].groupby(key1)

there's no match in key1 from the index values in df['data1'], so we don't get any groups.

Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our mpg dataset by using the name column values as the index.

In [18]:

usa_model_index = usa.set_index('name')
usa_model_index.head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin
name
chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	70	usa
buick skylark 320	15.0	8	350.0	165.0	3693	11.5	70	usa
plymouth satellite	18.0	8	318.0	150.0	3436	11.0	70	usa
amc rebel sst	16.0	8	304.0	150.0	3433	12.0	70	usa
ford torino	17.0	8	302.0	140.0	3449	10.5	70	usa

Now we can group by just the vehicle's make by creating a function that extracts the make from each index value.

In [19]:

def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    return parts[0]

grouped_by_make = usa_model_index.groupby(lambda name: name.strip().split()[0])
grouped_by_make.groups.keys()

dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevroelt', 'chevrolet', 'chevy', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

We see some problems here:

chevroelt should be chevrolet
chevy is an abbreviation for chevrolet.

A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.

In [20]:

def get_make(name):
    stripper = name.strip()
    parts = stripper.split()
    if parts[0] in ["chevy", "chevrolet", "chevroelt"]:
        return "chevrolet"
    else:
        return parts[0]

Let's pass the get_make function as the argument to groupby and look at the resulting groups.

In [21]:

grouped_by_make = usa_model_index.groupby(get_make)
grouped_by_make.groups.keys()

dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevrolet', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])

Now find the average mpg for each make based on the keys returned by the get_make function.

In [22]:

print(grouped_by_make.mean()['mpg'])
print(grouped_by_make['mpg'].mean())

name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64
name
amc           18.246429
buick         19.182353
cadillac      19.750000
capri         25.000000
chevrolet     20.219149
chrysler      17.266667
dodge         22.060714
ford          19.694118
hi             9.000000
mercury       19.118182
oldsmobile    21.100000
plymouth      21.703226
pontiac       20.012500
Name: mpg, dtype: float64

Get the average miles per gallon for the 3 makes that appear the most often in our data set.

To get started, sort the result of calling the size aggregate function on grouped_by_make in descending order, storing the result in a variable named most_frequently_occurring.

In [31]:

size_by_make = grouped_by_make.size()
most_frequently_occurring = size_by_make.sort_values(ascending=False)
makes_of_most_frequently_occurring = most_frequently_occurring.index[0:3]
makes_of_most_frequently_occurring

Index(['ford', 'chevrolet', 'plymouth'], dtype='object', name='name')

Now use the index values from most_frequently_occurring to select elements from mean_mpg_by_make.

In [32]:

mean_mpg_by_make = grouped_by_make.mean()['mpg']
mean_mpg_by_make.loc[makes_of_most_frequently_occurring]

name
ford         19.694118
chevrolet    20.219149
plymouth     21.703226
Name: mpg, dtype: float64

We can also create a new DataFrame from the Series objects mean_mpg_by_make and size_by_make, sort it using sort_values, and get the first 3 elements using iloc.

In [0]:

summary = pd.DataFrame(
  data = {
        "Avg. MPG": mean_mpg_by_make,
        "Num. Vehicles": size_by_make
    }
)
summary.sort_values('Num. Vehicles', ascending=False).iloc[0:3]

Finally, look at summarizing by Make and Year. Get started by creating a multi-level index on those 2 columns.

In [0]:

usa_model_year_idx = usa.set_index(['name', 'model_year'])
usa_model_year_idx.head()

Now we can group by a combination of the car's make and year

In [0]:

And find the average miles per gallon for all vehicles manufactured by Ford in 1971.

In [0]: