{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# Using groupby to summarize information\n",
"\n",
"Import some data from the Seaborn sample data sets, and do some exploration."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model_year | \n",
" origin | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" chevrolet chevelle malibu | \n",
"
\n",
" \n",
" 1 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 70 | \n",
" usa | \n",
" buick skylark 320 | \n",
"
\n",
" \n",
" 2 | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 70 | \n",
" usa | \n",
" plymouth satellite | \n",
"
\n",
" \n",
" 3 | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" amc rebel sst | \n",
"
\n",
" \n",
" 4 | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 70 | \n",
" usa | \n",
" ford torino | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight acceleration \\\n",
"0 18.0 8 307.0 130.0 3504 12.0 \n",
"1 15.0 8 350.0 165.0 3693 11.5 \n",
"2 18.0 8 318.0 150.0 3436 11.0 \n",
"3 16.0 8 304.0 150.0 3433 12.0 \n",
"4 17.0 8 302.0 140.0 3449 10.5 \n",
"\n",
" model_year origin name \n",
"0 70 usa chevrolet chevelle malibu \n",
"1 70 usa buick skylark 320 \n",
"2 70 usa plymouth satellite \n",
"3 70 usa amc rebel sst \n",
"4 70 usa ford torino "
]
},
"execution_count": 1,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"import seaborn as sns\n",
"mpg = sns.load_dataset('mpg')\n",
"mpg.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Let's work with only cars made in the USA. \n",
"* Look at what possible values are in the `origin` column.\n",
"* Create a new `DataFrame` named `usa` with only those rows in `mpg` for which the value of the `origin` column is **usa**.\n",
"* Verify there is only 1 value in the `origin` column in `usa."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['usa']\n",
" mpg cylinders displacement horsepower weight acceleration \\\n",
"0 18.0 8 307.0 130.0 3504 12.0 \n",
"1 15.0 8 350.0 165.0 3693 11.5 \n",
"2 18.0 8 318.0 150.0 3436 11.0 \n",
"3 16.0 8 304.0 150.0 3433 12.0 \n",
"4 17.0 8 302.0 140.0 3449 10.5 \n",
"\n",
" model_year origin name \n",
"0 70 usa chevrolet chevelle malibu \n",
"1 70 usa buick skylark 320 \n",
"2 70 usa plymouth satellite \n",
"3 70 usa amc rebel sst \n",
"4 70 usa ford torino \n"
]
}
],
"source": [
"mpg['origin'].unique()\n",
"usa = mpg[mpg.loc[:, 'origin'] == 'usa']\n",
"print(usa['origin'].unique())\n",
"print(usa.head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now let's specify that we want to create some groupings based on the `model_year` attribute and see what is contained in the `groups` property of the `DataFrameGroupBy` generated by calling `groupby`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{70: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 24, 25, 26, 27, 28], 71: [30, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 56], 72: [58, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 80, 83], 73: [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 109, 112, 113, 115, 116, 121, 124], 74: [125, 126, 127, 128, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 146], 75: [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 168, 169, 170, 174, 176], 76: [184, 185, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 199, 200, 201, 202, 206, 208, 212, 213, 214, 215], 77: [217, 219, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 234, 236, 237, 238], 78: [245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 269, 271, 272], 79: [280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 295, 296, 298, 300, 301, 302, 305, 306, 307, 308], 80: [311, 313, 314, 315, 316, 323, 336], 81: [338, 339, 340, 341, 342, 344, 350, 351, 352, 363, 364, 365, 366], 82: [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 386, 387, 388, 389, 391, 392, 393, 395, 396, 397]}"
]
},
"execution_count": 3,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"grouped = usa.groupby('model_year')\n",
"grouped.groups"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Let's see what the values associated with the key **70** in `grouped.groups` are."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model_year | \n",
" origin | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" chevrolet chevelle malibu | \n",
"
\n",
" \n",
" 1 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 70 | \n",
" usa | \n",
" buick skylark 320 | \n",
"
\n",
" \n",
" 2 | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 70 | \n",
" usa | \n",
" plymouth satellite | \n",
"
\n",
" \n",
" 3 | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" amc rebel sst | \n",
"
\n",
" \n",
" 4 | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 70 | \n",
" usa | \n",
" ford torino | \n",
"
\n",
" \n",
" 5 | \n",
" 15.0 | \n",
" 8 | \n",
" 429.0 | \n",
" 198.0 | \n",
" 4341 | \n",
" 10.0 | \n",
" 70 | \n",
" usa | \n",
" ford galaxie 500 | \n",
"
\n",
" \n",
" 6 | \n",
" 14.0 | \n",
" 8 | \n",
" 454.0 | \n",
" 220.0 | \n",
" 4354 | \n",
" 9.0 | \n",
" 70 | \n",
" usa | \n",
" chevrolet impala | \n",
"
\n",
" \n",
" 7 | \n",
" 14.0 | \n",
" 8 | \n",
" 440.0 | \n",
" 215.0 | \n",
" 4312 | \n",
" 8.5 | \n",
" 70 | \n",
" usa | \n",
" plymouth fury iii | \n",
"
\n",
" \n",
" 8 | \n",
" 14.0 | \n",
" 8 | \n",
" 455.0 | \n",
" 225.0 | \n",
" 4425 | \n",
" 10.0 | \n",
" 70 | \n",
" usa | \n",
" pontiac catalina | \n",
"
\n",
" \n",
" 9 | \n",
" 15.0 | \n",
" 8 | \n",
" 390.0 | \n",
" 190.0 | \n",
" 3850 | \n",
" 8.5 | \n",
" 70 | \n",
" usa | \n",
" amc ambassador dpl | \n",
"
\n",
" \n",
" 10 | \n",
" 15.0 | \n",
" 8 | \n",
" 383.0 | \n",
" 170.0 | \n",
" 3563 | \n",
" 10.0 | \n",
" 70 | \n",
" usa | \n",
" dodge challenger se | \n",
"
\n",
" \n",
" 11 | \n",
" 14.0 | \n",
" 8 | \n",
" 340.0 | \n",
" 160.0 | \n",
" 3609 | \n",
" 8.0 | \n",
" 70 | \n",
" usa | \n",
" plymouth 'cuda 340 | \n",
"
\n",
" \n",
" 12 | \n",
" 15.0 | \n",
" 8 | \n",
" 400.0 | \n",
" 150.0 | \n",
" 3761 | \n",
" 9.5 | \n",
" 70 | \n",
" usa | \n",
" chevrolet monte carlo | \n",
"
\n",
" \n",
" 13 | \n",
" 14.0 | \n",
" 8 | \n",
" 455.0 | \n",
" 225.0 | \n",
" 3086 | \n",
" 10.0 | \n",
" 70 | \n",
" usa | \n",
" buick estate wagon (sw) | \n",
"
\n",
" \n",
" 15 | \n",
" 22.0 | \n",
" 6 | \n",
" 198.0 | \n",
" 95.0 | \n",
" 2833 | \n",
" 15.5 | \n",
" 70 | \n",
" usa | \n",
" plymouth duster | \n",
"
\n",
" \n",
" 16 | \n",
" 18.0 | \n",
" 6 | \n",
" 199.0 | \n",
" 97.0 | \n",
" 2774 | \n",
" 15.5 | \n",
" 70 | \n",
" usa | \n",
" amc hornet | \n",
"
\n",
" \n",
" 17 | \n",
" 21.0 | \n",
" 6 | \n",
" 200.0 | \n",
" 85.0 | \n",
" 2587 | \n",
" 16.0 | \n",
" 70 | \n",
" usa | \n",
" ford maverick | \n",
"
\n",
" \n",
" 24 | \n",
" 21.0 | \n",
" 6 | \n",
" 199.0 | \n",
" 90.0 | \n",
" 2648 | \n",
" 15.0 | \n",
" 70 | \n",
" usa | \n",
" amc gremlin | \n",
"
\n",
" \n",
" 25 | \n",
" 10.0 | \n",
" 8 | \n",
" 360.0 | \n",
" 215.0 | \n",
" 4615 | \n",
" 14.0 | \n",
" 70 | \n",
" usa | \n",
" ford f250 | \n",
"
\n",
" \n",
" 26 | \n",
" 10.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 200.0 | \n",
" 4376 | \n",
" 15.0 | \n",
" 70 | \n",
" usa | \n",
" chevy c20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight acceleration \\\n",
"0 18.0 8 307.0 130.0 3504 12.0 \n",
"1 15.0 8 350.0 165.0 3693 11.5 \n",
"2 18.0 8 318.0 150.0 3436 11.0 \n",
"3 16.0 8 304.0 150.0 3433 12.0 \n",
"4 17.0 8 302.0 140.0 3449 10.5 \n",
"5 15.0 8 429.0 198.0 4341 10.0 \n",
"6 14.0 8 454.0 220.0 4354 9.0 \n",
"7 14.0 8 440.0 215.0 4312 8.5 \n",
"8 14.0 8 455.0 225.0 4425 10.0 \n",
"9 15.0 8 390.0 190.0 3850 8.5 \n",
"10 15.0 8 383.0 170.0 3563 10.0 \n",
"11 14.0 8 340.0 160.0 3609 8.0 \n",
"12 15.0 8 400.0 150.0 3761 9.5 \n",
"13 14.0 8 455.0 225.0 3086 10.0 \n",
"15 22.0 6 198.0 95.0 2833 15.5 \n",
"16 18.0 6 199.0 97.0 2774 15.5 \n",
"17 21.0 6 200.0 85.0 2587 16.0 \n",
"24 21.0 6 199.0 90.0 2648 15.0 \n",
"25 10.0 8 360.0 215.0 4615 14.0 \n",
"26 10.0 8 307.0 200.0 4376 15.0 \n",
"\n",
" model_year origin name \n",
"0 70 usa chevrolet chevelle malibu \n",
"1 70 usa buick skylark 320 \n",
"2 70 usa plymouth satellite \n",
"3 70 usa amc rebel sst \n",
"4 70 usa ford torino \n",
"5 70 usa ford galaxie 500 \n",
"6 70 usa chevrolet impala \n",
"7 70 usa plymouth fury iii \n",
"8 70 usa pontiac catalina \n",
"9 70 usa amc ambassador dpl \n",
"10 70 usa dodge challenger se \n",
"11 70 usa plymouth 'cuda 340 \n",
"12 70 usa chevrolet monte carlo \n",
"13 70 usa buick estate wagon (sw) \n",
"15 70 usa plymouth duster \n",
"16 70 usa amc hornet \n",
"17 70 usa ford maverick \n",
"24 70 usa amc gremlin \n",
"25 70 usa ford f250 \n",
"26 70 usa chevy c20 "
]
},
"execution_count": 4,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"usa.loc[grouped.groups[70]].head(20)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Remember `loc` uses **index values**, not **list positions**. Let's prove that a minute. First we'll change the index values."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model_year | \n",
" origin | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 100 | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" chevrolet chevelle malibu | \n",
"
\n",
" \n",
" 101 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 70 | \n",
" usa | \n",
" buick skylark 320 | \n",
"
\n",
" \n",
" 102 | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 70 | \n",
" usa | \n",
" plymouth satellite | \n",
"
\n",
" \n",
" 103 | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
" amc rebel sst | \n",
"
\n",
" \n",
" 104 | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 70 | \n",
" usa | \n",
" ford torino | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight acceleration \\\n",
"100 18.0 8 307.0 130.0 3504 12.0 \n",
"101 15.0 8 350.0 165.0 3693 11.5 \n",
"102 18.0 8 318.0 150.0 3436 11.0 \n",
"103 16.0 8 304.0 150.0 3433 12.0 \n",
"104 17.0 8 302.0 140.0 3449 10.5 \n",
"\n",
" model_year origin name \n",
"100 70 usa chevrolet chevelle malibu \n",
"101 70 usa buick skylark 320 \n",
"102 70 usa plymouth satellite \n",
"103 70 usa amc rebel sst \n",
"104 70 usa ford torino "
]
},
"execution_count": 5,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"new_index = pd.RangeIndex(100, len(usa)+100)\n",
"usa_with_new_index = usa.set_index(new_index)\n",
"usa_with_new_index.head()\n",
"#print(len(usa))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now re-compute the groups based on the `model_year` attribute."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{70: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121], 71: [122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141], 72: [142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159], 73: [160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188], 74: [189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203], 75: [204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223], 76: [224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245], 77: [246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263], 78: [264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285], 79: [286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308], 80: [309, 310, 311, 312, 313, 314, 315], 81: [316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328], 82: [329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348]}"
]
},
"execution_count": 6,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"new_index_grouped = usa_with_new_index.groupby('model_year')\n",
"new_index_grouped.groups"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"This hopefully proves that the values in the `groups` dictionary represent **index values**."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now let's look at grouping by **multiple attributes**; in particular, we'll group by `model_year` followed by `cylinders`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{(70, 6): [15, 16, 17, 24], (70, 8): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25, 26, 27, 28], (71, 4): [30, 32, 46, 49, 56], (71, 6): [33, 34, 35, 36, 37, 45, 47, 48], (71, 8): [38, 39, 40, 41, 42, 43, 44], (72, 4): [58, 60, 61, 80, 83], (72, 8): [62, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75], (73, 4): [109, 112], (73, 6): [97, 98, 99, 100, 101, 107, 113], (73, 8): [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 103, 104, 105, 106, 115, 116, 121, 124], (74, 4): [130, 132, 146], (74, 6): [125, 126, 127, 128, 133, 134, 135], (74, 8): [136, 137, 138, 139, 140], (75, 4): [168, 170], (75, 6): [152, 153, 154, 155, 160, 161, 162, 163, 164, 169, 174, 176], (75, 8): [156, 157, 158, 159, 165, 166], (76, 4): [184, 185, 195, 196, 206], (76, 6): [191, 192, 193, 194, 199, 200, 201, 202], (76, 8): [187, 188, 189, 190, 208, 212, 213, 214, 215], (77, 4): [217, 219, 234, 236, 237, 238], (77, 6): [225, 226, 227, 228], (77, 8): [221, 222, 223, 224, 229, 230, 231, 232], (78, 4): [245, 255, 266, 269, 271, 272], (78, 6): [252, 253, 254, 256, 257, 258, 259, 260, 261, 263], (78, 8): [249, 250, 251, 262, 264, 265], (79, 4): [282, 295, 296, 301, 302, 305, 308], (79, 6): [280, 281, 283, 284, 306, 307], (79, 8): [285, 286, 287, 288, 289, 290, 291, 292, 298, 300], (80, 4): [311, 313, 314, 315, 323, 336], (80, 6): [316], (81, 4): [338, 339, 340, 342, 344, 350, 351, 352], (81, 6): [341, 363, 365, 366], (81, 8): [364], (82, 4): [367, 368, 369, 370, 371, 372, 373, 374, 378, 379, 388, 391, 392, 393, 395, 396, 397], (82, 6): [386, 387, 389]}"
]
},
"execution_count": 7,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"grouped = usa.groupby(['model_year', 'cylinders'])\n",
"grouped.groups"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model_year | \n",
" origin | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 15 | \n",
" 22.0 | \n",
" 6 | \n",
" 198.0 | \n",
" 95.0 | \n",
" 2833 | \n",
" 15.5 | \n",
" 70 | \n",
" usa | \n",
" plymouth duster | \n",
"
\n",
" \n",
" 16 | \n",
" 18.0 | \n",
" 6 | \n",
" 199.0 | \n",
" 97.0 | \n",
" 2774 | \n",
" 15.5 | \n",
" 70 | \n",
" usa | \n",
" amc hornet | \n",
"
\n",
" \n",
" 17 | \n",
" 21.0 | \n",
" 6 | \n",
" 200.0 | \n",
" 85.0 | \n",
" 2587 | \n",
" 16.0 | \n",
" 70 | \n",
" usa | \n",
" ford maverick | \n",
"
\n",
" \n",
" 24 | \n",
" 21.0 | \n",
" 6 | \n",
" 199.0 | \n",
" 90.0 | \n",
" 2648 | \n",
" 15.0 | \n",
" 70 | \n",
" usa | \n",
" amc gremlin | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight acceleration \\\n",
"15 22.0 6 198.0 95.0 2833 15.5 \n",
"16 18.0 6 199.0 97.0 2774 15.5 \n",
"17 21.0 6 200.0 85.0 2587 16.0 \n",
"24 21.0 6 199.0 90.0 2648 15.0 \n",
"\n",
" model_year origin name \n",
"15 70 usa plymouth duster \n",
"16 70 usa amc hornet \n",
"17 70 usa ford maverick \n",
"24 70 usa amc gremlin "
]
},
"execution_count": 8,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"usa.loc[grouped.groups[(70,6)]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"And we can get the average mpg for that specific set of vehicles by finding them using `loc`, and then using the `Series` method `mean` to compute the average"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"mpg 20.50\n",
"horsepower 91.75\n",
"dtype: float64"
]
},
"execution_count": 9,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"usa.loc[grouped.groups[(70,6)],[\"mpg\", \"horsepower\"]].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"We can get the average `mpg` for **every** `(model_year, cylinders)` combination using the `DataFrameGroupBy` object's `mean` method."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" mpg | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
"
\n",
" \n",
" model_year | \n",
" cylinders | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 70 | \n",
" 6 | \n",
" 20.500000 | \n",
" 199.000000 | \n",
" 91.750000 | \n",
" 2710.500000 | \n",
" 15.500000 | \n",
"
\n",
" \n",
" 8 | \n",
" 14.111111 | \n",
" 367.555556 | \n",
" 183.666667 | \n",
" 3940.055556 | \n",
" 11.194444 | \n",
"
\n",
" \n",
" 71 | \n",
" 4 | \n",
" 24.800000 | \n",
" 118.200000 | \n",
" 79.500000 | \n",
" 2178.600000 | \n",
" 17.600000 | \n",
"
\n",
" \n",
" 6 | \n",
" 18.000000 | \n",
" 243.375000 | \n",
" 98.875000 | \n",
" 3171.875000 | \n",
" 14.750000 | \n",
"
\n",
" \n",
" 8 | \n",
" 13.428571 | \n",
" 371.714286 | \n",
" 166.857143 | \n",
" 4537.714286 | \n",
" 12.214286 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg displacement horsepower weight \\\n",
"model_year cylinders \n",
"70 6 20.500000 199.000000 91.750000 2710.500000 \n",
" 8 14.111111 367.555556 183.666667 3940.055556 \n",
"71 4 24.800000 118.200000 79.500000 2178.600000 \n",
" 6 18.000000 243.375000 98.875000 3171.875000 \n",
" 8 13.428571 371.714286 166.857143 4537.714286 \n",
"\n",
" acceleration \n",
"model_year cylinders \n",
"70 6 15.500000 \n",
" 8 11.194444 \n",
"71 4 17.600000 \n",
" 6 14.750000 \n",
" 8 12.214286 "
]
},
"execution_count": 10,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"means = grouped.mean()\n",
"means.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The index of the `DateFrame` that comes back from `DataFrameGroupBy.mean` is a **multi-level** index."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"MultiIndex([(70, 6),\n",
" (70, 8),\n",
" (71, 4),\n",
" (71, 6),\n",
" (71, 8),\n",
" (72, 4),\n",
" (72, 8),\n",
" (73, 4),\n",
" (73, 6),\n",
" (73, 8),\n",
" (74, 4),\n",
" (74, 6),\n",
" (74, 8),\n",
" (75, 4),\n",
" (75, 6),\n",
" (75, 8),\n",
" (76, 4),\n",
" (76, 6),\n",
" (76, 8),\n",
" (77, 4),\n",
" (77, 6),\n",
" (77, 8),\n",
" (78, 4),\n",
" (78, 6),\n",
" (78, 8),\n",
" (79, 4),\n",
" (79, 6),\n",
" (79, 8),\n",
" (80, 4),\n",
" (80, 6),\n",
" (81, 4),\n",
" (81, 6),\n",
" (81, 8),\n",
" (82, 4),\n",
" (82, 6)],\n",
" names=['model_year', 'cylinders'])"
]
},
"execution_count": 11,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"means.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Here's an example from our book that illustrates an important idea when passing a `Series` object as the field to group by. Consider the following `DataFrame` named `df`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" key1 | \n",
" key2 | \n",
" data1 | \n",
" data2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" a | \n",
" one | \n",
" 1.628402 | \n",
" 0.630337 | \n",
"
\n",
" \n",
" 1 | \n",
" a | \n",
" two | \n",
" -1.107394 | \n",
" 0.022734 | \n",
"
\n",
" \n",
" 2 | \n",
" b | \n",
" one | \n",
" 0.475426 | \n",
" -0.792190 | \n",
"
\n",
" \n",
" 3 | \n",
" b | \n",
" two | \n",
" -1.547331 | \n",
" -1.789632 | \n",
"
\n",
" \n",
" 4 | \n",
" a | \n",
" one | \n",
" 1.078556 | \n",
" -1.490519 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" key1 key2 data1 data2\n",
"0 a one 1.628402 0.630337\n",
"1 a two -1.107394 0.022734\n",
"2 b one 0.475426 -0.792190\n",
"3 b two -1.547331 -1.789632\n",
"4 a one 1.078556 -1.490519"
]
},
"execution_count": 12,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({\n",
" 'key1': ['a','a', 'b', 'b', 'a'],\n",
" 'key2': ['one', 'two', 'one', 'two', 'one'],\n",
" 'data1': np.random.randn(5),\n",
" 'data2': np.random.randn(5)\n",
"});\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The book uses the following code to compute the average of the `data1` column, grouped by the values in the `key1` column."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"key1\n",
"a 0.533188\n",
"b -0.535952\n",
"Name: data1, dtype: float64"
]
},
"execution_count": 13,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df['data1'].groupby(df['key1']).mean()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1.628402\n",
"1 -1.107394\n",
"2 0.475426\n",
"3 -1.547331\n",
"4 1.078556\n",
"Name: data1, dtype: float64\n",
"0 a\n",
"1 a\n",
"2 b\n",
"3 b\n",
"4 a\n",
"Name: key1, dtype: object\n"
]
}
],
"source": [
"print((df['data1']))\n",
"print(df['key1'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Another way to write this is as follows:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"key1\n",
"a 0.533188\n",
"b -0.535952\n",
"Name: data1, dtype: float64"
]
},
"execution_count": 15,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(df['key1']).mean()['data1']"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Look carefully at the first example again, particularly focusing on what's being **grouped**:\n",
"```\n",
"df['data1'].groupby(df['key1']).mean()\n",
"```\n",
"\n",
"Does this raise any questions in your head?"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Consider the code in the following 2 cells."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" key1 key2 data1 data2\n",
"0 a one 1.628402 0.630337\n",
"1 a two -1.107394 0.022734\n",
"2 b one 0.475426 -0.792190\n",
"3 b two -1.547331 -1.789632\n",
"4 a one 1.078556 -1.490519\n",
"\n",
"5 a\n",
"0 b\n",
"7 b\n",
"8 a\n",
"9 a\n",
"dtype: object\n"
]
}
],
"source": [
"key1 = pd.Series(data=['a', 'b', 'b', 'a', 'a'], index=[5, 0, 7, 8, 9])\n",
"print(df.head())\n",
"print()\n",
"print(key1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'b': [0]}"
]
},
"execution_count": 17,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"grouped_by_key1 = df['data1'].groupby(key1)\n",
"grouped_by_key1.groups"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"What gives? The reason the original code works is because the **index values** are the same.\n",
"```\n",
"df['data1'].groupby(df['key1']).mean()\n",
"```\n",
"\n",
"In the version above:\n",
"```\n",
"key1 = pd.Series(data=df['data1'].values, index=[5, 6, 7, 8, 9])\n",
"grouped_by_key1 = df['data1'].groupby(key1)\n",
"```\n",
"there's no match in `key1` from the index values in `df['data1']`, so we don't get any groups."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now let's look at grouping by a **function**. The function will be passed values of the index, so we'll transform our `mpg` dataset by using the `name` column values as the index."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model_year | \n",
" origin | \n",
"
\n",
" \n",
" name | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" chevrolet chevelle malibu | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
"
\n",
" \n",
" buick skylark 320 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 70 | \n",
" usa | \n",
"
\n",
" \n",
" plymouth satellite | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 70 | \n",
" usa | \n",
"
\n",
" \n",
" amc rebel sst | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 70 | \n",
" usa | \n",
"
\n",
" \n",
" ford torino | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 70 | \n",
" usa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight \\\n",
"name \n",
"chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 \n",
"buick skylark 320 15.0 8 350.0 165.0 3693 \n",
"plymouth satellite 18.0 8 318.0 150.0 3436 \n",
"amc rebel sst 16.0 8 304.0 150.0 3433 \n",
"ford torino 17.0 8 302.0 140.0 3449 \n",
"\n",
" acceleration model_year origin \n",
"name \n",
"chevrolet chevelle malibu 12.0 70 usa \n",
"buick skylark 320 11.5 70 usa \n",
"plymouth satellite 11.0 70 usa \n",
"amc rebel sst 12.0 70 usa \n",
"ford torino 10.5 70 usa "
]
},
"execution_count": 18,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"usa_model_index = usa.set_index('name')\n",
"usa_model_index.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now we can group by just the vehicle's **make** by creating a function that extracts the make from each index value."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevroelt', 'chevrolet', 'chevy', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])"
]
},
"execution_count": 19,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"def get_make(name):\n",
" stripper = name.strip()\n",
" parts = stripper.split()\n",
" return parts[0]\n",
"\n",
"grouped_by_make = usa_model_index.groupby(lambda name: name.strip().split()[0])\n",
"grouped_by_make.groups.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"We see some problems here: \n",
"\n",
"* **chevroelt** should be **chevrolet**\n",
"* **chevy** is an abbreviation for **chevrolet**. \n",
"\n",
"A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat **chevrolet**, **chevroelt**, and **chevy** the same."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"def get_make(name):\n",
" stripper = name.strip()\n",
" parts = stripper.split()\n",
" if parts[0] in [\"chevy\", \"chevrolet\", \"chevroelt\"]:\n",
" return \"chevrolet\"\n",
" else:\n",
" return parts[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Let's pass the `get_make` function as the argument to `groupby` and look at the resulting groups."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['amc', 'buick', 'cadillac', 'capri', 'chevrolet', 'chrysler', 'dodge', 'ford', 'hi', 'mercury', 'oldsmobile', 'plymouth', 'pontiac'])"
]
},
"execution_count": 21,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"grouped_by_make = usa_model_index.groupby(get_make)\n",
"grouped_by_make.groups.keys()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now find the average `mpg` for each make based on the keys returned by the `get_make` function."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\n",
"amc 18.246429\n",
"buick 19.182353\n",
"cadillac 19.750000\n",
"capri 25.000000\n",
"chevrolet 20.219149\n",
"chrysler 17.266667\n",
"dodge 22.060714\n",
"ford 19.694118\n",
"hi 9.000000\n",
"mercury 19.118182\n",
"oldsmobile 21.100000\n",
"plymouth 21.703226\n",
"pontiac 20.012500\n",
"Name: mpg, dtype: float64\n",
"name\n",
"amc 18.246429\n",
"buick 19.182353\n",
"cadillac 19.750000\n",
"capri 25.000000\n",
"chevrolet 20.219149\n",
"chrysler 17.266667\n",
"dodge 22.060714\n",
"ford 19.694118\n",
"hi 9.000000\n",
"mercury 19.118182\n",
"oldsmobile 21.100000\n",
"plymouth 21.703226\n",
"pontiac 20.012500\n",
"Name: mpg, dtype: float64\n"
]
}
],
"source": [
"print(grouped_by_make.mean()['mpg'])\n",
"print(grouped_by_make['mpg'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Get the average miles per gallon for the 3 makes that appear the **most often** in our data set. \n",
"\n",
"To get started, sort the result of calling the `size` aggregate function on `grouped_by_make` in descending order, storing the result in a variable named `most_frequently_occurring`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['ford', 'chevrolet', 'plymouth'], dtype='object', name='name')"
]
},
"execution_count": 31,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"size_by_make = grouped_by_make.size()\n",
"most_frequently_occurring = size_by_make.sort_values(ascending=False)\n",
"makes_of_most_frequently_occurring = most_frequently_occurring.index[0:3]\n",
"makes_of_most_frequently_occurring"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now use the index values from `most_frequently_occurring` to select elements from `mean_mpg_by_make`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"name\n",
"ford 19.694118\n",
"chevrolet 20.219149\n",
"plymouth 21.703226\n",
"Name: mpg, dtype: float64"
]
},
"execution_count": 32,
"metadata": {
},
"output_type": "execute_result"
}
],
"source": [
"mean_mpg_by_make = grouped_by_make.mean()['mpg']\n",
"mean_mpg_by_make.loc[makes_of_most_frequently_occurring]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"We can also create a new `DataFrame` from the `Series` objects `mean_mpg_by_make` and `size_by_make`, sort it using `sort_values`, and get the first 3 elements using `iloc`."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"summary = pd.DataFrame(\n",
" data = {\n",
" \"Avg. MPG\": mean_mpg_by_make,\n",
" \"Num. Vehicles\": size_by_make\n",
" }\n",
")\n",
"summary.sort_values('Num. Vehicles', ascending=False).iloc[0:3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Finally, look at summarizing by `Make` and `Year`. Get started by creating a multi-level index on those 2 columns."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
"usa_model_year_idx = usa.set_index(['name', 'model_year'])\n",
"usa_model_year_idx.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Now we can group by a combination of the car's **make** and year"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"And find the average miles per gallon for all vehicles manufactured by **Ford** in 1971."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"collapsed": false
},
"outputs": [
],
"source": [
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (system-wide)",
"language": "python",
"metadata": {
"cocalc": {
"description": "Python 3 programming language",
"priority": 100,
"url": "https://www.python.org/"
}
},
"name": "python3",
"resource_dir": "/ext/jupyter/kernels/python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}