Using groupby to summarize information
Import some data from the Seaborn sample data sets, and do some exploration.
Let's work with only cars made in the USA.
Look at what possible values are in the
origincolumn.Create a new
DataFramenamedusawith only those rows inmpgfor which the value of theorigincolumn is usa.Verify there is only 1 value in the
origincolumn in `usa.
Now let's specify that we want to create some groupings based on the model_year attribute and see what is contained in the groups property of the DataFrameGroupBy generated by calling groupby.
Let's see what the values associated with the key 70 in grouped.groups are.
Remember loc uses index values, not list positions. Let's prove that a minute. First we'll change the index values.
Now re-compute the groups based on the model_year attribute.
This hopefully proves that the values in the groups dictionary represent index values.
Now let's look at grouping by multiple attributes; in particular, we'll group by model_year followed by cylinders.
Now we could get the members of the 6-cylinder vehicles manufactured in 1970 like this
And we can get the average mpg for that specific set of vehicles by finding them using loc, and then using the Series method mean to compute the average
We can get the average mpg for every (model_year, cylinders) combination using the DataFrameGroupBy object's mean method.
The index of the DateFrame that comes back from DataFrameGroupBy.mean is a multi-level index.
Here's an example from our book that illustrates an important idea when passing a Series object as the field to group by. Consider the following DataFrame named df.
The book uses the following code to compute the average of the data1 column, grouped by the values in the key1 column.
Another way to write this is as follows:
Look carefully at the first example again, particularly focusing on what's being grouped:
Does this raise any questions in your head?
Consider the code in the following 2 cells.
What gives? The reason the original code works is because the index values are the same.
In the version above:
there's no match in key1 from the index values in df['data1'], so we don't get any groups.
Now let's look at grouping by a function. The function will be passed values of the index, so we'll transform our mpg dataset by using the name column values as the index.
Now we can group by just the vehicle's make by creating a function that extracts the make from each index value.
We see some problems here:
chevroelt should be chevrolet
chevy is an abbreviation for chevrolet.
A lambda function can only have one expression within it. We'll write an actual function instead to allow us to treat chevrolet, chevroelt, and chevy the same.
Let's pass the get_make function as the argument to groupby and look at the resulting groups.
Now find the average mpg for each make based on the keys returned by the get_make function.
Get the average miles per gallon for the 3 makes that appear the most often in our data set.
To get started, sort the result of calling the size aggregate function on grouped_by_make in descending order, storing the result in a variable named most_frequently_occurring.
Now use the index values from most_frequently_occurring to select elements from mean_mpg_by_make.
We can also create a new DataFrame from the Series objects mean_mpg_by_make and size_by_make, sort it using sort_values, and get the first 3 elements using iloc.
Finally, look at summarizing by Make and Year. Get started by creating a multi-level index on those 2 columns.
Now we can group by a combination of the car's make and year
And find the average miles per gallon for all vehicles manufactured by Ford in 1971.