GitHub Repository: ethen8181/machine-learning
Path: blob/master/python/pivot_table/pivot_table.ipynb
²⁵⁷⁷ views

Kernel: Python 3

In [1]:

# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(plot_style = False)

Out[1]:

In [2]:

os.chdir(path)
import numpy as np
import pandas as pd

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

%watermark -a 'Ethen' -d -t -v -p numpy,pandas

Out[2]:

Ethen 2017-07-12 15:42:46 

CPython 3.5.2
IPython 5.4.1

numpy 1.13.1
pandas 0.20.2

Pandas's Pivot Table

Following the tutorial from the following link. Blog: Pandas pivot table explained.

The General rule of thumb is that once you use multiple grouby you should evaluate whether a pivot table is a useful approach.

One of the challenges with using the panda’s pivot_table is making sure you understand your data and what questions you are trying to answer with the pivot table. It is a seemingly simple function but can produce very powerful analysis very quickly. In this scenario, we'll be tracking a sales pipeline (also called funnel). The basic problem is that some sales cycles are very long (e.g. enterprise software, capital equipment, etc.) and the managemer wants to understand it in more detail throughout the year. Typical questions include:

How much revenue is in the pipeline?
What products are in the pipeline?
Who has what products at what stage?
How likely are we to close deals by year end?

Many companies will have CRM tools or other software that sales uses to track the process, while they may be useful tools for analyzing the data, inevitably someone will export the data to Excel and use a PivotTable to summarize the data. Using a panda’s pivot table can be a good alternative because it is:

Quicker (once it is set up)
Self documenting (look at the code and you know what it does)
Easy to use to generate a report or email
More flexible because you can define custome aggregation functions

In [3]:

df = pd.read_excel('sales-funnel.xlsx')
df.head()

Out[3]:

Pivot the Data

As we build up the pivot table, it's probably easiest to take one step at a time. Add items and check each step to verify you are getting the results you expect.

The simplest pivot table must have a dataframe and an index, which stands for the column that the data will be aggregated upon and values, which are the aggregated value.

In [4]:

df.pivot_table(index = ['Manager', 'Rep'], values = ['Price'])

Out[4]:

By default, the values will be averaged, but we can do a count or a sum by providing the aggfun parameter.

In [5]:

# you can provide multiple arguments to almost every argument of the pivot_table function
df.pivot_table(index = ['Manager', 'Rep'], values = ['Price'], aggfunc = [np.mean, len])

Out[5]:

If we want to see sales broken down by the products, the columns variable allows us to define one or more columns. Note: The confusing points with the pivot_table is the use of columns and values. Columns are optional - they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you've listed.

In [6]:

df.pivot_table(index = ['Manager','Rep'], values = ['Price'],
               columns = ['Product'], aggfunc = [np.sum])

Out[6]:

The NaNs are a bit distracting. If we want to remove them, we could use fill_value to set them to 0.

In [7]:

df.pivot_table(index = ['Manager', 'Rep'], values = ['Price', 'Quantity'],
               columns = ['Product'], aggfunc = [np.sum], fill_value = 0)

Out[7]:

You can move items to the index to get a different visual representation. The following code chunk removes Product from the columns and add it to the index and also uses the margins = True parameter to add totals to the pivot table.

In [8]:

df.pivot_table(index = ['Manager', 'Rep', 'Product'],
               values = ['Price', 'Quantity'], aggfunc = [np.sum], margins = True)

Out[8]:

We can define the status column as a category and set the order we want in the pivot table.

In [9]:

df['Status'] = df['Status'].astype('category')
df['Status'] = df['Status'].cat.set_categories(['won', 'pending', 'presented', 'declined'])
df.pivot_table(index = ['Manager', 'Status'], values = ['Price'],
               aggfunc = [np.sum], fill_value = 0, margins = True)

Out[9]:

A really handy feature is the ability to pass a dictionary to the aggfunc so you can perform different functions on each of the values you select. This has a side-effect of making the labels a little cleaner.

In [10]:

table = df.pivot_table(index = ['Manager','Status'], 
                       columns = ['Product'], 
                       values = ['Quantity','Price'],
                       aggfunc = {'Quantity': len, 'Price': [np.sum, np.mean]}, 
                       fill_value = 0)
table

Out[10]:

Once you have generated your data, it is in a DataFrame so you can filter on it using your standard DataFrame functions. e.g. We can look at all of our pending and won deals.

In [11]:

# .query uses strings for boolean indexing and we don't have to 
# specify the dataframe that the Status is comming from
table.query("Status == ['pending','won']")

Out[11]:

Reference

Blog: Pandas pivot table explained

Table of Contents

Pandas's Pivot Table

Pivot the Data

Reference

Product

Resources

Company