Path: blob/master/april_18/lessons/lesson-15/code/starter-code/starter-code-15.ipynb
1905 views
Exploring Rossmann Drug Store Sales Data
Because we are most interested in the Date
column that contains the date of sales for each store, we will make sure to process that as a DateTime
type, and make that the index of our dataframe.
Data Exploration and MINEing
To compare sales on holidays, we can compare the sales using box-plots, which allows us to compare the distribution of sales on holidays against all other days. On state holidays, the store is closed (and as a nice sanity check there are 0 sales), and on school holidays the sales are relatively similar.
Check: See if there is a difference affecting sales on promotion days.
Lastly, we want to identify larger-scale trends in our data. How did sales change from 2014 to 2015? Were there any particularly interesting outliers in terms of sales or customer visits?
In pandas, we can compute rolling average using the pd.rolling_mean
or pd.rolling_median
functions.
Data REFINING Using Time Series Statistics
Autocorrelation
To measure how much the sales are correlated with each other, we want to compute the autocorrelation of the 'Sales' column. In pandas, we do this we with the autocorr
function:
autocorr
takes one argument, the lag
which is how many prio data points should be used to compute the correlation. If we set the lag
to 1, we compute the correlation between every point and the point directly preceding it. Setting lag
to 10 computes the correlation between every point and the point 10 days earlier.
Rolling Averages
If we want to investigate trends over time in sales, as always, we will start by computing simple aggregates. What were the mean and median sales for each month and year?
In Pandas, this is performed using the resample
command, which is very similar to the groupby
command. It allows us to group over different time intervals.
We can use data.resample
and provide as arguments: - The level on which to roll-up to, 'D' for day, 'W' for week, 'M' for month, 'A' for year - What aggregation to perform: 'mean', 'median', 'sum', etc.
While identifying the monthly averages is useful, we often want to compare the sales data of a date to a smaller window. To understand holidays sales, we don't want to compare late December with the entire month, but perhaps with a few days surrounding it. We can do this using rolling averages.
In pandas, we can compute rolling average using the pd.rolling_mean
or pd.rolling_median
functions.
rolling_mean
(as well as rolling_median
) takes these important parameters: - the first is the series to aggregate - window
is the number of days to include in the average - center
is whether the window should be centered on the date or use data prior to that date - freq
is on what level to roll-up the averages to (as used in resample
). Either D
for day, M
for month or A
for year, etc.
Instead of plotting the full timeseries, we can plot the rolling mean instead, which smooths random changes in sales as well as removing outliers, helping us identify larger trends.
Pandas Window functions
Pandas rolling_mean
and rolling_median
are only two examples of Pandas window function capabilities. Window functions are operate on a set of N consecutive rows (a window) and produce an output.
In addition to rolling_mean
and rolling_median
, there are rolling_sum
, rolling_min
, rolling_max
... and many more.
Another common one is diff
, which takes the difference over time. pd.diff
takes one arugment, periods
, which is how many rows prior to use for the difference.
Pandas expanding functions
In addition to the set of rolling_*
functions, Pandas also provides a similar collection of expanding_*
functions, which, instead of using a window of N values, use all values up until that time.
Exercises
Plot the distribution of sales by month and compare the effect of promotions
Are sales more correlated with the prior date, a similar date last year, or a similar date last month?
Plot the 15 day rolling mean of customers in the stores
Identify the date with largest drop in sales from the same date in the previous month
Compute the total sales up until Dec. 2014
When were the largest differences between 15-day moving/rolling averages? HINT: Using
rolling_mean
anddiff