YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-16/code/solution-code/solution-code-16.ipynb
Kernel: Python 2
import pandas as pd
import numpy as np
%matplotlib inline

Walmart Sales Data

For the independent practice, we will analyze weekly sales data from Walmart over a two-year period from 2010 to 2012.

The data is again separated by store and by department, but we will focus on analyzing one store for simplicity.

The data includes:

  • Store - the store number

  • Dept - the department number

  • Date - the week

  • Weekly_Sales - sales for the given department in the given store

  • IsHoliday - whether the week is a special holiday week
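Before loading the real CSV, a toy frame with the same five columns can make the expected shape concrete. The values below are invented for illustration and are not from the Walmart dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset described above; the sales figures are made up
toy = pd.DataFrame({
    'Store': [1, 1, 1],
    'Dept': [1, 1, 1],
    'Date': ['2010-02-05', '2010-02-12', '2010-02-19'],
    'Weekly_Sales': [24924.50, 46039.49, 41595.55],
    'IsHoliday': [False, True, False],
})
toy['Date'] = pd.to_datetime(toy['Date'])
toy = toy.set_index('Date')  # a DatetimeIndex enables resampling and date-based slicing
print(toy.dtypes)
```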

Loading the data and setting the DateTimeIndex

data = pd.read_csv('../../assets/dataset/train.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)
data.head()

Filter the dataframe to Store 1 sales and aggregate over departments to compute the total sales per store.

# Filter to Store 1 and sum over departments within each week
store1_sales = data[data.Store == 1][['Weekly_Sales']].resample('W').sum()
store1_sales.head()
# pd.rolling_mean was removed in pandas 0.18; use the rolling() accessor instead
store1_sales['Weekly_Sales'].rolling(3).mean().plot()
Image in a Jupyter notebook
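The rolling mean above just averages each window of three consecutive weeks; a quick sketch on toy numbers shows the mechanics:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])
rm = s.rolling(3).mean()
# The first two entries are NaN (incomplete windows);
# then (10 + 20 + 30) / 3 = 20.0 and (20 + 30 + 40) / 3 = 30.0
print(rm.tolist())
```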

Compute the lag-1, lag-3, and lag-52 autocorrelations for Weekly_Sales and/or create an autocorrelation plot.

print('Autocorrelation 1: ', store1_sales['Weekly_Sales'].autocorr(1))
print('Autocorrelation 3: ', store1_sales['Weekly_Sales'].autocorr(3))
print('Autocorrelation 52: ', store1_sales['Weekly_Sales'].autocorr(52))
Autocorrelation 1:  0.30215827941131324
Autocorrelation 3:  0.059799235066717457
Autocorrelation 52:  0.89537602947770079
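Series.autocorr(k) is just the Pearson correlation of the series with a copy of itself shifted by k periods. A sketch on a synthetic series with an exact 52-period cycle (invented data, not the Walmart series) makes the lag-52 spike above easy to see:

```python
import numpy as np
import pandas as pd

# Synthetic series with an exact 52-period cycle
s = pd.Series(np.sin(np.arange(104) * 2 * np.pi / 52))

lag = 52
manual = s.corr(s.shift(lag))   # Pearson correlation with the lagged copy
builtin = s.autocorr(lag)

print(manual, builtin)  # both close to 1.0 for a full-cycle lag
```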
# pandas.tools.plotting was removed in pandas 0.23; the function now lives in pandas.plotting
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(store1_sales['Weekly_Sales'])
Image in a Jupyter notebook
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(store1_sales['Weekly_Sales'], lags=30)
# Components 1 and 2 seem particularly useful for autoregression, perhaps up to 4
# In the plot above, notice the spike at around 52 - implying a yearly pattern as well
# No random spikes, so probably not much use for a moving average model
Image in a Jupyter notebook

Split the weekly sales data into a training and a test set, using 75% of the data for training.

n = len(store1_sales.Weekly_Sales)
train = store1_sales.Weekly_Sales[:int(.75*n)]
test = store1_sales.Weekly_Sales[int(.75*n):]

Create an AR(1) model on the training data and compute the mean absolute error of the predictions.

import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error
model = sm.tsa.ARIMA(train, (1, 0, 0)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  81839.338629691949

Plot the residuals - where are there significant errors?

model.resid.plot()
Image in a Jupyter notebook
plot_acf(model.resid, lags=30)
Image in a Jupyter notebook

Compute an AR(2) model and an ARMA(2, 2) model - does this improve your mean absolute error on the held-out set?

model = sm.tsa.ARIMA(train, (2, 0, 0)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  81203.240909485947
model = sm.tsa.ARIMA(train, (2, 0, 2)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  80502.745386798299

Finally, compute an ARIMA model to improve your prediction error - iterate on the p, d, and q parameters, comparing the models' performance.

model = sm.tsa.ARIMA(train, (2, 1, 3)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=False,
    typ='levels'
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  77789.494825392394
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals