YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-16/code/solution-code/solution-code-16.ipynb
Kernel: Python 2
import pandas as pd
import numpy as np
%matplotlib inline

Walmart Sales Data

For the independent practice, we will analyze weekly sales data from Walmart over a two-year period from 2010 to 2012.

The data is again separated by store and by department, but we will focus on analyzing one store for simplicity.

The data includes:

  • Store - the store number

  • Dept - the department number

  • Date - the week

  • Weekly_Sales - sales for the given department in the given store

  • IsHoliday - whether the week is a special holiday week
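Before loading the real CSV, a toy frame with the same five columns can make the expected shape concrete. The values below are invented for illustration and are not from the Walmart dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset described above; the sales figures are made up
toy = pd.DataFrame({
    'Store': [1, 1, 1],
    'Dept': [1, 1, 1],
    'Date': ['2010-02-05', '2010-02-12', '2010-02-19'],
    'Weekly_Sales': [24924.50, 46039.49, 41595.55],
    'IsHoliday': [False, True, False],
})
toy['Date'] = pd.to_datetime(toy['Date'])
toy = toy.set_index('Date')  # a DatetimeIndex enables resampling and date-based slicing
print(toy.dtypes)
```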

Loading the data and setting the DateTimeIndex

data = pd.read_csv('../../assets/dataset/train.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)
data.head()

Filter the dataframe to Store 1 sales and aggregate over departments to compute the total sales per store.

# Filter to Store 1 and sum over departments within each week
store1_sales = data[data.Store == 1][['Weekly_Sales']].resample('W').sum()
store1_sales.head()
# pd.rolling_mean was removed in pandas 0.18; use the rolling() accessor instead
store1_sales['Weekly_Sales'].rolling(3).mean().plot()
Image in a Jupyter notebook
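The rolling mean above just averages each window of three consecutive weeks; a quick sketch on toy numbers shows the mechanics:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])
rm = s.rolling(3).mean()
# The first two entries are NaN (incomplete windows);
# then (10 + 20 + 30) / 3 = 20.0 and (20 + 30 + 40) / 3 = 30.0
print(rm.tolist())
```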

Compute the lag-1, lag-3, and lag-52 autocorrelations for Weekly_Sales and/or create an autocorrelation plot.

print('Autocorrelation 1: ', store1_sales['Weekly_Sales'].autocorr(1))
print('Autocorrelation 3: ', store1_sales['Weekly_Sales'].autocorr(3))
print('Autocorrelation 52: ', store1_sales['Weekly_Sales'].autocorr(52))
Autocorrelation 1:  0.30215827941131324
Autocorrelation 3:  0.059799235066717457
Autocorrelation 52:  0.89537602947770079
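Series.autocorr(k) is just the Pearson correlation of the series with a copy of itself shifted by k periods. A sketch on a synthetic series with an exact 52-period cycle (invented data, not the Walmart series) makes the lag-52 spike above easy to see:

```python
import numpy as np
import pandas as pd

# Synthetic series with an exact 52-period cycle
s = pd.Series(np.sin(np.arange(104) * 2 * np.pi / 52))

lag = 52
manual = s.corr(s.shift(lag))   # Pearson correlation with the lagged copy
builtin = s.autocorr(lag)

print(manual, builtin)  # both close to 1.0 for a full-cycle lag
```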
# pandas.tools.plotting was removed in pandas 0.23; the function now lives in pandas.plotting
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(store1_sales['Weekly_Sales'])
Image in a Jupyter notebook
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(store1_sales['Weekly_Sales'], lags=30)
# Components 1 and 2 seem particularly useful for autoregression, perhaps up to 4
# In the plot above, notice the spike at around 52 - implying a yearly pattern as well
# No random spikes, so probably not much use for a moving average model
Image in a Jupyter notebook

Split the weekly sales data into a training and a test set, using 75% of the data for training.

n = len(store1_sales.Weekly_Sales)
train = store1_sales.Weekly_Sales[:int(.75*n)]
test = store1_sales.Weekly_Sales[int(.75*n):]

Create an AR(1) model on the training data and compute the mean absolute error of the predictions.

import statsmodels.api as sm
from sklearn.metrics import mean_absolute_error
model = sm.tsa.ARIMA(train, (1, 0, 0)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  81839.338629691949

Plot the residuals - where are there significant errors?

model.resid.plot()
Image in a Jupyter notebook
plot_acf(model.resid, lags=30)
Image in a Jupyter notebook

Compute an AR(2) model and an ARMA(2, 2) model - does this improve your mean absolute error on the held-out set?

model = sm.tsa.ARIMA(train, (2, 0, 0)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  81203.240909485947
model = sm.tsa.ARIMA(train, (2, 0, 2)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=True,
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  80502.745386798299

Finally, compute an ARIMA model to improve your prediction error - iterate on the p, d, and q parameters, comparing the models' performance.

model = sm.tsa.ARIMA(train, (2, 1, 3)).fit()
predictions = model.predict(
    '2012-02-27', '2012-10-29',
    dynamic=False,
    typ='levels'
)
print("Mean absolute error: ", mean_absolute_error(test, predictions))
model.summary()
Mean absolute error:  77789.494825392394
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals