GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-16/L16-Demo.ipynb
Kernel: Python [Root]
%matplotlib inline
import matplotlib.pyplot as plt

To explore time series models, we will continue with the Rossmann sales data. This dataset contains sales at every Rossmann store over a 3-year period, along with holiday indicators and basic store information.

In the last class, we plotted the sales data at a particular store to see how sales changed over time. We also computed the autocorrelation of the data at varying lag periods. This helps us identify whether previous timepoints are predictive of future data, and which time points matter most: the previous day? week? month?

import pandas as pd

# Load the data and set the DateTime index
data = pd.read_csv('../../lessons/lesson-15/assets/dataset/rossmann.csv', skipinitialspace=True)
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

# Filter to Store 1
store1_data = data[data.Store == 1]

# Filter to open days
store1_open_data = store1_data[store1_data.Open == 1]

# Plot the sales over time
store1_open_data[['Sales']].plot()
/home/user/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a3d40c6d0>
Image in a Jupyter notebook: line plot of daily Sales for Store 1

Check: Compute the autocorrelation of Sales in Store 1 for lags 1 and 2. Will we be able to use a predictive model, in particular an autoregressive one?

store1_data.Sales.autocorr(lag=1) # -0.12
-0.12732514339140219
store1_data.Sales.autocorr(lag=2) # -0.03
-0.034787155707946972
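To compare horizons more broadly, we can sweep a range of lags (a quick sketch; the specific lags, covering the previous day, week, and month, are illustrative):

# Autocorrelation at several lags: previous day, two days, week, two weeks, month
for lag in [1, 2, 7, 14, 30]:
    print 'lag %2d: %+.3f' % (lag, store1_data.Sales.autocorr(lag=lag))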

Pandas and statsmodels both provide convenience plots for autocorrelations.

# Note: in newer pandas versions this lives at pandas.plotting.autocorrelation_plot
from pandas.tools.plotting import autocorrelation_plot

autocorrelation_plot(store1_data.Sales)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a42e9fed0>
Image in a Jupyter notebook: pandas autocorrelation plot of Store 1 Sales
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(store1_data.Sales, lags=10)
plt.show()
Image in a Jupyter notebook: ACF of Store 1 Sales, lags 0-10

Check: What caused the spike at lag 7?

ARMA Model

Recall that ARMA(p, q) models are a sum of an AR(p) and an MA(q) model. So if we want just an AR(p) model, we use an ARMA(p, 0) model.
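Concretely (the notation here is added for reference), the ARMA(p, q) model is

$y_t = c + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t$

where the $\varphi_i$ are the AR coefficients, the $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is white noise. Setting $q = 0$ drops the MA sum and leaves a pure AR(p) model.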

from statsmodels.tsa.arima_model import ARMA

store1_sales_data = store1_open_data[['Sales']].astype(float)
model = ARMA(store1_sales_data, (1, 0)).fit()
print model.summary()
                              ARMA Model Results
==============================================================================
Dep. Variable:                  Sales   No. Observations:                  781
Model:                     ARMA(1, 0)   Log Likelihood               -6267.326
Method:                       css-mle   S.D. of innovations            739.079
Date:                Wed, 27 Jul 2016   AIC                          12540.651
Time:                        18:27:13   BIC                          12554.633
Sample:                    07-31-2015   HQIC                         12546.029
                         - 01-02-2013
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const        4762.6173     82.986     57.391      0.000      4599.969  4925.266
ar.L1.Sales     0.6822      0.026     26.122      0.000         0.631     0.733
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.4659           +0.0000j            1.4659            0.0000
-----------------------------------------------------------------------------
model = ARMA(store1_sales_data, (2, 0)).fit()
print model.summary()
                              ARMA Model Results
==============================================================================
Dep. Variable:                  Sales   No. Observations:                  781
Model:                     ARMA(2, 0)   Log Likelihood               -6267.032
Method:                       css-mle   S.D. of innovations            738.800
Date:                Wed, 27 Jul 2016   AIC                          12542.063
Time:                        18:27:13   BIC                          12560.705
Sample:                    07-31-2015   HQIC                         12549.233
                         - 01-02-2013
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const        4762.3980     85.262     55.856      0.000      4595.287  4929.509
ar.L1.Sales     0.6634      0.036     18.537      0.000         0.593     0.734
ar.L2.Sales     0.0275      0.036      0.767      0.443        -0.043     0.098
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.4235           +0.0000j            1.4235            0.0000
AR.2          -25.5833           +0.0000j           25.5833            0.5000
-----------------------------------------------------------------------------
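Note that the second AR coefficient (ar.L2.Sales, p = 0.443) is not statistically significant, and the AIC is slightly worse than for ARMA(1, 0) (12542.1 vs. 12540.7), so the extra lag adds little.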

Just like with other types of regression, we can compute the model residuals.

Check: What are residuals? In linear regression, what did we expect of residuals?
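If the model has captured the structure in the series, the residuals should look like white noise: mean zero, constant variance, and no remaining autocorrelation. Plotting the residuals and their ACF lets us check this.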

model.resid.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a2a3e5110>
plot_acf(model.resid, lags=50)
plt.show()

Because of the correlated errors, it doesn't look like an AR model is good enough; the data isn't stationary. So let's expand to an ARMA model.

model = ARMA(store1_sales_data, (1, 1)).fit()
print model.summary()
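An ARIMA(p, d, q) model is an ARMA(p, q) model fit to the series after differencing it d times; differencing removes trends and is the standard way to make a non-stationary series stationary before modeling.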
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(store1_sales_data, (2, 0, 2)).fit()
print model.summary()
model = ARIMA(store1_sales_data, (2, 1, 2)).fit()
print model.summary()
model = ARIMA(store1_sales_data, (2, 1, 0)).fit()
print model.summary()
store1_sales_data.Sales.diff(1).autocorr(1)  # -0.181
store1_sales_data.Sales.diff(1).plot()
plt.show()
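As a quick sanity check (a sketch, not part of the original lesson), the Augmented Dickey-Fuller test from statsmodels can quantify whether the differenced series looks stationary:

from statsmodels.tsa.stattools import adfuller

# ADF test on the first-differenced series; a small p-value rejects
# the unit-root null, i.e. suggests the differenced series is stationary
adf_stat, p_value = adfuller(store1_sales_data.Sales.diff(1).dropna())[:2]
print 'ADF statistic: %.3f, p-value: %.4f' % (adf_stat, p_value)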
model.plot_predict(1, 50)
fig, ax = plt.subplots()
ax = store1_sales_data['2014'].plot(ax=ax)
fig = model.plot_predict(1, 200, ax=ax, plot_insample=False)
model = ARIMA(store1_sales_data, (7, 1, 2)).fit()
model.summary()
plot_acf(model.resid, lags=50)
plt.show()
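To choose among these candidate orders more systematically, we could compare their AIC values (a sketch; the order list below is illustrative, not from the lesson, and lower AIC is better):

# Fit a few candidate (p, d, q) orders and compare them by AIC
for order in [(1, 0, 0), (2, 1, 0), (2, 1, 2), (7, 1, 2)]:
    candidate = ARIMA(store1_sales_data, order).fit()
    print '%s AIC: %.1f' % (order, candidate.aic)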