GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-16/L16-Demo.ipynb
Kernel: Python [Root]
%matplotlib inline
import matplotlib.pyplot as plt

To explore time series models, we will continue with the Rossmann sales data. This dataset contains sales at every Rossmann store over a 3-year period, along with holiday indicators and basic store information.

In the last class, we plotted the sales data at a particular store to see how sales changed over time. We also computed the autocorrelation of the data at varying lag periods. This helps us identify whether previous timepoints are predictive of future data, and which time points matter most: the previous day? week? month?

import pandas as pd

# Load the data and set the DateTime index
data = pd.read_csv('../../lessons/lesson-15/assets/dataset/rossmann.csv', skipinitialspace=True)
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

# Filter to Store 1
store1_data = data[data.Store == 1]

# Filter to open days
store1_open_data = store1_data[store1_data.Open == 1]

# Plot the sales over time
store1_open_data[['Sales']].plot()
/home/user/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a3d40c6d0>
Image in a Jupyter notebook: line plot of daily Sales for Store 1

Check: Compute the autocorrelation of Sales in Store 1 for lags 1 and 2. Will we be able to use a predictive model, in particular an autoregressive one?

store1_data.Sales.autocorr(lag=1) # -0.12
-0.12732514339140219
store1_data.Sales.autocorr(lag=2) # -0.03
-0.034787155707946972
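To compare horizons more broadly, we can sweep a range of lags (a quick sketch; the specific lags, covering the previous day, week, and month, are illustrative):

# Autocorrelation at several lags: previous day, two days, week, two weeks, month
for lag in [1, 2, 7, 14, 30]:
    print 'lag %2d: %+.3f' % (lag, store1_data.Sales.autocorr(lag=lag))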

Pandas and statsmodels both provide convenience plots for autocorrelations.

# Note: in newer pandas versions this lives at pandas.plotting.autocorrelation_plot
from pandas.tools.plotting import autocorrelation_plot

autocorrelation_plot(store1_data.Sales)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a42e9fed0>
Image in a Jupyter notebook: pandas autocorrelation plot of Store 1 Sales
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(store1_data.Sales, lags=10)
plt.show()
Image in a Jupyter notebook: ACF of Store 1 Sales, lags 0-10

Check: What caused the spike at lag 7?

ARMA Model

Recall that ARMA(p, q) models are a sum of an AR(p) and an MA(q) model. So if we want just an AR(p) model, we use an ARMA(p, 0) model.
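Concretely (the notation here is added for reference), the ARMA(p, q) model is

$y_t = c + \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t$

where the $\varphi_i$ are the AR coefficients, the $\theta_j$ are the MA coefficients, and $\varepsilon_t$ is white noise. Setting $q = 0$ drops the MA sum and leaves a pure AR(p) model.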

from statsmodels.tsa.arima_model import ARMA

store1_sales_data = store1_open_data[['Sales']].astype(float)
model = ARMA(store1_sales_data, (1, 0)).fit()
print model.summary()
                              ARMA Model Results
==============================================================================
Dep. Variable:                  Sales   No. Observations:                  781
Model:                     ARMA(1, 0)   Log Likelihood               -6267.326
Method:                       css-mle   S.D. of innovations            739.079
Date:                Wed, 27 Jul 2016   AIC                          12540.651
Time:                        18:27:13   BIC                          12554.633
Sample:                    07-31-2015   HQIC                         12546.029
                         - 01-02-2013
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const        4762.6173     82.986     57.391      0.000      4599.969  4925.266
ar.L1.Sales     0.6822      0.026     26.122      0.000         0.631     0.733
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.4659           +0.0000j            1.4659            0.0000
-----------------------------------------------------------------------------
model = ARMA(store1_sales_data, (2, 0)).fit()
print model.summary()
                              ARMA Model Results
==============================================================================
Dep. Variable:                  Sales   No. Observations:                  781
Model:                     ARMA(2, 0)   Log Likelihood               -6267.032
Method:                       css-mle   S.D. of innovations            738.800
Date:                Wed, 27 Jul 2016   AIC                          12542.063
Time:                        18:27:13   BIC                          12560.705
Sample:                    07-31-2015   HQIC                         12549.233
                         - 01-02-2013
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const        4762.3980     85.262     55.856      0.000      4595.287  4929.509
ar.L1.Sales     0.6634      0.036     18.537      0.000         0.593     0.734
ar.L2.Sales     0.0275      0.036      0.767      0.443        -0.043     0.098
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.4235           +0.0000j            1.4235            0.0000
AR.2          -25.5833           +0.0000j           25.5833            0.5000
-----------------------------------------------------------------------------
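Note that the second AR coefficient (ar.L2.Sales, p = 0.443) is not statistically significant, and the AIC is slightly worse than for ARMA(1, 0) (12542.1 vs. 12540.7), so the extra lag adds little.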

Just like with other types of regression, we can compute the model residuals.

Check: What are residuals? In linear regression, what did we expect of residuals?
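If the model has captured the structure in the series, the residuals should look like white noise: mean zero, constant variance, and no remaining autocorrelation. Plotting the residuals and their ACF lets us check this.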

model.resid.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f5a2a3e5110>
plot_acf(model.resid, lags=50)
plt.show()

Because of the correlated errors, it doesn't look like an AR model is good enough; the data isn't stationary. So let's expand to an ARMA model.

model = ARMA(store1_sales_data, (1, 1)).fit()
print model.summary()
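An ARIMA(p, d, q) model is an ARMA(p, q) model fit to the series after differencing it d times; differencing removes trends and is the standard way to make a non-stationary series stationary before modeling.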
from statsmodels.tsa.arima_model import ARIMA

model = ARIMA(store1_sales_data, (2, 0, 2)).fit()
print model.summary()
model = ARIMA(store1_sales_data, (2, 1, 2)).fit()
print model.summary()
model = ARIMA(store1_sales_data, (2, 1, 0)).fit()
print model.summary()
store1_sales_data.Sales.diff(1).autocorr(1)  # -0.181
store1_sales_data.Sales.diff(1).plot()
plt.show()
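As a quick sanity check (a sketch, not part of the original lesson), the Augmented Dickey-Fuller test from statsmodels can quantify whether the differenced series looks stationary:

from statsmodels.tsa.stattools import adfuller

# ADF test on the first-differenced series; a small p-value rejects
# the unit-root null, i.e. suggests the differenced series is stationary
adf_stat, p_value = adfuller(store1_sales_data.Sales.diff(1).dropna())[:2]
print 'ADF statistic: %.3f, p-value: %.4f' % (adf_stat, p_value)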
model.plot_predict(1, 50)
fig, ax = plt.subplots()
ax = store1_sales_data['2014'].plot(ax=ax)
fig = model.plot_predict(1, 200, ax=ax, plot_insample=False)
model = ARIMA(store1_sales_data, (7, 1, 2)).fit()
model.summary()
plot_acf(model.resid, lags=50)
plt.show()
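To choose among these candidate orders more systematically, we could compare their AIC values (a sketch; the order list below is illustrative, not from the lesson, and lower AIC is better):

# Fit a few candidate (p, d, q) orders and compare them by AIC
for order in [(1, 0, 0), (2, 1, 0), (2, 1, 2), (7, 1, 2)]:
    candidate = ARIMA(store1_sales_data, order).fit()
    print '%s AIC: %.1f' % (order, candidate.aic)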