YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_17/Forecasting a Time Series in Python.ipynb
Kernel: Python 3

How to Forecast a Time Series with Python

Wouldn't it be nice to know the future? This notebook accompanies the blog post on Medium. Please check the blog for visualizations and explanations; this notebook is really just for the code 😃

Processing the Data

Let's explore the industrial production of electric and gas utilities in the United States from 1985 to 2018, at a monthly frequency.

You can access this data here: https://fred.stlouisfed.org/series/IPG2211A2N

This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv("data/Electric_Production.csv", index_col=0)
data.head()

Right now our index is just a list of strings that happen to look like dates. We'll want to convert them to timestamps so that our forecasting analysis can interpret them as dates:

data.index
Index(['1985-01-01', '1985-02-01', '1985-03-01', '1985-04-01', '1985-05-01', '1985-06-01', '1985-07-01', '1985-08-01', '1985-09-01', '1985-10-01', ... '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01'], dtype='object', name='DATE', length=397)
data.index = pd.to_datetime(data.index)
data.head()
data.index
DatetimeIndex(['1985-01-01', '1985-02-01', '1985-03-01', '1985-04-01', '1985-05-01', '1985-06-01', '1985-07-01', '1985-08-01', '1985-09-01', '1985-10-01', ... '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01'], dtype='datetime64[ns]', name='DATE', length=397, freq=None)

Let's first make sure that the data doesn't have any missing data points:

data[pd.isnull(data['IPG2211A2N'])]

Let's also rename this column, since it's hard to remember what the code "IPG2211A2N" stands for:

data.columns = ['Energy Production']
data.head()
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(data)
result.plot()
/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools
[Seasonal decomposition plot: observed, trend, seasonal, and residual components]
#!pip install pyramid-arima
from pyramid.arima import auto_arima

The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that uses many features to fit the data well is assigned a larger AIC score than a model that achieves the same goodness of fit with fewer features. We are therefore interested in the model that yields the lowest AIC value.
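The trade-off above can be written as AIC = 2k − 2 ln(L̂), where k is the number of parameters and L̂ the maximized likelihood. A toy sketch (the log-likelihoods below are hypothetical, just to illustrate the penalty):

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = 2k - 2*ln(L): rewards fit, penalizes parameter count."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical: model B fits slightly better but uses many more parameters
model_a = aic(log_likelihood=-880.0, n_params=5)   # 2*5  - 2*(-880) = 1770.0
model_b = aic(log_likelihood=-878.0, n_params=12)  # 2*12 - 2*(-878) = 1780.0
print(model_a < model_b)  # True: the simpler model wins here
```

This is why the stepwise search below can reject a slightly better-fitting model if the extra parameters don't pay for themselves.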

stepwise_model = auto_arima(
    data,
    start_p=1, start_q=1,   # start the stepwise search at p=q=1
    max_p=3, max_q=3,
    m=12,                   # monthly data: seasonal period of 12
    start_P=0,
    seasonal=True,
    d=1, D=1,
    trace=True,
    error_action='ignore',
    suppress_warnings=True,
    stepwise=True,
)
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1782.527, BIC=1802.447, Fit time=2.309 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=1942.040, BIC=1957.976, Fit time=0.394 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1837.289, BIC=1853.224, Fit time=0.604 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=1783.914, BIC=1807.817, Fit time=2.105 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=1920.884, BIC=1936.820, Fit time=0.661 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=1784.107, BIC=1808.011, Fit time=4.474 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1781.951, BIC=1809.839, Fit time=4.972 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1837.164, BIC=1861.067, Fit time=1.684 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1782.659, BIC=1814.530, Fit time=5.882 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=1852.587, BIC=1876.490, Fit time=1.398 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1781.952, BIC=1813.824, Fit time=6.537 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=1864.184, BIC=1884.103, Fit time=1.650 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1782.702, BIC=1818.557, Fit time=7.820 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 2, 12); AIC=1772.548, BIC=1804.420, Fit time=6.194 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1771.295, BIC=1799.182, Fit time=3.995 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 0, 12); AIC=1870.049, BIC=1889.969, Fit time=1.895 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1825.210, BIC=1849.114, Fit time=1.828 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1772.010, BIC=1803.881, Fit time=5.743 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1842.551, BIC=1866.454, Fit time=1.456 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1771.724, BIC=1803.595, Fit time=6.972 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1855.606, BIC=1875.526, Fit time=0.866 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1773.026, BIC=1808.882, Fit time=7.823 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 0, 12); AIC=1813.389, BIC=1837.292, Fit time=4.161 seconds
Total fit time: 81.439 seconds
stepwise_model.aic()  # the lower the AIC, the better
1771.294930982826
stepwise_model
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')

Model Validation

  • Split into train/test sets (i.e. use earlier data to predict later data)

  • Examine Residuals to make sure that there is no autocorrelation

  • Compare against actual data

data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 397 entries, 1985-01-01 to 2018-01-01
Data columns (total 1 columns):
Energy Production    397 non-null float64
dtypes: float64(1)
memory usage: 6.2 KB

We'll train on data from 1985 through 2016, test our forecast on the 37 months from January 2015 onward, and compare it to the real data.

train = data.loc['1985-01-01':'2016-12-01']
train.tail()
test = data.loc['2015-01-01':]
test.head()
test.tail()
len(test)
37
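Note that the slices above let the train and test windows overlap (train runs through 2016-12 while test starts in 2015-01), so the model gets to see part of the evaluation period. For a strict out-of-sample split, slice at a single cut-off. A sketch on a hypothetical frame with the same shape as the energy data:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame standing in for the energy data
idx = pd.date_range("1985-01-01", periods=397, freq="MS")
df = pd.DataFrame({"Energy Production": np.arange(397.0)}, index=idx)

train = df.loc[:"2014-12-01"]     # everything up to and including the cut-off
test = df.loc["2015-01-01":]      # strictly after it

assert len(train) + len(test) == len(df)  # no overlap, no gap
print(len(train), len(test))
```

With this split the reported test error is a genuine forecast error; the overlapping split above will make the model look better than it is.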
stepwise_model.fit(train)
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')
pd.Series(stepwise_model.resid()).plot()
plt.ylim(-10, 10)
plt.title('Residuals')
Text(0.5,1,'Residuals')
[Residuals plot]
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(stepwise_model.resid())
[ACF plot of the residuals]

Let's generate our forecast and take a close-up look at the predicted values versus the actual values.

future_forecast = stepwise_model.predict(n_periods=37)
future_forecast
array([121.00743776, 110.05467257, 100.66047516, 90.6729823 , 92.16416883, 103.23036522, 112.50942838, 112.11508272, 101.04427752, 92.07885824, 95.83030458, 111.27080748, 120.21944313, 111.33123474, 102.17200219, 90.55932585, 92.13396728, 102.88777035, 111.87355708, 111.06690371, 100.84894434, 92.07432908, 95.82782063, 109.23249115, 119.35316143, 110.56528319, 100.99499069, 90.20977007, 91.75987078, 102.97587924, 112.21153054, 111.68598199, 101.10524767, 91.83726169, 95.09018796, 109.42419239, 119.3850524 ])
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])
future_forecast.head()
test.head()
pd.concat([test, future_forecast], axis=1).plot()
plt.title('Predicted vs Actual')
Text(0.5,1,'Predicted vs Actual')
[Predicted vs. actual plot over the test period]
future_forecast2 = future_forecast
pd.concat([data, future_forecast2], axis=1).plot()
plt.title('Predicted vs Actual, Long Term Trend')
Text(0.5,1,'Predicted vs Actual, Long Term Trend')
[Predicted vs. actual plot over the full series]
stepwise_model
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')
from sklearn.metrics import mean_squared_error
mean_squared_error(test, future_forecast)
15.30960346411523

On its own the MSE isn't very interpretable; it's mainly useful relative to other models fit to the same data.
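Root mean squared error (RMSE) and mean absolute error (MAE) put the error back in the units of the series (here, production index points), which makes it easier to read. A sketch with hypothetical actual/predicted values standing in for `test` and `future_forecast`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actuals vs. predictions, just to show the metrics
actual = np.array([121.0, 110.5, 100.2, 91.0])
predicted = np.array([119.5, 112.0, 98.0, 93.5])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # same units as the data
mae = mean_absolute_error(actual, predicted)
print(round(rmse, 3), round(mae, 3))
```

For the real forecast, an MSE of ~15.3 corresponds to an RMSE of about 3.9 index points, against monthly values that mostly sit between 90 and 120.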