YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_17/Forecasting a Time Series in Python.ipynb
Kernel: Python 3

How to Forecast a Time Series with Python

Wouldn't it be nice to know the future? This notebook accompanies the blog post on Medium. Please check the blog for visualizations and explanations; this notebook is really just for the code 😃

Processing the Data

Let's explore the industrial production of electric and gas utilities in the United States from 1985 to 2018, at a monthly frequency.

You can access this data here: https://fred.stlouisfed.org/series/IPG2211A2N

This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv("data/Electric_Production.csv", index_col=0)
data.head()

Right now our index is just a list of strings that happen to look like dates. We'll want to convert them to timestamps so that our forecasting analysis can interpret them as dates:

data.index
Index(['1985-01-01', '1985-02-01', '1985-03-01', '1985-04-01', '1985-05-01', '1985-06-01', '1985-07-01', '1985-08-01', '1985-09-01', '1985-10-01', ... '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01'], dtype='object', name='DATE', length=397)
data.index = pd.to_datetime(data.index)
data.head()
data.index
DatetimeIndex(['1985-01-01', '1985-02-01', '1985-03-01', '1985-04-01', '1985-05-01', '1985-06-01', '1985-07-01', '1985-08-01', '1985-09-01', '1985-10-01', ... '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01', '2018-01-01'], dtype='datetime64[ns]', name='DATE', length=397, freq=None)

Let's first make sure that the data doesn't have any missing data points:

data[pd.isnull(data['IPG2211A2N'])]

Let's also rename this column, since it's hard to remember what the code "IPG2211A2N" stands for:

data.columns = ['Energy Production']
data.head()
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(data)
result.plot()
/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools
[Seasonal decomposition plot: observed, trend, seasonal, and residual components]
#!pip install pyramid-arima
from pyramid.arima import auto_arima

The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that uses many features to fit the data well is assigned a larger AIC score than a model that achieves the same goodness of fit with fewer features. We are therefore interested in the model that yields the lowest AIC value.
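The trade-off above can be written as AIC = 2k − 2 ln(L̂), where k is the number of parameters and L̂ the maximized likelihood. A toy sketch (the log-likelihoods below are hypothetical, just to illustrate the penalty):

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = 2k - 2*ln(L): rewards fit, penalizes parameter count."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical: model B fits slightly better but uses many more parameters
model_a = aic(log_likelihood=-880.0, n_params=5)   # 2*5  - 2*(-880) = 1770.0
model_b = aic(log_likelihood=-878.0, n_params=12)  # 2*12 - 2*(-878) = 1780.0
print(model_a < model_b)  # True: the simpler model wins here
```

This is why the stepwise search below can reject a slightly better-fitting model if the extra parameters don't pay for themselves.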

stepwise_model = auto_arima(
    data,
    start_p=1, start_q=1,   # start the stepwise search at p=q=1
    max_p=3, max_q=3,
    m=12,                   # monthly data: seasonal period of 12
    start_P=0,
    seasonal=True,
    d=1, D=1,
    trace=True,
    error_action='ignore',
    suppress_warnings=True,
    stepwise=True,
)
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1782.527, BIC=1802.447, Fit time=2.309 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=1942.040, BIC=1957.976, Fit time=0.394 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=1837.289, BIC=1853.224, Fit time=0.604 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=1783.914, BIC=1807.817, Fit time=2.105 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=1920.884, BIC=1936.820, Fit time=0.661 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=1784.107, BIC=1808.011, Fit time=4.474 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1781.951, BIC=1809.839, Fit time=4.972 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1837.164, BIC=1861.067, Fit time=1.684 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=1782.659, BIC=1814.530, Fit time=5.882 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=1852.587, BIC=1876.490, Fit time=1.398 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1781.952, BIC=1813.824, Fit time=6.537 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 2, 12); AIC=1864.184, BIC=1884.103, Fit time=1.650 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=1782.702, BIC=1818.557, Fit time=7.820 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 2, 12); AIC=1772.548, BIC=1804.420, Fit time=6.194 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1771.295, BIC=1799.182, Fit time=3.995 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 0, 12); AIC=1870.049, BIC=1889.969, Fit time=1.895 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1825.210, BIC=1849.114, Fit time=1.828 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=1772.010, BIC=1803.881, Fit time=5.743 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1842.551, BIC=1866.454, Fit time=1.456 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1771.724, BIC=1803.595, Fit time=6.972 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(2, 1, 1, 12); AIC=1855.606, BIC=1875.526, Fit time=0.866 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=1773.026, BIC=1808.882, Fit time=7.823 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(2, 1, 0, 12); AIC=1813.389, BIC=1837.292, Fit time=4.161 seconds
Total fit time: 81.439 seconds
stepwise_model.aic()  # the lower the AIC, the better
1771.294930982826
stepwise_model
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')

Model Validation

  • Split into train/test sets (i.e. use earlier data to predict later data)

  • Examine Residuals to make sure that there is no autocorrelation

  • Compare against actual data

data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 397 entries, 1985-01-01 to 2018-01-01
Data columns (total 1 columns):
Energy Production    397 non-null float64
dtypes: float64(1)
memory usage: 6.2 KB

We'll train on data from 1985 through 2016, test our forecast on the 37 months from January 2015 onward, and compare it to the real data.

train = data.loc['1985-01-01':'2016-12-01']
train.tail()
test = data.loc['2015-01-01':]
test.head()
test.tail()
len(test)
37
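Note that the slices above let the train and test windows overlap (train runs through 2016-12 while test starts in 2015-01), so the model gets to see part of the evaluation period. For a strict out-of-sample split, slice at a single cut-off. A sketch on a hypothetical frame with the same shape as the energy data:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame standing in for the energy data
idx = pd.date_range("1985-01-01", periods=397, freq="MS")
df = pd.DataFrame({"Energy Production": np.arange(397.0)}, index=idx)

train = df.loc[:"2014-12-01"]     # everything up to and including the cut-off
test = df.loc["2015-01-01":]      # strictly after it

assert len(train) + len(test) == len(df)  # no overlap, no gap
print(len(train), len(test))
```

With this split the reported test error is a genuine forecast error; the overlapping split above will make the model look better than it is.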
stepwise_model.fit(train)
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')
pd.Series(stepwise_model.resid()).plot()
plt.ylim(-10, 10)
plt.title('Residuals')
Text(0.5,1,'Residuals')
[Residuals plot]
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(stepwise_model.resid())
[ACF plot of the residuals]

Let's generate our forecast and take a close-up look at the predicted values versus the actual values.

future_forecast = stepwise_model.predict(n_periods=37)
future_forecast
array([121.00743776, 110.05467257, 100.66047516, 90.6729823 , 92.16416883, 103.23036522, 112.50942838, 112.11508272, 101.04427752, 92.07885824, 95.83030458, 111.27080748, 120.21944313, 111.33123474, 102.17200219, 90.55932585, 92.13396728, 102.88777035, 111.87355708, 111.06690371, 100.84894434, 92.07432908, 95.82782063, 109.23249115, 119.35316143, 110.56528319, 100.99499069, 90.20977007, 91.75987078, 102.97587924, 112.21153054, 111.68598199, 101.10524767, 91.83726169, 95.09018796, 109.42419239, 119.3850524 ])
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])
future_forecast.head()
test.head()
pd.concat([test, future_forecast], axis=1).plot()
plt.title('Predicted vs Actual')
Text(0.5,1,'Predicted vs Actual')
[Predicted vs. actual plot over the test period]
future_forecast2 = future_forecast
pd.concat([data, future_forecast2], axis=1).plot()
plt.title('Predicted vs Actual, Long Term Trend')
Text(0.5,1,'Predicted vs Actual, Long Term Trend')
[Predicted vs. actual plot over the full series]
stepwise_model
ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(1, 1, 1), out_of_sample_size=0, scoring='mse', scoring_args={}, seasonal_order=(2, 1, 1, 12), solver='lbfgs', start_params=None, suppress_warnings=True, transparams=True, trend='c')
from sklearn.metrics import mean_squared_error
mean_squared_error(test, future_forecast)
15.30960346411523

On its own the MSE isn't very interpretable; it's mainly useful relative to other models fit to the same data.
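Root mean squared error (RMSE) and mean absolute error (MAE) put the error back in the units of the series (here, production index points), which makes it easier to read. A sketch with hypothetical actual/predicted values standing in for `test` and `future_forecast`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actuals vs. predictions, just to show the metrics
actual = np.array([121.0, 110.5, 100.2, 91.0])
predicted = np.array([119.5, 112.0, 98.0, 93.5])

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)  # same units as the data
mae = mean_absolute_error(actual, predicted)
print(round(rmse, 3), round(mae, 3))
```

For the real forecast, an MSE of ~15.3 corresponds to an RMSE of about 3.9 index points, against monthly values that mostly sit between 90 and 120.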