Time Series: Autocorrelation
Learning Objectives
After this lesson, you will be able to:
Define autocorrelation and list some real-world examples.
Use the Pandas autocorr() function to compute autocorrelation.
Calculate and plot the ACF and PACF using StatsModels and Pandas.
Explain why autocorrelation poses a problem for models that assume independence.
While our analyses in previous weeks were concerned with the correlation between two or more variables (height and weight, education and salary, etc.), in time series data, autocorrelation is a measure of how correlated a variable is with itself.
Specifically, autocorrelation measures how closely related earlier values are with values that occur later in time.
Examples of autocorrelation include today's temperature correlating with yesterday's, a stock's closing price correlating with its price the day before, and a store's sales correlating with its sales one week earlier.
Check: What are some examples of autocorrelation that you can think of in the real world?
How Do We Compute Autocorrelation?
To compute autocorrelation, we fix a lag, k, which is the delta between the given point and the prior point used to compute the correlation.
With a k value of one, we'd compute how correlated a value is with the prior one. With a k value of 10, we'd compute how correlated a variable is with one that's 10 time points earlier.
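For reference, the sample autocorrelation at lag $k$ for a series $y_1, \dots, y_T$ with mean $\bar{y}$ is commonly defined as:

$$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$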
Note that this formula assumes stationarity, which we'll discuss shortly.
Guided Practice
In the last section, we looked at the Rossmann Drugstore data to learn how to handle time series data in Pandas. We'll use this same data set to look for autocorrelation.
We'll import the data and reduce the scope down to one store. Also recall that we need to preprocess the data in Pandas (convert the time data to a datetime object and set it as the index of the DataFrame).
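A minimal sketch of those steps, assuming the Rossmann CSV is saved locally as rossmann.csv (the file name and the choice of store 1 are illustrative):

```python
import pandas as pd

# Load the Rossmann data; the file name here is an assumption.
data = pd.read_csv('rossmann.csv')

# Reduce the scope to a single store (store 1 is arbitrary).
store1 = data[data['Store'] == 1].copy()

# Convert the Date column to datetime objects and set it as the index.
store1['Date'] = pd.to_datetime(store1['Date'])
store1 = store1.set_index('Date').sort_index()
```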
Computing Autocorrelation
To compute autocorrelation using the Pandas .autocorr() function, we pass a value for the lag parameter. Recall that lag is the delta between the given point and the prior point used to compute the autocorrelation.
With a k value of one, we'd compute how correlated a value is with the value that's immediately prior. With a k value of 10, we'd compute how correlated a variable is with the value that's 10 time points prior.
Just like with correlation between different variables, values closer to one indicate stronger positive correlation (and values closer to negative one indicate stronger negative correlation).
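A short sketch, continuing with the hypothetical store1 frame from above:

```python
# Correlation of each day's sales with the previous day's sales (lag 1).
print(store1['Sales'].autocorr(lag=1))

# Correlation of each day's sales with sales from 10 days earlier.
print(store1['Sales'].autocorr(lag=10))
```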
Pandas provides convenience plots for autocorrelations.
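One such convenience function is autocorrelation_plot; a minimal example, again using the store1 frame from the earlier sketch:

```python
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt

# Plot the autocorrelation of daily sales across all lags.
autocorrelation_plot(store1['Sales'])
plt.show()
```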
StatsModels also comes with some convenient packages for calculating and plotting autocorrelation. Load up these two functions and try them out.
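A sketch, assuming the two functions are acf and plot_acf (the lag count of 30 is arbitrary):

```python
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

# Compute the autocorrelation values for the first 30 lags.
acf_values = acf(store1['Sales'], nlags=30)

# Plot the ACF with a shaded confidence band around zero.
plot_acf(store1['Sales'], lags=30)
plt.show()
```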
This plots the correlation between the series and a lagged copy of itself for the lags indicated on the horizontal axis. For instance, at lag 0, the series is completely correlated with itself, so the blue dot sits at 1.0. Points that fall outside of the blue shaded region indicate statistically significant correlations. Big jumps in autocorrelation appear at lags that are multiples of seven. Our sales data are daily, so it makes a lot of sense that a single Monday's sales would be correlated with the prior Monday's (and the one before it... and so on).
Our data set here isn't stationary (the mean, the variance, and/or the covariance vary over time), so it isn't appropriate to try to diagnose what forecasting model we should use. However, we can see the seasonality of the data set clearly in the ACF.
Another important chart for diagnosing your time series is the partial autocorrelation function (PACF) chart. This is similar to autocorrelation, but, instead of being just the correlation at increasing lags, it is the correlation at a given lag, controlling for the effect of previous lags.
Load up the sister functions for partial autocorrelation from StatsModels and test them out on the differenced time series.
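A sketch, assuming the sister functions are pacf and plot_pacf, and a first-differenced series (differencing at lag 1 is an assumption here):

```python
from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# First-difference the sales series to reduce non-stationarity.
diff = store1['Sales'].diff(1).dropna()

# Compute and plot the partial autocorrelation for the first 30 lags.
pacf_values = pacf(diff, nlags=30)
plot_pacf(diff, lags=30)
plt.show()
```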
This plots the correlation at a given lag (indicated by the horizontal axis), controlling for all of the previous lags. We continue to see big jumps in correlation at the weekly time lags, an indicator that seasonality is still present in our time series.
Check: How might seasonality in a data set (monthly, weekly, etc.) show up in autocorrelation plots?
Models like linear regression require that there is little or no autocorrelation in the data. That is, linear regression requires that the residuals/error terms be independent of one another. So far, we have assumed that all of the observations in our models are independent, but this is unlikely with time series data: the temporal component of time series models means that they will often contain autocorrelation.
What are some problems that could arise when using autocorrelated data with a linear model?
Estimated regression coefficients are still unbiased, but they no longer have the minimum variance property.
The MSE may seriously underestimate the true variance of the errors.
The standard error of the regression coefficients may seriously underestimate the true standard deviation of the estimated regression coefficients.
Statistical intervals and inference procedures are no longer strictly applicable.
Check: Why can't we apply linear regression to most time series data sets?
As we learned above, the autocorrelation function (ACF) is a plot of total correlation between different lags. If we decide to use the moving average (MA) method for forecasting, the ACF plot will help us identify the order of the MA model. We can find the lag (the q value) for an MA series by determining when the ACF drops off sharply. For an autoregressive (AR) time series, the ACF will go down gradually without any sharp cut-off.
If the ACF tells us it is an AR series, then we turn to the PACF. If we compute the partial correlation at each lag, it will cut off after the order of the AR series (the p value). For instance, if we have an AR(1) series, the partial autocorrelation function (PACF) will drop sharply after the first lag, as the simulation below illustrates.
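One way to see this cutoff is to simulate an AR(1) series with StatsModels and inspect its PACF (the coefficient of 0.7 is arbitrary):

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_pacf
import matplotlib.pyplot as plt

# AR(1) process with coefficient 0.7; ArmaProcess expects the AR lag
# polynomial, so the coefficient enters with a negative sign: [1, -phi].
ar = np.array([1, -0.7])
ma = np.array([1])
simulated = ArmaProcess(ar, ma).generate_sample(nsample=1000)

# The PACF should drop off sharply after lag 1 for an AR(1) series.
plot_pacf(simulated, lags=20)
plt.show()
```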
We'll learn more about AR and MA models in this lesson's bonus section.
Recap
Autocorrelation is a measure of how dependent a data point is on previous data points.
Investigating ACF and PACF plots can help us identify an appropriate forecasting model and look for seasonality in our time series data.
Simple linear regression cannot be applied to data with autocorrelation because such data no longer have independent errors.
Instructor Note: These are optional and can be assigned as student practice questions outside of class.