Getting Started with Time Series Analysis
The gist behind time series analysis is that we are given quantitative measurements about the past and we wish to use this information to predict the future, enabling better planning, decision-making and so on. The main difference between time series problems and traditional prediction problems is that in traditional prediction problems, such as image classification, the data points are assumed to be independent of one another, whereas the data points in a time series have a temporal nature to them, i.e. the time dimension adds an explicit ordering to our data points that should be preserved, because it can provide additional/important information to the learning algorithms.
As an example, we will look at real mobile game data that records the number of ads watched per hour.
Averages
To make a prediction of the next point given this time series, one of the most naive methods we can use is the arithmetic average of all the previously observed data points. We take all the values we know, calculate the average and bet that that's going to be the next value. Of course it won't be exact, but it will probably be somewhere in the ballpark (e.g. your final school grade may be the average of all your previous grades).
$$\hat{y}_{t+1} = \frac{1}{t} \sum_{i=1}^{t} y_i$$

Where $\hat{y}_{t+1}$ refers to the predicted value at time $t+1$.
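As a quick illustration, here's a minimal sketch of this naive forecast, assuming the observations are stored in a 1-d array-like object:

```python
import numpy as np

def average_forecast(series):
    """Naive forecast: predict the next value as the mean of all observed values."""
    return np.mean(series)
```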
An improvement over the simple arithmetic average is the average of the last $k$ points. The rationale here is that only recent values matter and taking older data points into consideration would only be adding noise. Calculation of the moving average involves what is sometimes called a "sliding window" of size $k$:

$$\hat{y}_{t+1} = \frac{1}{k} \sum_{n=0}^{k-1} y_{t-n}$$
Although we can't really use this method for making predictions far out into the future (because in order to get the value for the next step, we need the previous values to have actually been observed), the moving average method can be used to smooth the original time series for spotting trends. As we'll soon see, the wider the window, the smoother the trend.
While it might not seem that useful when we set the window size to be 4, if we were to apply the smoothing with a 24-hour window, we would get a daily trend that shows a more interesting and perhaps expected pattern: during the weekends the values are higher (more time to play on the weekends?) while fewer ads are watched on weekdays.
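For reference, this kind of smoothing can be produced with pandas' rolling functionality. The sketch below assumes the hourly data lives in a DataFrame named `ads` with an `Ads` column loaded from a file named `ads.csv`; those names are assumptions for illustration, so adjust them to match the actual dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# `ads.csv`, the `Time` index and the `Ads` column are assumptions for illustration
ads = pd.read_csv('ads.csv', index_col='Time', parse_dates=['Time'])

# a 24-hour window smooths away the hourly noise and reveals the daily pattern
smoothed = ads['Ads'].rolling(window=24).mean()

plt.figure(figsize=(12, 5))
plt.plot(ads['Ads'], alpha=0.4, label='actual')
plt.plot(smoothed, label='moving average, window=24')
plt.legend()
plt.show()
```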
Side note: the following code chunk shows an implementation of the moving average without using pandas' rolling functionality.
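A minimal sketch of what such an implementation might look like (the exact code in the notebook may differ):

```python
import numpy as np

def moving_average(series, window):
    """Smooth `series` with a sliding window of size `window`, without pandas' rolling."""
    series = np.asarray(series, dtype=float)
    smoothed = np.full(len(series), np.nan)  # the first window-1 points have no full window
    for t in range(window - 1, len(series)):
        smoothed[t] = series[t - window + 1:t + 1].mean()
    return smoothed

def moving_average_forecast(series, window):
    """One-step-ahead forecast: the average of the last `window` observations."""
    return np.mean(series[-window:])
```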
Exponential Smoothing
Now let's see what happens if, instead of only weighting the time series' last $k$ values, we weight all available observations while exponentially decreasing the weights as we move further back in time.
A weighted moving average is a moving average where, within the sliding window, values are given different weights, typically so that more recent points matter more. Instead of weighting only the time series' last $k$ values, however, we could consider all of the data points, assigning exponentially smaller weights as we go back in time. This method is called Exponential Smoothing. The mathematical notation for this method is:

$$\hat{y}_{t} = \alpha \cdot y_t + (1 - \alpha) \cdot \hat{y}_{t-1}$$
To compute the formula, we pick an $\alpha$ and a starting value $\hat{y}_0$ (i.e. the first value of the observed data), and then calculate $\hat{y}_t$ recursively for $t = 1, 2, 3, \dots$. As we'll see in later sections, $\hat{y}$ is also referred to as the level.
We can think of $\alpha$ as the smoothing factor or memory decay rate: it defines how quickly we will "forget" the last available true observation. The smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is. In other words, the higher the $\alpha$, the faster the method "forgets" about the past.
Keep in mind that each series has its own best $\alpha$ value. The process of finding the best $\alpha$ is referred to as fitting and we will discuss it in a later section.
One very important characteristic of exponential smoothing is that, remarkably, it can only forecast the current level. If we look at the formula again, $\hat{y}_{t} = \alpha \cdot y_t + (1 - \alpha) \cdot \hat{y}_{t-1}$, we can see that in order to make the prediction $\hat{y}_{t+1}$ we also need to have the observed value $y_{t+1}$. In other software such as R, if we were to use this method to predict the future, it would simply assign all future predictions the last value of the time series.
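A minimal sketch of the recursion described above, assuming the series is indexable (e.g. a list or numpy array):

```python
def exponential_smoothing(series, alpha):
    """Exponentially smooth `series` with smoothing factor alpha in [0, 1]."""
    results = [series[0]]  # the starting value is the first observation
    for t in range(1, len(series)):
        results.append(alpha * series[t] + (1 - alpha) * results[t - 1])
    return results

# smaller alpha -> the past carries more weight and the smoothed series is smoother,
# e.g. compare exponential_smoothing(values, 0.05) with exponential_smoothing(values, 0.3)
```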
Double Exponential Smoothing - Holt Method
The idea behind Double Exponential Smoothing (a.k.a. the Holt Method) is exponential smoothing applied to both level and trend. The basic idea is that if our time series has a trend, we can incorporate that information to do better than just estimating the current level and using it to forecast future observations. To achieve this, we will introduce two new notations: the current "trend", denoted by $b$ (we can think of it as the slope of the time series), as well as the current "level", denoted by $\ell$.
To express this in mathematical notation we now need three equations: one for the level, one for the trend and one to combine the level and trend to get the expected forecast $\hat{y}$:

$$
\begin{align}
\ell_t &= \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
\hat{y}_{t+1} &= \ell_t + b_t
\end{align}
$$
In the first equation, the level is, just as before, essentially the smoothed predicted point. But because it is now only part of the calculation of the forecast (our forecast is a combination of the predicted point and the trend), we can no longer refer to it as $\hat{y}$, so we denote it by $\ell$ instead.
The second equation introduces $\beta$, the trend coefficient. As with $\alpha$, some values of $\beta$ work better than others depending on the series. When $\beta$ is large, we won't give much weight to past trends when estimating the current trend.
Similar to exponential smoothing, where we used the first observed value as the first expected value, we can use the first observed trend as the first expected trend. That is, we'll use the first two points to compute the initial trend: $b_0 = y_1 - y_0$.
Side note: Additive vs Multiplicative
Another thing to know about the trend is that instead of subtracting $y_0$ from $y_1$ to estimate its initial value, we could instead divide one by the other, thereby getting a ratio. The difference between these two approaches is similar to how we can say something costs $20 more or 5% more. The variant of the method based on subtraction is known as additive, while the one based on division is known as multiplicative. The additive method is more straightforward to understand, so we will stick with it here. In practice, we can always try both and see which one works better.
To perform k-step-ahead forecast, we can use linear extrapolation:

$$\hat{y}_{t+k} = \ell_t + k b_t$$
Now we have two parameters to tune: $\alpha$ and $\beta$. The former is responsible for smoothing the series around the trend, the latter for smoothing the trend itself. Again, the larger these values, the more weight recent observations carry and the less smooth the modeled series will be.
Although this method can now predict future values, if we look closer at the forecast formula $\hat{y}_{t+k} = \ell_t + k b_t$, we can see that once the trend ($b$) is estimated to be positive, all future predictions can only go up from the last value of the time series. On the other hand, if the trend is estimated to be negative, all future predictions can only go down. This property makes the method unsuitable for predicting very far out into the future as well. With that in mind, let's now turn to triple exponential smoothing.
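Putting the three equations and the linear extrapolation together, a sketch of Holt's method might look like the following, with the initial level and trend chosen as described above (the notebook's actual implementation may differ):

```python
def double_exponential_smoothing(series, alpha, beta, n_forecast=0):
    """Additive Holt method: returns the fitted values followed by `n_forecast` forecasts."""
    level, trend = series[0], series[1] - series[0]  # initial level and trend
    results = [series[0]]
    for t in range(1, len(series)):
        last_level = level
        level = alpha * series[t] + (1 - alpha) * (last_level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        results.append(level + trend)

    # k-step-ahead forecast via linear extrapolation: level + k * trend
    results.extend(level + k * trend for k in range(1, n_forecast + 1))
    return results
```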
Triple Exponential Smoothing - Holt-Winters Method
The idea behind triple exponential smoothing (a.k.a. the Holt-Winters Method) is to apply exponential smoothing to a third component, seasonality $s$. This means we should not use this method if our time series is not expected to have seasonality.
The seasonal component in the model explains repeated variations around the intercept and trend, and it is specified by the length of the season, in other words by the period at which the variations repeat themselves. To elaborate, for each observation in the season, there is a separate component. For example, if the length of the season is 7 days (weekly seasonality), we will have 7 seasonal components, one for each day of the week. Then the seasonal component of the 3rd point into the season will be exponentially smoothed with the 3rd point of the last season, the 3rd point two seasons ago, etc. In mathematical notation, we now have four equations:

$$
\begin{align}
\ell_t &= \alpha (y_t - s_{t-L}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1} \\
s_t &= \gamma (y_t - \ell_t) + (1 - \gamma) s_{t-L} \\
\hat{y}_{t+m} &= \ell_t + m b_t + s_{t - L + 1 + (m - 1) \bmod L}
\end{align}
$$
Season length is the number of data points after which a new season begins. We will use $L$ to denote the season length.
Note that when estimating the level $\ell_t$, we subtract the estimated seasonality $s_{t-L}$ from $y_t$. The trend part remains the same, and when estimating the seasonality $s_t$, we subtract the estimated level $\ell_t$ from $y_t$.
We now have a third coefficient, $\gamma$, which is the smoothing factor for the seasonal component.
The index for the forecast, $\hat{y}_{t+m}$, is $t + m$, where $m$ can be any positive integer. This means we can forecast any number of points into the future while accounting for the previous level, trend and seasonality.
The index of the seasonal component of the forecast, $s_{t - L + 1 + (m - 1) \bmod L}$, may appear a little mind boggling, but as we'll soon see in the implementation, it is essentially an offset into the list of seasonal components estimated from our observed data. e.g. if we are forecasting the 3rd point into a season 45 seasons into the future, we cannot use seasonal components from the 44th season in the future, since that season is also generated by our forecasting procedure; we must use the last set of seasonal components from observed points, or from "the past" if you will.
Initial Trend Component:
Back when we introduced the double exponential smoothing method, we were required to provide an estimate of the initial trend for the model. The same applies for triple exponential smoothing, except that with seasonal data we can do better than using the first two points to estimate the initial trend. The most common practice here is to average the per-step trend across seasons, e.g. using the first two observed seasons:

$$b_0 = \frac{1}{L} \sum_{i=0}^{L-1} \frac{y_{i+L} - y_i}{L}$$
Initial Seasonal Components:
The situation is slightly more complicated when it comes to providing initial values for the seasonal components. Briefly, we need to compute the average level for every observed season we have, subtract from every observed value the average level of the season it's in, and finally average each of these deviations across our observed seasons. We will forgo the math notation for the initial seasonal components. The following link contains a step-by-step walk-through of this process if interested. Notes: Triple Exponential Smoothing
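Below is a sketch of additive triple exponential smoothing based on the description above, including the initial trend and initial seasonal components. Function names and details are illustrative and the notebook's actual implementation may differ.

```python
import numpy as np

def initial_trend(series, season_len):
    """Average the per-step trend across the first two seasons."""
    total = 0.0
    for i in range(season_len):
        total += (series[i + season_len] - series[i]) / season_len
    return total / season_len

def initial_seasonal_components(series, season_len):
    """Average each in-season position's deviation from its season's mean level."""
    n_seasons = len(series) // season_len
    season_means = [np.mean(series[j * season_len:(j + 1) * season_len]) for j in range(n_seasons)]
    seasonals = np.zeros(season_len)
    for i in range(season_len):
        seasonals[i] = np.mean([series[j * season_len + i] - season_means[j] for j in range(n_seasons)])
    return seasonals

def triple_exponential_smoothing(series, season_len, alpha, beta, gamma, n_forecast):
    """Additive Holt-Winters: returns fitted values followed by `n_forecast` forecasts."""
    seasonals = initial_seasonal_components(series, season_len)
    level, trend = series[0], initial_trend(series, season_len)
    results = [series[0]]
    for t in range(1, len(series) + n_forecast):
        if t < len(series):
            value = series[t]
            last_level = level
            level = alpha * (value - seasonals[t % season_len]) + (1 - alpha) * (last_level + trend)
            trend = beta * (level - last_level) + (1 - beta) * trend
            seasonals[t % season_len] = gamma * (value - level) + (1 - gamma) * seasonals[t % season_len]
            results.append(level + trend + seasonals[t % season_len])
        else:
            # beyond observed data: extrapolate level/trend and reuse the last set of
            # seasonal components estimated from the observed points
            m = t - len(series) + 1
            results.append(level + m * trend + seasonals[t % season_len])
    return results
```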
Time Series Cross Validation
Before we start building this model, let's take a step back and first discuss how to estimate model parameters automatically.
As always, we have to choose a loss function suitable for the task, one that tells us how closely the model approximates the underlying pattern. Then, using cross-validation, we will evaluate our chosen loss function for the given model parameters. The only minor difference compared with standard supervised learning methods is the way we perform cross validation. Because time series data have a temporal structure, we cannot randomly shuffle values within a fold without destroying that structure; with randomization, all the time dependencies between observations would be lost. Hence, the cross validation method that we'll be using is based on a rolling window approach.
The idea is: we train our model on a small segment of the time series from the beginning until some time $t$, make predictions for the next $t + n$ steps and calculate an error. Then we expand our training sample up to the $t + n$ value, make predictions from $t + n$ until $t + 2n$, and continue moving the test segment of the time series until we hit the last available observation.
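As a sketch, scikit-learn's TimeSeriesSplit produces exactly this kind of expanding-window split (the notebook may instead implement the splitting manually):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

values = np.arange(20)  # stand-in for the actual time series values
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(values):
    # each fold trains on everything up to some time t and tests on the block that follows
    print('train: 0 ->', train_idx[-1], '| test:', test_idx[0], '->', test_idx[-1])
```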
Apart from setting up our cross validation, we also need a numerical optimization algorithm to learn our Holt-Winters model's parameters $\alpha$, $\beta$ and $\gamma$. For this model, as well as for the other exponential smoothing methods, there's a constraint on how large the smoothing parameters can be: each of them ranges from 0 to 1. Therefore, in order to minimize our loss function, we have to choose an algorithm that supports constraints on model parameters. In our case, we will use the truncated Newton conjugate gradient method (we'll use scipy's minimize function to achieve this, instead of going into the details of the optimization algorithm, as that is not the focus of this documentation).
The next couple of code chunks set up the cross validation, use the numerical optimization algorithm to learn the optimal parameters and plot the predicted time series versus the original.
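A hedged sketch of what those chunks might look like, reusing the `triple_exponential_smoothing` function sketched earlier and assuming `series` is a 1-d numpy array of the observed values with a 24-hour season length (names and details are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

def holt_winters_cv_score(params, series, season_len, n_splits=3):
    """Rolling-window cross validation error for a given (alpha, beta, gamma)."""
    alpha, beta, gamma = params
    errors = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(series):
        train, test = series[train_idx], series[test_idx]
        predictions = triple_exponential_smoothing(
            train, season_len, alpha, beta, gamma, n_forecast=len(test))
        errors.append(mean_squared_error(test, predictions[-len(test):]))
    return np.mean(errors)

# constrained optimization: each smoothing parameter must lie in [0, 1]
x0 = np.array([0.1, 0.1, 0.1])  # initial guess for alpha, beta, gamma
result = minimize(holt_winters_cv_score, x0,
                  args=(series, 24),           # `series` is assumed to be a 1-d numpy array
                  method='TNC', bounds=[(0, 1)] * 3)
alpha_opt, beta_opt, gamma_opt = result.x
```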
Learning Holt-Winters Method's Parameters
Judging by the plot, our model was able to approximate the initial time series, capturing the daily seasonality and the overall downward trend. We've also used an evaluation metric called MAPE (Mean Absolute Percentage Error):

$$\text{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$$
Mean Absolute Percentage Error measures our model's absolute error (the absolute difference between the model's prediction and the actual number) as a percentage of the actual value.
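A minimal implementation of the metric, assuming none of the actual values are zero:

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE in percent; undefined when any value in y_true is zero."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```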
Pros:
This is a preferable metric because most people are comfortable thinking in terms of percentages. e.g. when predicting an item's demand volume, it might be more interpretable to tell our manager that the prediction is off by less than 4% rather than saying we're off by 3,000 items, especially if the manager doesn't know the typical demand for this item.
Cons:
The percentage interpretation is a double-edged sword. Looking at the formula of the evaluation metric, we can see that the actual observed value appears in the denominator, meaning the metric is not defined when the actual value is zero. A related caveat is that when the actual value is very small, MAPE will often take on extremely large values, rendering it unstable for low-volume data.
Apart from MAPE, some other evaluation metrics that are commonly used in the field of forecasting are Mean Absolute Error and Median Absolute Error. Both metrics retain the original time series' measurement unit, and the median absolute error is also more robust to outliers.
This concludes our discussion of additive exponential smoothing methods. For interested readers, the following link contains resources that introduce both the additive and multiplicative methods. Online Book: Forecasting: Principles and Practice - Exponential smoothing
Reference
Blog: Holt-Winters Forecasting for Dummies (or Developers) - Part I
Blog: How To Backtest Machine Learning Models for Time Series Forecasting
Blog: Forecasting 101: A Guide to Forecast Error Measurement Statistics and How to Use Them
Jupyter Notebook: Open Machine Learning Course - Time series analysis in Python