Linear Regression by Kalman filter
The Kalman filter is a great tool!!
This is one of my favourite algorithms. It can be applied to a great number of practical problems in finance, and this is the reason I decided to write a notebook on the applications of the Kalman filter to linear regression.
I suggest that beginners in this area have a look at https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python. In my opinion this is one of the best interactive tutorials on Bayesian filters ever written (it is also the only one that I know).
Contents
Real market data
I have collected some market time series from different sources and saved them in historical_data.csv, so that we can now play with them.
The tickers are:
Data can contain NaNs, negative values, zeros and outliers. Let's have a look!
Fortunately there are no non-positive values, but there are several NaNs. Why?
There can be many reasons, but in this case I think it is because of different calendars (different countries have different holidays).
How to solve this problem?
There are several possibilities:
We may remove the entire line
We may forward fill the missing values
We may backward fill the missing values
We may interpolate or estimate (e.g. regression) the missing values
All these possibilities have advantages and disadvantages. This is a delicate task; for more information see Imputation.
First of all, let us select the two tickers for the regression analysis.
Feel free to change the tickers in the next cell:
In this analysis I use the following convention:
If in one day there is only one value missing, I use the method 2) i.e. forward filling. I can justify this choice by saying that if there is no price, it means that the price hasn't changed.
If in one day both values are missing, then I remove the entire line. (Probably the market was closed on that day)
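A minimal sketch of how this convention could be applied with pandas (the read_csv arguments and the ticker_X, ticker_Y names are illustrative assumptions, not the notebook's actual code):

```python
import pandas as pd

# load the two selected price columns (column names are assumptions)
df = pd.read_csv("historical_data.csv", index_col=0, parse_dates=True)[[ticker_X, ticker_Y]]

df = df.dropna(how="all")   # drop the days where BOTH prices are missing (market closed)
df = df.ffill()             # forward fill the days where only ONE price is missing
```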
If we call $P_t$ the price under consideration (stock, index, rate) at time $t$, we can work with three types of returns $R_t$:
Total returns: $R_t = \frac{P_t}{P_{t-1}}$
Linear returns: $R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$
Logarithmic returns: $R_t = \log \frac{P_t}{P_{t-1}}$
For now, I prefer to work with log-returns. But feel free to try other options.
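For instance, with pandas/numpy the three definitions can be computed as follows (df holds the cleaned prices from the sketch above):

```python
import numpy as np

tot_ret = df / df.shift(1)            # total returns      P_t / P_{t-1}
lin_ret = df.pct_change()             # linear returns     (P_t - P_{t-1}) / P_{t-1}
log_ret = np.log(df / df.shift(1))    # logarithmic returns log(P_t / P_{t-1})

ret = log_ret.dropna()                # log-returns are used in the rest of the analysis
```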
Plot
What about the outliers?
We can see that there are returns that are considerably bigger than $3\sigma$ (three standard deviations). If we assume a Normal distribution for the returns, these outliers can create problems.
Wait a second...
We calculated the standard deviation ret.std() using the data containing the outliers!!
It means that the value of the standard deviation is also affected by the outliers!!
The solution is to use robust estimators of the location and the dispersion of a distribution. They work much better in presence of outliers.
Although there are several alternatives, I prefer to use the median and the MAD. These estimators are more robust than the mean and standard deviation. In the standard deviation, the distances from the mean are squared, so large deviations are weighted more heavily, and thus outliers can heavily influence it. In the MAD instead, the deviations of a small number of outliers are irrelevant.
As explained on wiki, in order to use the MAD as a consistent estimator of the standard deviation, we have to take
$$\hat{\sigma} = k \cdot \text{MAD}$$
where, under the assumption of a Normal distribution, $k = 1/\Phi^{-1}(3/4) \approx 1.4826$.
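A sketch of the robust approach with scipy (ret is assumed to be a single column of log-returns; the 3-sigma cutoff mirrors the discussion above):

```python
import numpy as np
from scipy import stats

median = np.median(ret)
# scale="normal" rescales the MAD by k ≈ 1.4826, making it consistent with the std dev
sigma_robust = stats.median_abs_deviation(ret, scale="normal")

outliers = np.abs(ret - median) > 3 * sigma_robust   # flag points beyond 3 robust sigmas
ret_clean = ret[~outliers]
```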
Linear regression
Let me recall some well known notions about linear regression. The linear model is:
$$Y = \alpha + \beta X + \epsilon$$
where
X is the predictor variable
Y is the response variable
$\epsilon$ is the error, such that $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2_{\epsilon}$.
An additional useful hypothesis is to consider Normal errors $\epsilon \sim \mathcal{N}(0, \sigma^2_{\epsilon})$. This assumption permits the use of MLE methods and the calculation of confidence intervals.
The parameters $\alpha$ and $\beta$ can be estimated by:
$$\alpha = \mathbb{E}[Y] - \beta\, \mathbb{E}[X]$$
and, using the bilinear property of the Covariance, $\mathrm{Cov}(X,Y) = \mathrm{Cov}(X,\, \alpha + \beta X + \epsilon) = \beta\, \mathrm{Var}(X)$, so that:
$$\beta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$
If we have some observations $x_i$ of the predictor, we can use the linear regression as a model for the response variables only (which are random): $Y_i = \alpha + \beta x_i + \epsilon_i$, where the $\epsilon_i$ are uncorrelated. The expectation is intended as a conditional expectation, $\mathbb{E}[Y_i] = \mathbb{E}[Y \,|\, X = x_i]$.
If we have two sets of data $\{x_i\}$ and $\{y_i\}$ for $1 \leq i \leq N$, where $y_i$ are the realizations of $Y_i$, we call $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$ and $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$. We also define
$$S_{xx} = \sum_{i=1}^N (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}).$$
The estimators $\hat{\beta}$ and $\hat{\alpha}$ for $\beta$ and $\alpha$ are:
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x}.$$
These estimators are the least squares estimators. They are also MLE estimators and BLUE (best linear unbiased estimators). Of course $\mathbb{E}[\hat{\alpha}] = \alpha$ and $\mathbb{E}[\hat{\beta}] = \beta$ (they are unbiased). We also have:
$$\mathrm{Var}[\hat{\beta}] = \frac{\sigma^2_{\epsilon}}{S_{xx}}, \qquad \mathrm{Var}[\hat{\alpha}] = \sigma^2_{\epsilon} \biggl( \frac{1}{N} + \frac{\bar{x}^2}{S_{xx}} \biggr)$$
and, under the assumption of Normal errors, the estimators are also Normally distributed (the square root of the variances above is called the "standard error" of $\hat{\alpha}$ and $\hat{\beta}$). The residual estimator is $\hat{\epsilon}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$. We can define the following estimators for $\sigma^2_{\epsilon}$:
MLE (biased): $\hat{\sigma}^2_{\epsilon} = \frac{1}{N} \sum_{i=1}^N \hat{\epsilon}_i^2$; it has $\mathbb{E}[\hat{\sigma}^2_{\epsilon}] = \frac{N-2}{N}\, \sigma^2_{\epsilon}$.
Unbiased version: $s^2 = \frac{1}{N-2} \sum_{i=1}^N \hat{\epsilon}_i^2$. Under the Normal assumption of the errors, it is distributed as $\frac{(N-2)\, s^2}{\sigma^2_{\epsilon}} \sim \chi^2_{N-2}$.
Confidence intervals can be obtained from the statistics:
$$\frac{\hat{\beta} - \beta}{s / \sqrt{S_{xx}}} \sim t_{N-2}, \qquad \frac{\hat{\alpha} - \alpha}{s\, \sqrt{\frac{1}{N} + \frac{\bar{x}^2}{S_{xx}}}} \sim t_{N-2}$$
where the parameter $\sigma_{\epsilon}$ is replaced by its unbiased estimator $s$. For a short (but complete) description of confidence intervals you can have a look at the notebook 1.2. For more info see CI for linear regression, or the deeper presentation in Chapter 11.3 of [1].
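For example, a 95% confidence interval for $\beta$ could be computed along these lines (a sketch; x and y are assumed to be numpy arrays of the observations):

```python
import numpy as np
from scipy import stats

N = len(x)
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

beta_hat = Sxy / Sxx
alpha_hat = y.mean() - beta_hat * x.mean()

resid = y - alpha_hat - beta_hat * x
s = np.sqrt(resid @ resid / (N - 2))       # unbiased estimator of sigma_eps
se_beta = s / np.sqrt(Sxx)                 # standard error of beta_hat

t_crit = stats.t.ppf(0.975, df=N - 2)      # 95% two-sided quantile of t_{N-2}
ci_beta = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)
```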
Last, let us recall (with no proof) that if we call $S_{yy} = \sum_{i=1}^N (y_i - \bar{y})^2$, then
$$S_{yy} = \hat{\beta}^2 S_{xx} + \sum_{i=1}^N \hat{\epsilon}_i^2,$$
i.e. the total variability splits into the part explained by the regression and the residual part.
We can define the coefficient of determination $R^2$, such that $0 \leq R^2 \leq 1$, as
$$R^2 = \frac{\hat{\beta}^2 S_{xx}}{S_{yy}} = 1 - \frac{\sum_{i=1}^N \hat{\epsilon}_i^2}{S_{yy}}.$$
Come on... let's compute everything (with the help of the scalar product between vectors, @).
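A sketch of the computation with the scalar product @ (x, y are the numpy arrays of the two return series; all names are illustrative):

```python
import numpy as np

xc = x - x.mean()                 # centered predictor
yc = y - y.mean()                 # centered response

Sxx = xc @ xc                     # scalar products computed with @
Sxy = xc @ yc
Syy = yc @ yc

beta_ols = Sxy / Sxx
alpha_ols = y.mean() - beta_ols * x.mean()

resid = y - alpha_ols - beta_ols * x
R2 = 1 - (resid @ resid) / Syy    # coefficient of determination
```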
The intercept is (always) almost zero. Therefore it is not relevant for the future analysis.
Plot:
The first thing we notice is that, although we removed the outliers, the Normal fit is still not so good. The distribution is better described by a Student t distribution.
However, it is much easier to maintain the Normal assumption, because it permits the use of confidence intervals and... also... the Kalman filter!!
I do not have space and time to explain the meaning and the details of the Kalman filter in this notebook. I assume the reader already has a basic knowledge. Otherwise, for a pedagogical exposition, I suggest having a look at the tutorial mentioned at the top. The presentation on wikipedia, although very short, is also quite clear.
Important things to remember: the KF (Kalman Filter) assumes a linear dynamics for the true state and for the measurements, with Normally distributed errors:
$$x_k = F_k\, x_{k-1} + w_k \quad \text{with} \quad w_k \sim \mathcal{N}(0, Q_k)$$
$$z_k = H_k\, x_k + v_k \quad \text{with} \quad v_k \sim \mathcal{N}(0, R_k)$$
The first equation is the process equation for the true state $x_k$, where $F_k$ is the state transition function (or matrix), and $w_k$ is the process noise with covariance $Q_k$. The second equation is the measurement equation, where $z_k$ are the measurements of the true state $x_k$. The term $H_k$ is the measurement function (or matrix), and $v_k$ is the measurement noise with covariance $R_k$.
The Kalman filter equations, taken from here, are:
Predict:
$$\hat{x}_{k|k-1} = F_k\, \hat{x}_{k-1|k-1} \qquad P_{k|k-1} = F_k\, P_{k-1|k-1}\, F_k^T + Q_k$$
Update:
$$\tilde{y}_k = z_k - H_k\, \hat{x}_{k|k-1} \qquad \text{(pre-fit residual)}$$
$$S_k = H_k\, P_{k|k-1}\, H_k^T + R_k \qquad \text{(residual covariance)}$$
$$K_k = P_{k|k-1}\, H_k^T\, S_k^{-1} \qquad \text{(Kalman gain)}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\, \tilde{y}_k \qquad \text{(a posteriori state estimate)}$$
$$P_{k|k} = (I - K_k H_k)\, P_{k|k-1} \qquad \text{(a posteriori covariance estimate)}$$
The notation $\hat{x}_{k|k-1}$ represents the (a priori) estimate of $x_k$ at time $k$ given the observations (the measurements) up to time $k-1$. Instead, $\hat{x}_{k|k}$ represents the (a posteriori) estimate of $x_k$ at time $k$ given also the observation at time $k$.
Comments:
Residuals are computed in the measurement space. The variable $\hat{x}_{k|k-1}$ needs to be transformed by $H_k$ before the subtraction with $z_k$.
The (a priori) covariance is simply the conditional covariance of the process dynamics: $P_{k|k-1} = \mathrm{Cov}\bigl[\, x_k \,\big|\, z_1, \ldots, z_{k-1} \bigr] = F_k\, P_{k-1|k-1}\, F_k^T + Q_k$.
The term $S_k$ is the conditional covariance of the measurement $z_k$.
Of course the noise terms $w_k$ and $v_k$ are independent of the state $x_{k-1}$.
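The predict/update step above can be written compactly in numpy. This is a generic sketch (the function and variable names are mine, not the notebook's code):

```python
import numpy as np
from numpy.linalg import inv

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of the linear Kalman filter."""
    # Predict (a priori estimate and covariance)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update (a posteriori estimate and covariance)
    y_res = z - H @ x_pred              # residual, computed in measurement space
    S = H @ P_pred @ H.T + R            # residual covariance
    K = P_pred @ H.T @ inv(S)           # Kalman gain
    x_post = x_pred + K @ y_res
    P_post = (np.eye(len(x)) - K @ H) @ P_pred
    return x_post, P_post
```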
Marginal probability distribution
Let us recall the marginal distribution, where the hidden state variables are integrated out. If we define $z_{1:k} = (z_1, \ldots, z_k)$, we have:
$$p(z_{1:N}) = \prod_{k=1}^{N} p(z_k \,|\, z_{1:k-1}) = \prod_{k=1}^{N} \mathcal{N}\bigl(z_k;\ H_k\, \hat{x}_{k|k-1},\ S_k \bigr)$$
where $H_k\, \hat{x}_{k|k-1}$ and $S_k$ are the mean and covariance of the Normal distribution of $z_k$ given $z_{1:k-1}$.
Kalman regression model
Let us introduce the model for the Kalman regression. The state variable is $(\alpha_k, \beta_k)$. These variables are hidden, i.e. they are not directly measurable. We can only measure $x_k$ and $y_k$.
Notation issue!
Here I use the variable $y_k$ to indicate the response variable in the regression and not the residual variable of the Kalman filter. Above I wanted to use the same notation as wikipedia. Below, I indicate the Kalman residuals with r_k.
The model equations are:
process equation:
$$\biggl(\begin{array}{c} \alpha_k\\ \beta_k \end{array}\biggr) = \biggl(\begin{array}{c} \alpha_{k-1}\\ \beta_{k-1} \end{array}\biggr) + w_k \quad \text{with} \quad w_k \sim \mathcal{N}\biggl(0,\ \biggl(\begin{array}{cc} \sigma^2_{\alpha} & 0\\ 0 & \sigma^2_{\beta} \end{array}\biggr)\biggr)$$
measurement equation:
$$y_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggl(\begin{array}{c} \alpha_k\\ \beta_k \end{array}\biggr) + \epsilon_k \quad \text{with} \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2_{\epsilon})$$
We see that the transition matrix is the identity $F = \mathbb{1}$, and the measurement matrix is $H_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr)$. We also assume $w_k$ and $\epsilon_k$ uncorrelated.
Training set
I also rename the columns in order to clarify which column is the X and which is the Y. I overwrite the variables X, Y used before.
Rolling alpha and beta
Here I calculate the OLS $\alpha$ and $\beta$ in the test set, using a rolling window.
Implementation of the Kalman filter for $\alpha$ and $\beta$
In the following cell I decided to initialize the state using the OLS values calculated with the data in the training set. The initial covariance matrix is set to . The variance parameters are set to and .
Intuitively, it makes sense to design the process dynamics by reasoning on the standard deviation $\sigma_{\beta}$. This error describes how the beta changes over time: if $\sigma_{\beta} = 0$ the beta remains constant, while if $\sigma_{\beta}$ is too big the beta can assume unreasonable values. If, for instance, the current value of beta is $\beta_k$, with probability $99.7\%$ at the next time step the beta will be inside the interval $(\beta_k - 3\sigma_{\beta},\ \beta_k + 3\sigma_{\beta})$ (by the three-sigma rule). The same argument works for the choice of the variance of the $\alpha$ process noise.
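A quick numeric check of this three-sigma reasoning (the numbers are purely illustrative):

```python
from scipy import stats

beta_now = 1.0        # illustrative current value of beta
sigma_beta = 0.01     # illustrative std dev of the daily beta increment

p = stats.norm.cdf(3) - stats.norm.cdf(-3)                          # ≈ 0.9973
low, high = beta_now - 3 * sigma_beta, beta_now + 3 * sigma_beta    # (0.97, 1.03)
print(f"P( {low:.2f} < beta_next < {high:.2f} ) ≈ {p:.4f}")
```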
The parameter $\sigma_{\epsilon}$ is set equal to the OLS standard deviation estimate, computed in the training set.
All these values must be calibrated using the training dataset! This is a difficult task because the state variables are not observable! In general, the filter designer has an important role in the selection of the process and measurement errors. The parameters should be chosen in order to avoid overfitting/underfitting and to avoid unrealistic values.
Possible methods (but there are more) are:
(1) Using reasonable intuitive assumptions (such as the assumption I used above)
(2) Compute the rolling $\alpha$ and $\beta$ in the training set and calculate their variances (see the sketch after this list). Nevertheless, these values will depend on the selected time window.
(3) Using MLE method. Later I will apply it in a simplified model.
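As an illustration of method (2), the rolling coefficients and a first guess for the process variances could be obtained like this (assuming a DataFrame train with columns "X" and "Y"; the 60-day window is an arbitrary choice):

```python
import pandas as pd

window = 60
roll_beta = train["X"].rolling(window).cov(train["Y"]) / train["X"].rolling(window).var()
roll_alpha = train["Y"].rolling(window).mean() - roll_beta * train["X"].rolling(window).mean()

# variances of the increments, as a rough guess for the process noise
var_beta_guess = roll_beta.diff().var()
var_alpha_guess = roll_alpha.diff().var()
```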
A useful thing to do is to create a validation set. This is an additional set that can be carved out of the training set. (I will not make use of it in this notebook, although it can be very useful.) We can run the Kalman filter on the validation set and check whether the estimated parameters produce good results. For instance, we could verify the intuitive assumption in (1), the choice of the rolling window in (2), and the quality of the MLE estimator in (3). In case of bad results, we can repeat the estimation in the training set under different assumptions.
Important!
If you have changed the tickers, the parameters described above need to be changed as well.
Comments
Looking at the two plots above, we can see that the curves are very similar. An important thing to notice is that the OLS curve is shifted to the right, i.e. the OLS estimate is delayed! The Kalman estimates reflect much better the true (spot) value of the state.
The filter is very robust with respect to the initial guesses. If, for instance, you try to set P = 100 * np.eye(2), you will see that it converges quickly to a "reasonable" value.
In the matrix H[None,i] I had to use the keyword None in order to maintain the information about the array dimensions, H[None,0].shape = (1,2). This is a bad feature of the language: Python automatically removes dimensions of a subarray, i.e. H[0].shape = (2,). K @ H[None,i] is an outer product of dimension (2,2).
I used inv(S) to compute the inverse of S. In the notebook A1 I repeated many times that inverting matrices is bad. But for the Kalman filter, since usually S has small rank (here S is a scalar), we can make an exception 😃
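To make the comments above concrete, here is a rough sketch of the filter loop (not the notebook's exact code; X_test, Y_test, the OLS values and the process variances are assumed to be already defined):

```python
import numpy as np
from numpy.linalg import inv

x = np.array([alpha_ols, beta_ols])       # initial state from the OLS on the training set
P = 0.1 * np.eye(2)                       # initial covariance (illustrative value)
Q = np.diag([var_alpha, var_beta])        # process noise covariance (designer's choice)
R = np.array([[var_eps_ols]])             # measurement noise covariance

H = np.column_stack([np.ones_like(X_test), X_test])   # k-th row is (1, x_k)
betas = np.zeros(len(X_test))

for i in range(len(X_test)):
    # Predict: F is the identity, so only the covariance grows
    P = P + Q
    # Update
    r = Y_test[i] - H[None, i] @ x          # pre-fit residual
    S = H[None, i] @ P @ H[None, i].T + R   # residual covariance, shape (1,1)
    K = P @ H[None, i].T @ inv(S)           # Kalman gain, shape (2,1)
    x = x + K @ r
    P = (np.eye(2) - K @ H[None, i]) @ P    # K @ H[None,i] is the (2,2) outer product
    betas[i] = x[1]
```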
PROS
The Kalman filter does not need an entire dataset (or time window) to estimate the current state. It just needs ONE measurement! The OLS estimate, instead, depends on the length of the time window, and it never reflects the spot value of the state because it is affected by past data.
CONS
The Kalman filter performance is strongly dependent on the process and measurement errors!! These values are not always measurable and are hard to calibrate.
We have seen that $\alpha_k \approx 0$ and that it does not vary much. For this reason, the parameter alpha is usually not considered a relevant parameter.
Comment:
This is in accordance with the well known Capital Asset Pricing Model (CAPM), which assumes $\alpha = 0$.
Therefore we can simplify the process dynamics by assuming that $\alpha_k = \alpha$ is a simple constant. For simplicity we can set $\alpha = 0$, but here I want to present the general case $\alpha \neq 0$.
Process equation:
$$\beta_k = \beta_{k-1} + w_k \quad \text{with} \quad w_k \sim \mathcal{N}(0, \sigma^2_{\beta})$$
Measurement equation:
$$y_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggl(\begin{array}{c} \alpha\\ \beta_k \end{array}\biggr) + \epsilon_k \quad \text{with} \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2_{\epsilon})$$
Now, we can obtain a simple expression for the Kalman filter variables:
Predict:
$$\hat{\beta}_{k|k-1} = \hat{\beta}_{k-1|k-1} \qquad P_{k|k-1} = P_{k-1|k-1} + \sigma^2_{\beta}$$
Variables:
$$r_k = y_k - \alpha - \hat{\beta}_{k|k-1}\, x_k \qquad S_k = x_k^2\, P_{k|k-1} + \sigma^2_{\epsilon} \qquad K_k = \frac{x_k\, P_{k|k-1}}{S_k}$$
Update:
Considering the beta-component only, the two update equations are:
$$\hat{\beta}_{k|k} = \hat{\beta}_{k|k-1} + K_k\, r_k \qquad P_{k|k} = (1 - K_k\, x_k)\, P_{k|k-1}$$
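A minimal sketch of this simplified filter (the actual implementation lives in Kalman_filter.py and may differ; all names here are mine):

```python
import numpy as np

def beta_kalman(X, Y, alpha, beta0, P0, var_beta, var_eps):
    """Scalar Kalman filter for the beta of  y_k = alpha + beta_k * x_k + eps_k."""
    beta, P = beta0, P0
    betas = np.zeros(len(X))
    for k in range(len(X)):
        P = P + var_beta                   # predict: random-walk beta, variance grows
        r = Y[k] - alpha - beta * X[k]     # pre-fit residual
        S = X[k] ** 2 * P + var_eps        # residual variance
        K = X[k] * P / S                   # Kalman gain
        beta = beta + K * r                # update the state
        P = (1 - K * X[k]) * P             # update the variance
        betas[k] = beta
    return betas
```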
Calibration
We can calibrate the two parameters $\sigma^2_{\beta}$ and $\sigma^2_{\epsilon}$ using the data of the training set.
The idea is to use the MLE method, i.e. to maximize the following:
$$\log L(\sigma^2_{\beta}, \sigma^2_{\epsilon}) = -\frac{1}{2} \sum_{k=1}^{N} \biggl( \log\bigl(2\pi S_k\bigr) + \frac{r_k^2}{S_k} \biggr)$$
which is the log-likelihood obtained from Normal distributions. For a derivation of this formula see wiki.
Let us recall that the measurement errors are independent. When the measurement equation is
$$y_k = \alpha + \beta_k\, x_k + \epsilon_k$$
the variables $y_k$, conditioned on the past measurements, are Normally distributed with mean
$$\mathbb{E}[y_k] = \alpha + \hat{\beta}_{k|k-1}\, x_k$$
and variance
$$S_k = x_k^2\, P_{k|k-1} + \sigma^2_{\epsilon}.$$
The algorithms for the beta regression and the calibration based on the MLE method are implemented in the file Kalman_filter.py.
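Since Kalman_filter.py is not reproduced here, the following is only a sketch of how the MLE calibration could be set up with scipy.optimize (function and variable names are mine, not the module's API):

```python
import numpy as np
from scipy.optimize import minimize

def minus_log_likelihood(params, X, Y, alpha, beta0, P0):
    """Negative log-likelihood of the measurements under the beta Kalman model."""
    var_beta, var_eps = params
    beta, P, loglik = beta0, P0, 0.0
    for k in range(len(X)):
        P = P + var_beta                              # predict
        r = Y[k] - alpha - beta * X[k]                # pre-fit residual
        S = X[k] ** 2 * P + var_eps                   # residual variance
        loglik += -0.5 * (np.log(2 * np.pi * S) + r**2 / S)
        K = X[k] * P / S                              # update
        beta += K * r
        P = (1 - K * X[k]) * P
    return -loglik

# illustrative call: X_train, Y_train, alpha_ols, beta_ols, var_eps_ols from the training set
res = minimize(minus_log_likelihood, x0=[1e-5, var_eps_ols],
               args=(X_train, Y_train, alpha_ols, beta_ols, 1.0),
               method="L-BFGS-B", bounds=[(1e-15, 1), (1e-15, 1)])
var_beta_mle, var_eps_mle = res.x
```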
Calibration using MLE
Log-likelihood plot
Log-likelihoods can be ugly functions in general!
It is always better (when possible) to visualize the plot. It will be easier to understand whether it has a global maximum. In general the log-likelihood can be flat, or can have many local maxima.
In our case the log-likelihood has the following shape:
Comments:
The log-likelihood is a concave function of $\sigma^2_{\beta}$. Therefore there is a global maximum.
Sometimes, it happens that the log-likelihood is a decreasing function of $\sigma^2_{\beta}$, so that the maximum is at $\sigma^2_{\beta} = 0$. In this case the value doesn't make sense. If we use $\sigma^2_{\beta} = 0$, it means that we are considering the process model as a perfect model (with no error). The consequence is that the filter will hardly be affected by new measurements.
The log-likelihood is concave with respect to $\sigma^2_{\epsilon}$ and has a maximum point (for each fixed $\sigma^2_{\beta}$). This is very good! We can see that the MLE estimator is very close to the OLS estimator!!
The error comparison above is only indicative, since the OLS error can be made arbitrarily small by simply choosing a bigger time window (the more data we have, the bigger $S_{xx}$ will be).
Quite good!
Using the Kalman filter we obtain very smooth values of beta. The estimation error is smaller and smoother as well.
What about the $R^2$?
A possible alternative to the MLE could be to maximize the $R^2$. Let us see if it works.
The following function considers a fixed $\sigma^2_{\epsilon}$ and searches for the optimal value of $\sigma^2_{\beta}$ in the range bounds=[[1e-15,1]].
There are two reasons to calibrate only one parameter.
The first reason is that we have already seen that the $\sigma^2_{\epsilon}$ obtained with OLS is a good value. It is very close to the MLE estimate.
The second reason is that the $R^2$ is maximized at $\sigma^2_{\epsilon} = 0$ and for a high value of $\sigma^2_{\beta}$. When the measurement error is zero, it means that the measurements are perfect and the Kalman filter simply selects the values of beta that solve the linear equation $y_k = \alpha + \beta_k x_k$. Of course $\sigma^2_{\beta}$ should be big enough to allow fast changes of beta. In this case we have an overfitting problem. Using a high $\sigma^2_{\beta}$ also means that our process model has a very low predictive power.
Let us see what happens when calibrating only $\sigma^2_{\beta}$.
Considering the pre-fit residuals, the optimizer returns an output equal to the lower bound imposed by myself (1e-15). This is not an acceptable value.
Considering the post-fit residuals, the output is the upper bound of the parameter range bounds=[[1e-15,1]].
Overfitting
By looking at the picture below, we see that the $R^2$ is an increasing function of $\sigma^2_{\beta}$, i.e. there is no optimal value.
Well... the $R^2$ maximization is not a good idea! It doesn't work.
Let us see what happens if we run the Kalman filter in the test set using a big value of $\sigma^2_{\beta}$ and a small value of $\sigma^2_{\epsilon}$ (here I set $\sigma^2_{\epsilon}$ to var_eps_ols).
The $R^2$ is very close to 1, which means an almost perfect fit. However, the obtained values of beta are unreasonable.
Conclusions
We have seen that using the $R^2$ for the calibration is not a good idea because it produces overfitting.
The MLE sometimes works, sometimes not. Most of the time, the problematic parameter to estimate is $\sigma^2_{\beta}$.
The OLS estimator for $\sigma^2_{\epsilon}$ is very close to the MLE estimator for $\sigma^2_{\epsilon}$.
The parameter $\sigma^2_{\beta}$ must be chosen by the filter designer, using reasonable assumptions.
References
[1] Casella, George and Berger, Roger (2001). "Statistical Inference", Duxbury Advanced Series.