GitHub Repository: cantaro86/Financial-Models-Numerical-Methods
Path: blob/master/5.1 Linear regression - Kalman filter.ipynb
Kernel: Python 3

Linear Regression by Kalman filter

The Kalman filter is a great tool!!

This is one of my favourite algorithms. It can be used in almost all practical applications in finance. And this is the reason I decided to write a notebook on the applications of the Kalman filter to linear regression.

I suggest that beginners in this area have a look at https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python. In my opinion this is one of the best interactive tutorials on Bayesian filters ever written (it is also the only one that I know of).

Contents

import numpy as np
import datetime as dt
import pandas as pd
import importlib
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from matplotlib import cm
import matplotlib.gridspec as gridspec

%matplotlib inline

import statsmodels.api as sm
import scipy.stats as ss
from scipy.optimize import minimize
from IPython import display
from scipy.linalg import inv

import FMNM.Kalman_filter as KF

Real market data

I have collected some market time series from different sources and saved them in historical_data.csv, so that we can now play with them.

The tickers are:

  • STOCKS: AAPL -> Apple, EXO.MI -> Exor (FIAT, Ferrari, etc.), FB -> Facebook, GOOGL -> Google, UBS -> UBS.

  • INDICES: ^GSPC -> S&P 500, ^IXIC -> Nasdaq, ^VIX -> VIX (implied volatility of the S&P 500).

  • BONDS: ^TNX -> 10Y US Treasury yield.

filename = "data/" + "historical_data.csv"
data = pd.read_csv(filename, index_col="Date", parse_dates=True)
data.tail()

Data cleaning

Data can contain NaNs, negative values, zeros and outliers. Let's have a look!

data[data.isna().any(axis=1)].tail()
(data < 1e-2).any()
AAPL      False
EXO.MI    False
FB        False
GOOGL     False
UBS       False
^GSPC     False
^IXIC     False
^TNX      False
^VIX      False
dtype: bool

Fortunately there are no non-positive values, but there are several NaNs. Why?

There can be many reasons, but in this case I think it is because of different calendars (different countries have different holidays).

How to solve this problem?

There are several possibilities:

    1. We may remove the entire line

    2. We may forward fill the missing values

    3. We may backward fill the missing values

    4. We may interpolate or estimate (e.g. by regression) the missing values

All these possibilities have advantages and disadvantages. This is a delicate task; for more information see Imputation. A short sketch of the four options, as pandas calls, is shown below.
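As a minimal sketch (applied to the data frame loaded above; the regression-based option is left out because it depends on the chosen model), the four options map to standard pandas calls:

# Minimal sketch of options 1-4 on the "data" DataFrame loaded above.
dropped       = data.dropna(axis=0, how="any")    # 1. remove every row containing a NaN
forward_fill  = data.ffill()                      # 2. forward fill the missing values
backward_fill = data.bfill()                      # 3. backward fill the missing values
interpolated  = data.interpolate(method="time")   # 4. interpolate in time (requires a DatetimeIndex)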

First of all, let us select the two tickers for the regression analysis.

Feel free to change the tickers in the next cell:

df = data[["GOOGL", "^GSPC"]]

In this analysis I use the following convention:

  • If on one day only one of the two values is missing, I use method 2, i.e. forward filling. I can justify this choice by saying that if there is no price, the price hasn't changed.

  • If on one day both values are missing, then I remove the entire line (probably the market was closed on that day).

history_len = 1000                            # length of the time series
df.sort_index(inplace=True, ascending=True)   # not necessary in general, but useful
df = df.dropna(axis=1, how="all")             # drops columns that contain only NaNs
df = df.dropna(axis=0, how="all")             # drops rows where all values are NaN
df = df[-history_len:]
df = df.ffill()                               # forward fill the remaining NaNs
print("Are there still some NaNs? ")
df.isnull().any()
Are there still some NaNs?
GOOGL    False
^GSPC    False
dtype: bool

If we call $S_t$ the price under consideration (stock, index, rate) at time $t$, we can work with three types of returns $X_t$:

  • Total returns: $X_t = \frac{S_t}{S_{t-1}}$

  • Linear returns: $X_t = \frac{S_t}{S_{t-1}} - 1$

  • Logarithmic returns: $X_t = \log \frac{S_t}{S_{t-1}}$

For now, I prefer to work with log-returns. But feel free to try other options.

returns = "log-returns"

if returns == "log-returns":
    ret = np.log(df / df.shift())[1:]
elif returns == "linear-returns":
    ret = df.pct_change()[1:]
elif returns == "total-returns":
    ret = (df / df.shift())[1:]
Plot
NOSD = 3  # Number Of Standard Deviations
mu = ret.mean()
sig = ret.std()

fig = plt.figure(figsize=(16, 5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
df.plot(ax=ax1)
ax1.set_title("Time series of prices")
ax1.set_xlabel("Time")
ax1.set_ylabel("Price")
ret.plot(ax=ax2)
ax2.set_title("Time series of returns")
ax2.set_xlabel("Time")
ax2.set_ylabel(returns)
ax2.plot(ret.index, len(ret) * [mu], color="k")
ax2.plot(ret.index, len(ret) * [mu + NOSD * sig], color="c")
ax2.plot(ret.index, len(ret) * [mu - NOSD * sig], color="c")
plt.show()
Image in a Jupyter notebook

What about the outliers?

We can see that there are returns considerably bigger than $3\sigma$ (three standard deviations). If we assume a Normal distribution for the returns, these outliers can create problems.

Wait a second... we calculated the standard deviation ret.std() using the data containing the outliers!! It means that the value of the standard deviation is also affected by the outliers!!

The solution is to use robust estimators of the location and the dispersion of a distribution. They work much better in the presence of outliers.

Although there are several alternatives, I prefer to use the median and the MAD. These estimators are more robust than the mean and standard deviation. In the standard deviation, the distances from the mean are squared, so large deviations are weighted more heavily, and thus outliers can heavily influence it. In the MAD instead, the deviations of a small number of outliers are irrelevant.

As explained on wiki, in order to use the MAD as a consistent estimator of the standard deviation, we have to take

$$\sigma = k \, \text{MAD}$$

where, under the assumption of a Normal distribution, $k = 1.4826$.
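As a quick illustration of this robustness (on synthetic data, not on our market series): a single large outlier inflates the sample standard deviation considerably, while the MAD-based estimate barely moves.

# Quick illustration on synthetic Normal data with one injected outlier.
np.random.seed(0)
sample = np.random.normal(loc=0, scale=0.01, size=1000)   # "true" sigma = 0.01
sample[0] = 0.5                                           # inject one big outlier

print("std dev      :", sample.std(ddof=1))                          # inflated by the outlier
print("1.4826 * MAD :", 1.4826 * ss.median_abs_deviation(sample))    # stays close to 0.01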

median = ret.median()
MAD = ret.apply(ss.median_abs_deviation)
sig_robust = MAD * 1.4826

print("STD DEV: \n", sig)
print("")
print("Robust STD DEV: \n", sig_robust);
STD DEV:
GOOGL    0.014242
^GSPC    0.008324
dtype: float64

Robust STD DEV:
GOOGL    0.010268
^GSPC    0.005255
dtype: float64
outliers = ret[ret[np.abs(ret - median) > NOSD * sig_robust].any(axis=1)]  # outliers
dret = ret.drop(outliers.index)  # series of returns without outliers
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
for ax, col in zip([ax1, ax2], dret.columns):
    dret[col].plot(ax=ax)
    ax.set_title("Clean series - No outliers")
    ax.set_xlabel("Time")
    ax.set_ylabel(returns)
    ax.plot(dret.index, len(dret) * [median[col]], color="k")
    ax.plot(dret.index, len(dret) * [median[col] + NOSD * sig_robust[col]], color="r", label="+3$\sigma$")
    ax.plot(dret.index, len(dret) * [median[col] - NOSD * sig_robust[col]], color="r", label="-3$\sigma$")
    ax.legend(loc="upper right")
plt.show()
Image in a Jupyter notebook
print(
    "We have reduced the initial DIRTY series of returns of size {} to the \
new CLEAN series of size {}.".format(
        len(ret), len(dret)
    )
)
We have reduced the initial DIRTY series of returns of size 999 to the new CLEAN series of size 911.

Linear regression

Let me recall some well known notions about linear regression. The linear model is:

$$Y = \alpha + \beta X + \epsilon$$

where

  • $X$ is the predictor variable

  • $Y$ is the response variable

  • $\epsilon$ is the error, such that $Std[\epsilon] = \sigma$ and $Cov[\epsilon, X] = 0$.

An additional useful hypothesis is to assume Normal errors, $\epsilon \sim \mathcal{N}(0,\sigma^2)$. This assumption permits the use of MLE methods and the calculation of confidence intervals.

The parameters $\alpha$ and $\beta$ can be estimated by:

$$\mathbb{E}[Y] = \alpha + \beta \mathbb{E}[X] \quad \Rightarrow \quad \alpha = \mathbb{E}[Y] - \beta \mathbb{E}[X]$$

and, using the bilinear property of the covariance:

$$Cov[X,Y] = Cov[ X, \alpha + \beta X + \epsilon ] = \beta \, Var[X] \quad \Rightarrow \quad \beta = \frac{Cov[X,Y]}{Var[X]}$$

If we have some observations $x_i$, we can use the linear regression as a model for the response variables only (which are random), $Y_i = \alpha + \beta x_i + \epsilon_i$, where the $\epsilon_i$ are uncorrelated. The expectation $\mathbb{E}[Y_i]$ is intended as a conditional expectation $\mathbb{E}[Y_i \mid x_i]$.

If we have two sets of data $y_i$ and $x_i$ for $1 \leq i \leq n$, where the $y_i$ are the realizations of $Y_i$, we call $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$ and $\bar y = \frac{1}{n} \sum_{i=1}^n y_i$. We also define

$$S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2 \quad\quad S_{yy} = \sum_{i=1}^n (y_i - \bar y)^2 \quad\quad S_{xy} = \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)$$

The estimators $\hat \beta$ and $\hat \alpha$ for $\beta$ and $\alpha$ are:

$$\hat \beta = \frac{S_{xy}}{S_{xx}} \quad \quad \hat \alpha = \bar y - \hat \beta \bar x$$

These estimators are the least squares estimators. They are also the MLE estimators and the BLUE (best linear unbiased estimators). Of course $\mathbb{E}[\hat \beta]=\beta$ and $\mathbb{E}[\hat \alpha]=\alpha$ (they are unbiased). We also have:

$$Var[\hat \beta] = \frac{\sigma^2}{S_{xx}} \quad\quad Var[\hat \alpha] = \frac{\sigma^2}{n S_{xx}}\sum_{i=1}^n x_i^2$$

and under the assumption of Normal errors, the estimators are also Normally distributed (the square root of each variance above is called the "standard error" of $\hat \alpha$ and $\hat \beta$). The residual estimator is $\hat \epsilon_i = y_i - \hat \alpha - \hat \beta x_i$. We can define the following estimators for $\sigma$:

  • MLE (biased): $\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat \epsilon_i^2$, which has $\mathbb{E}[\hat \sigma^2] = \frac{n-2}{n} \sigma^2$.

  • Unbiased version: $S^2 = \frac{n}{n-2}\hat \sigma^2 = \frac{1}{n-2} \sum_{i=1}^n \hat \epsilon_i^2$. Under the assumption of Normal errors, it is distributed as $\frac{(n-2)S^2}{\sigma^2} \sim \chi^2_{n-2}$.

Confidence intervals can be obtained from the statistics:

$$\frac{\hat \alpha - \alpha}{ S \sqrt{(\sum_{i=1}^n x_i^2)/(n S_{xx})} } \sim t_{n-2} \quad \text{and} \quad \frac{\hat \beta - \beta}{ S / \sqrt{S_{xx}} } \sim t_{n-2}$$

where the parameter σ\sigma is replaced by its unbiased estimator SS. For a short (but complete) description of confidence intervals you can have a look at the notebook 1.2. For more info see CI for linear regression, or the deeper presentation in Chapter 11.3 of [1].

Last, let us recall (without proof) that if we call $\hat y_i = \hat \alpha + \hat \beta x_i$, then

$$S_{yy} = \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{S^2_{xy}}{S_{xx}} + \sum_{i=1}^n \hat \epsilon_i^2 .$$

We can define the coefficient of determination $r^2$, with $0 \leq r^2 \leq 1$, as

$$\begin{aligned} r^2 &= \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2} = \frac{S_{xy}^2}{S_{xx}S_{yy}} \\ &= 1 - \frac{\sum_{i=1}^n \hat \epsilon_i^2}{\sum_{i=1}^n (y_i - \bar y)^2}. \end{aligned}$$
X = dret[dret.columns[1]]
Y = dret[dret.columns[0]]
print("In our regression Y is {} and X is {}".format(dret.columns[0], dret.columns[1]))
In our regression Y is GOOGL and X is ^GSPC
b_ss, a_ss, r_ss, _, std_err_ss = ss.linregress(X, Y)
print(
    "We can use one of the many python functions:\nbeta: {0:6f}, alpha: {1:6f}, R2: {2:6f}, \
standard error: {3:6f}.".format(
        b_ss, a_ss, r_ss**2, std_err_ss
    )
)

beta_tot_OLS = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha_tot_OLS = np.mean(Y) - beta_tot_OLS * np.mean(X)
res = Y - beta_tot_OLS * X - alpha_tot_OLS
std_eps_OLS = np.std(res, ddof=2)
print("\nOr I can estimate by myself:\nbeta: {}, alpha: {}, sigma: {}".format(beta_tot_OLS, alpha_tot_OLS, std_eps_OLS))
We can use one of the many python functions:
beta: 1.181743, alpha: 0.000058, R2: 0.428090, standard error: 0.045304.

Or I can estimate by myself:
beta: 1.1817426272884215, alpha: 5.772407285412843e-05, sigma: 0.007842859851366457

Come on... let's compute everything by ourselves (with the help of the scalar product between vectors, @):

n = len(X)
x_mean = np.mean(X)
y_mean = np.mean(Y)
Sxx = (X - x_mean) @ (X - x_mean)
Syy = (Y - y_mean) @ (Y - y_mean)
Sxy = (X - x_mean) @ (Y - y_mean)

beta_hat = Sxy / Sxx
alpha_hat = y_mean - beta_hat * x_mean  # beta and alpha
print("beta: {} and alpha: {}".format(beta_hat, alpha_hat))

epsilon = Y - beta_hat * X - alpha_hat  # residuals
S2 = epsilon @ epsilon / (n - 2)
S = np.sqrt(S2)  # unbiased estimator
print("sigma: {}".format(S))

var_beta = S2 / Sxx                    # variance of beta
var_alpha = (X @ X) * S2 / (Sxx * n)   # variance of alpha
s_b = np.sqrt(var_beta)                # standard error for beta
s_a = np.sqrt(var_alpha)               # standard error for alpha
R2 = Sxy**2 / (Sxx * Syy)
print("Standard Error for beta: ", s_b)
print("R squared: {}".format(R2))

CI_b = ss.t.interval(0.95, n - 2, loc=beta_hat, scale=s_b)
b_down = CI_b[0]   # lower bound of the confidence interval for beta
b_up = CI_b[1]     # upper bound of the confidence interval for beta
CI_a = ss.t.interval(0.95, n - 2, loc=alpha_hat, scale=s_a)
print("Confidence intervals for beta: ", CI_b)
print("Confidence intervals for alpha: ", CI_a)
beta: 1.1817426272884215 and alpha: 5.772407285412843e-05
sigma: 0.007842859851366457
Standard Error for beta:  0.04530406599209334
R squared: 0.42809009281099913
Confidence intervals for beta:  (1.0928299019730954, 1.2706553526037476)
Confidence intervals for alpha:  (-0.0004599474292502449, 0.0005753955749585017)

The intercept $\alpha$ is (always) almost zero. Therefore it is not relevant for the analysis that follows.

Plot:
x = np.linspace(res.min(), res.max(), 100)  # res are the residuals

fig = plt.figure(figsize=(16, 5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.scatter(X, Y)
ax1.set_title("Scatter plot X-Y")
ax1.plot(X, alpha_hat + b_up * X, color="g", label="upper bound beta")
ax1.plot(X, alpha_hat + b_down * X, color="y", label="lower bound beta")
ax1.plot(X, alpha_hat + beta_hat * X, color="red", label="estimated beta")
ax1.legend(loc="upper left")
ax2.plot(x, ss.norm.pdf(x, 0, S), label="Normal distribution")
params = ss.t.fit(res)
ax2.plot(x, ss.t.pdf(x, loc=params[1], scale=params[2], df=params[0]), label="Student t distribution")
ax2.hist(res, bins=50, density=True, facecolor="LightBlue")
ax2.legend()
ax2.set_title("Residual histogram")
plt.show()
Image in a Jupyter notebook

The first thing we notice is that, although we removed the outliers, the Normal fit is still not so good. The distribution is better described by a Student t distribution with

print("Degrees of freedom of fitted t distribution: ", params[0])
Degrees of freedom of fitted t distribution: 7.247881899338388

However, it is much easier to maintain the Normal assumption, because it permits the use of confidence intervals and... also... the Kalman filter!!

The Kalman filter

I do not have the space and time to explain the meaning and the details of the Kalman filter in this notebook. I assume the reader already has some basic knowledge. Otherwise, for a pedagogical exposition, I suggest having a look at the tutorial mentioned at the top. The presentation on wikipedia, although very short, is also quite clear.

Important things to remember: the KF (Kalman Filter) assumes linear dynamics for the true state $\mathbf x_k$, $k \in \mathbb{N}$, and Normally distributed errors.

$$\begin{aligned} \mathbf x_{k} &= \mathbf F_k \mathbf x_{k-1} + \mathbf w_k \quad &\text{with} \quad \mathbf w_k \sim \mathcal{N}(0, \mathbf Q_k) \\ \mathbf z_k &= \mathbf H_k \mathbf x_{k} + \mathbf v_k \quad &\text{with} \quad \mathbf v_k \sim \mathcal{N}(0, \mathbf R_k) \end{aligned}$$

The first equation is the process equation for the true state, where $\mathbf F_k$ is the state transition function (or matrix) and $\mathbf w_k$ is the process noise with covariance $\mathbf Q_k$. The second equation is the measurement equation, where $\mathbf z_k$ is the measurement of the true state $\mathbf x_k$, $\mathbf H_k$ is the measurement function (or matrix), and $\mathbf v_k$ is the measurement noise with covariance $\mathbf R_k$.

The Kalman filter equations, taken from here, are:

$$\begin{aligned}
&\text{Predict step} \\
\hat{\mathbf x}_{k\mid k-1} &= \mathbf F_k\hat{\mathbf x}_{k-1\mid k-1} && \text{(a priori) state mean prediction} \\
\mathbf P_{k\mid k-1} &= \mathbf F_k \mathbf P_{k-1\mid k-1} \mathbf F_k^\mathsf T + \mathbf Q_k && \text{(a priori) state covariance prediction} \\
&\text{Helper variables} \\
\tilde{\mathbf y}_k &= \mathbf z_k - \mathbf H_k\hat{\mathbf x}_{k\mid k-1} && \text{pre-fit residuals} \\
\mathbf{S}_k &= \mathbf H_k \mathbf P_{k\mid k-1} \mathbf H_k^\mathsf T + \mathbf R_k && \text{pre-fit measurement covariance} \\
\mathbf K_k &= \mathbf P_{k\mid k-1}\mathbf H_k^\mathsf T \mathbf{S}_k^{-1} && \text{Kalman gain} \\
&\text{Update step} \\
\hat{\mathbf x}_{k\mid k} &= \hat{\mathbf x}_{k\mid k-1} + \mathbf K_k\tilde{\mathbf y}_k && \text{(a posteriori) state estimate} \\
\mathbf P_{k\mid k} &= (\mathbb{1} - \mathbf K_k \mathbf H_k) \mathbf P_{k\mid k-1} && \text{(a posteriori) covariance estimate} \\
\tilde{\mathbf y}_{k\mid k} &= \mathbf z_k - \mathbf H_k\hat{\mathbf x}_{k\mid k} && \text{post-fit residuals}
\end{aligned}$$

The notation $\hat{\mathbf{x}}_{k\mid k-1}$ represents the (a priori) estimate of $\mathbf{x}$ at time $k$ given the observations (measurements) up to time $k-1$. Instead, $\hat{\mathbf{x}}_{k\mid k}$ represents the (a posteriori) estimate of $\mathbf{x}$ at time $k$ given the observations up to time $k$.

Comments:

  • Residuals are computed in the measurement space. The variable $\mathbf x$ needs to be transformed by $\mathbf H$ before the subtraction with $\mathbf z$.

  • The (a priori) covariance is simply the conditional covariance of the process dynamics:

$$\begin{aligned} \mathbf P_{k\mid k-1} = \text{Cov}[\mathbf x_k \mid \mathbf x_{k-1}] &= \text{Cov}[ \mathbf F_k \mathbf x_{k-1} \mid \mathbf x_{k-1}] + \text{Cov}[\mathbf w_k \mid \mathbf x_{k-1}] \\ &= \mathbf F_k \, \text{Cov}[\mathbf x_{k-1} \mid \mathbf x_{k-1}]\, \mathbf F_k^T + \mathbf Q_k \\ &= \mathbf F_k \, \mathbf P_{k-1\mid k-1}\, \mathbf F_k^T + \mathbf Q_k \end{aligned}$$

  • The term $\mathbf{S}_k$ is the conditional covariance of the measurement:

$$\begin{aligned} \mathbf S_{k} = \text{Cov}[\mathbf z_k \mid \mathbf x_{k-1}] &= \text{Cov}[ \mathbf H_k \mathbf x_{k} \mid \mathbf x_{k-1}] + \text{Cov}[\mathbf v_k \mid \mathbf x_{k-1}] \\ &= \mathbf H_k \, \text{Cov}[\mathbf x_{k} \mid \mathbf x_{k-1}]\, \mathbf H_k^T + \mathbf R_k \end{aligned}$$

  • Of course $\mathbf w_{k}$ and $\mathbf v_{k}$ are independent of $\mathbf x_{k-1}$.
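To make the equations above concrete, here is a minimal NumPy sketch of a single predict/update step. The variable names mirror the math; this is a generic illustration, not the implementation used later in this notebook.

import numpy as np
from scipy.linalg import inv

def kalman_step(x, P, z, F, Q, H, R):
    """One predict/update step of the linear Kalman filter.
    x: state mean (n,1), P: state covariance (n,n), z: measurement (m,1),
    F: transition (n,n), Q: process covariance (n,n),
    H: measurement matrix (m,n), R: measurement covariance (m,m)."""
    # Predict
    x_pred = F @ x                                # a priori state mean
    P_pred = F @ P @ F.T + Q                      # a priori state covariance
    # Helper variables
    y_tilde = z - H @ x_pred                      # pre-fit residual
    S = H @ P_pred @ H.T + R                      # pre-fit measurement covariance
    K = P_pred @ H.T @ inv(S)                     # Kalman gain
    # Update
    x_new = x_pred + K @ y_tilde                  # a posteriori state estimate
    P_new = (np.eye(len(x)) - K @ H) @ P_pred     # a posteriori covariance estimate
    return x_new, P_new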

Marginal probability distribution

Let us recall the marginal distribution, where the hidden state variables $\mathbf x_k$ are integrated out. If we define $\mathbf z = (\mathbf z_0, \mathbf z_1, ... )$, we have:

$$\mathbb{P}(\mathbf z) = \prod_{k} \mathcal{N} \bigl( \mathbf z_k; \; \mathbf H_k \mathbf x_{k \mid k-1}, \, \mathbf S_k \bigr)$$

where $\mathbf H_k \mathbf x_{k \mid k-1}$ and $\mathbf S_k$ are the mean and covariance of the Normal distribution of $\mathbf z_k$.
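In code, the log of this product is just a sum of Normal log-densities evaluated at the pre-fit predictions. A small sketch, assuming one-dimensional measurements (as in the regression below), where z, z_pred_mean and S are arrays produced by the filter recursion:

def measurement_loglikelihood(z, z_pred_mean, S):
    """Log-likelihood of 1-dimensional measurements z_k, given the predicted
    means H_k x_{k|k-1} (z_pred_mean) and the pre-fit variances S_k (S)."""
    z, z_pred_mean, S = map(np.asarray, (z, z_pred_mean, S))
    # sum of log N(z_k; H_k x_{k|k-1}, S_k), i.e. the log of the product above
    return ss.norm.logpdf(z, loc=z_pred_mean, scale=np.sqrt(S)).sum()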

Kalman regression model

Let us introduce the model for the Kalman regression. The state variable is $\mathbf x_k = \bigl( \begin{array}{c} \alpha_k \\ \beta_k \end{array} \bigr)$. These variables are hidden, i.e. they are not measurable directly. We can only measure $x_k$ and $y_k$.

Notation issue!

Here I use the variable $y_k$ to indicate the response variable in the regression and not the residual variable of the Kalman filter. Above I wanted to use the same notation as wikipedia. Below, I indicate the Kalman residuals with r_k.

The model equations are:

process equation:
$$\biggl(\begin{array}{c} \alpha_k\\ \beta_k \end{array}\biggr) = \biggl(\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\biggr) \biggl(\begin{array}{c} \alpha_{k-1}\\ \beta_{k-1} \end{array}\biggr) + \biggl(\begin{array}{c} \eta^{\alpha}_k\\ \eta^{\beta}_k \end{array}\biggr) \quad \text{with} \quad \biggl(\begin{array}{c} \eta^{\alpha}_k\\ \eta^{\beta}_k \end{array}\biggr) \sim \mathcal{N}(0, \mathbf Q_k) \quad \text{and} \quad \mathbf Q_k = \biggl(\begin{array}{cc} \sigma^2_{\alpha} & 0\\ 0 & \sigma^2_{\beta} \end{array}\biggr)$$
measurement equation:
$$y_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggl(\begin{array}{c} \alpha_k\\ \beta_k \end{array}\biggr) + \epsilon_k \quad \text{with} \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2_{\epsilon})$$

We see that the transition matrix is the identity $\mathbf F_k = \mathbb 1$, and the measurement matrix is $\mathbf H_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr)$. We also assume that $\eta^{\alpha}_k$ and $\eta^{\beta}_k$ are uncorrelated.

Training set

I also assign the column names to the variables X and Y, in order to clarify which column is the X and which is the Y. This overwrites the variables $X$, $Y$ used before.

Y = dret.columns[0] X = dret.columns[1]
training_size = 250
ret_train = dret.iloc[:training_size]
ret_test = dret.iloc[training_size:]
beta_tr, alpha_tr, _, _, _ = ss.linregress(ret_train[X], ret_train[Y])
resid_tr = ret_train[Y] - beta_tr * ret_train[X] - alpha_tr
var_eps_ols = resid_tr.var(ddof=2)  # a possible initial guess for var_eps

print("In the training set the OLS estimators of")
print("alpha_tr = ", alpha_tr)
print("beta_tr = ", beta_tr)
print("var_eps = ", var_eps_ols)
In the training set the OLS estimators of
alpha_tr =  0.0002276440411573649
beta_tr =  0.9751772551661367
var_eps =  6.286775484858256e-05

Rolling alpha and beta

Here I calculate the OLS $\alpha$ and $\beta$ in the test set, using a rolling window.

rolling_window = 50
rolling_beta = []
rolling_std_err = []
rolling_alpha = []

for i in range(len(ret_test)):
    ret_temp = ret.iloc[1 + i + training_size - rolling_window : 1 + i + training_size]
    beta_temp, alpha_temp, _, _, std_err_temp = ss.linregress(ret_temp[X], ret_temp[Y])
    rolling_beta.append(beta_temp)
    rolling_std_err.append(std_err_temp)
    rolling_alpha.append(alpha_temp)

ret_test = ret_test.assign(ols_beta=np.array(rolling_beta))
ret_test = ret_test.assign(ols_std_err=np.array(rolling_std_err))
ret_test = ret_test.assign(ols_alpha=np.array(rolling_alpha))

Implementation of the Kalman filter for $\alpha$ and $\beta$

In the following cell I decided to initialize the state $\mathbf x_0 = \bigl( \begin{array}{c} \alpha_0\\ \beta_0 \end{array} \bigr)$ using the OLS values calculated with the data in the training set. The initial covariance matrix is set to $P_0 = 0.1 \cdot \mathbb{1}$. The variance parameters are set to $\sigma^2_{\alpha}=0.0000001$ and $\sigma^2_{\beta}=0.01$.

Intuitively, it makes sense to design the $\beta$ process dynamics with a standard deviation $\sigma_{\beta}=0.1$. This error describes how the beta changes over time: if $\sigma_{\beta}=0$ the beta remains constant, while if $\sigma_{\beta}=100$ the beta can assume unreasonable values. If, for instance, the current value of beta is $1$, then with $68.3\%$ probability the beta at the next time step will be inside the interval $[0.9, 1.1]$ (the one-sigma band of the 68-95-99.7 rule). The same argument works for the choice of the variance of $\alpha$.
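A quick sanity check of the $68.3\%$ figure, using only the Normal CDF (this cell is just an illustration):

# Probability that a Normal increment with sigma_beta = 0.1 stays within +/- 0.1,
# i.e. that beta moves from 1 into [0.9, 1.1] in one step.
prob = ss.norm.cdf(0.1, loc=0, scale=0.1) - ss.norm.cdf(-0.1, loc=0, scale=0.1)
print(prob)   # about 0.683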

The parameter $\sigma^2_{\epsilon}$ is set equal to the OLS estimate of the residual variance, computed on the training set.

All these values should be calibrated using the training dataset! This is a difficult task because the state variables are not observable! In general, the filter designer has an important role in the selection of the process and measurement errors. The parameters should be chosen in order to avoid overfitting/underfitting and to avoid unrealistic values.

Possible methods (but there are more) are:

  • (1) Using reasonable, intuitive assumptions (such as the assumption I used above).

  • (2) Computing the rolling $\alpha$ and $\beta$ in the training set and calculating the variances of their changes. Nevertheless, these values will depend on the selected time window (see the sketch after this list).

  • (3) Using the MLE method. Later I will apply it in a simplified model.
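A rough sketch of approach (2), using the OLS routine on the training set. The 50-day window is an arbitrary choice, which is exactly the drawback mentioned above.

# Rolling OLS on the training set; the variances of the day-to-day increments of the
# rolling coefficients give a first (window-dependent) guess for sigma_beta^2 and sigma_alpha^2.
window = 50
roll_betas, roll_alphas = [], []
for i in range(window, len(ret_train)):
    chunk = ret_train.iloc[i - window : i]
    b, a, _, _, _ = ss.linregress(chunk[X], chunk[Y])
    roll_betas.append(b)
    roll_alphas.append(a)

print("guess for var_beta :", np.var(np.diff(roll_betas), ddof=1))
print("guess for var_alpha:", np.var(np.diff(roll_alphas), ddof=1))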

A useful thing to do is to create a validation set. This is an additional set that can be carved out of the training set. (I will not make use of it in this notebook, although it can be very useful.) We can run the Kalman filter on the validation set and check whether the estimated parameters produce good results. For instance, we could verify the intuitive assumptions in (1), the choice of the rolling window in (2), and the quality of the MLE estimator in (3). In case of bad results, we can repeat the estimation on the training set under different assumptions.

Important!

If you have changed the tickers, the parameters described above need to be changed as well.

x = np.array([[alpha_tr], [beta_tr]])  # initial mean of alpha and beta
P = 0.1 * np.eye(2)                    # initial covariance of alpha and beta
eta_alpha = 0.0000001                  # variance of the alpha process
eta_beta = 0.01                        # variance of the beta process
Q = np.diag([eta_alpha, eta_beta])     # process covariance matrix
R = np.array([[var_eps_ols]])          # variance of the measurement - 1 dim
F = np.eye(2)                          # transition matrix
H = np.vstack((np.ones(len(ret_test)), ret_test[X].values)).T  # measurement matrix
alphas, betas = np.full(len(ret_test), np.nan), np.full(len(ret_test), np.nan)

for i, y in enumerate(ret_test[Y]):
    # predict
    x = F @ x
    P = (F @ P @ F.T) + Q
    # update
    S = (H[None, i] @ P @ H[None, i].T) + R  # here S is a 1x1 matrix
    K = (P @ H[None, i].T) @ inv(S)          # Kalman gain
    r = y - H[None, i] @ x                   # pre-fit residual (a scalar)
    x = x + K @ r
    P = P - K @ H[None, i] @ P               # K @ H[None, i] is an outer product
    alphas[i] = x[0, 0]
    betas[i] = x[1, 0]
plt.figure(figsize=(16, 6))
plt.plot(ret_test.index, alphas, label="Kalman Filter")
plt.plot(ret_test["ols_alpha"], label="OLS")
plt.legend()
plt.title("Comparison Alphas")
plt.show()
Image in a Jupyter notebook
plt.figure(figsize=(16, 6))
plt.plot(ret_test.index, betas, label="Kalman Filter")
plt.plot(ret_test["ols_beta"], label="OLS")
plt.legend()
plt.title("Comparison Beta")
plt.show()
Image in a Jupyter notebook

Comments

  • Looking at the two plots above, we can see that the curves are very similar. An important thing to notice is that the OLS curve is shifted to the right, i.e. the OLS estimate is delayed! The Kalman estimates reflect the true (spot) value of the state much better.

  • The filter is very robust with respect to initial guesses. If, for instance, you try to set P = 100 * np.eye(2), you will see that it will converge quickly to a "reasonable" value.

  • In the matrix H[None,i] I had to use the keyword None in order to maintain the information about the array dimensions, H[None,0].shape = (1,2). This is a bad feature of the language: Python automatically removes the dimensions of a subarray, i.e. H[0].shape = (2,).

  • K @ H[None,i] is an outer product of dimension $(2,2) = (2,1) \times (1,2)$.

  • I used inv(S) to compute the inverse of S. In the notebook A1 I repeated many times that inverting matrices is bad. But for the Kalman filter, since S usually has small dimension (here S is just a 1x1 matrix), we can make an exception 😃 (two alternatives that avoid the explicit inverse are sketched below).
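For reference, a small sketch of two equivalent ways to compute the gain $K = P H^T S^{-1}$ without calling inv(), reusing the P, H, i and S left over from the loop above (the plain division works here only because S is a 1x1 array):

K = (P @ H[None, i].T) / S                          # fine here, since S is 1x1
K = np.linalg.solve(S.T, (P @ H[None, i].T).T).T    # general case: solves K S = P H^T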

PROS

The Kalman filter does not need an entire dataset (or time window) to estimate the current state. It just needs ONE measurement! The OLS estimate, instead, depends on the length of the time window, and it never reflects the spot value of the state because it is affected by past data.

CONS

The Kalman filter's performance depends strongly on the process and measurement errors!! These values are not always measurable and are hard to calibrate.

Kalman regression for the $\beta$

We have seen that $\alpha \approx 0$ and that it does not vary much. For this reason, alpha is usually not considered a relevant parameter.

Comment:

This is in accordance with the well-known Capital Asset Pricing Model (CAPM), which assumes $\alpha=0$.

Therefore we can simplify the process dynamics by assuming that alpha is a constant, $\alpha_k = \alpha$. For simplicity we could set $\alpha=0$, but here I want to present the general case $\alpha \neq 0$.

Process equation:
$$\biggl(\begin{array}{c} \alpha\\ \beta_k \end{array}\biggr) = \biggl(\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\biggr) \biggl(\begin{array}{c} \alpha\\ \beta_{k-1} \end{array}\biggr) + \biggl(\begin{array}{c} 0\\ \eta_k \end{array}\biggr) \quad \text{with} \quad \eta_k \sim \mathcal{N}(0, \sigma^2_{\eta})$$
Measurement equation:
$$y_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggl(\begin{array}{c} \alpha\\ \beta_k \end{array}\biggr) + \epsilon_k \quad \text{with} \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2_{\epsilon})$$

Now, we can obtain a simple expression for the Kalman filter variables:

Predict:
$$\beta_{k \mid k-1} = \beta_{k-1 \mid k-1} \quad \text{and} \quad P_{k \mid k-1} = P_{k-1 \mid k-1} + \sigma_{\eta}^2 .$$
Variables:
$$r_k = y_k - \alpha - \beta_{k \mid k-1} x_k$$

$$S_k = \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggl(\begin{array}{cc} 0 & 0 \\ 0 & P_{k \mid k-1} \end{array}\biggr) \biggl(\begin{array}{c} 1\\ x_k \end{array}\biggr) + \sigma^2_{\epsilon} = P_{k \mid k-1} x_k^2 + \sigma^2_{\epsilon}$$

$$K_k = \frac{1}{S_k} \biggl(\begin{array}{cc} 0 & 0 \\ 0 & P_{k \mid k-1} \end{array}\biggr) \biggl(\begin{array}{c} 1\\ x_k \end{array}\biggr) = \frac{1}{S_k} \biggl(\begin{array}{c} 0\\ x_k P_{k \mid k-1} \end{array}\biggr)$$
Update:
$$\biggl(\begin{array}{c} \alpha\\ \beta_{k \mid k} \end{array}\biggr) = \biggl(\begin{array}{c} \alpha \\ \beta_{k \mid k-1} \end{array}\biggr) + \frac{1}{S_k} \biggl(\begin{array}{c} 0\\ r_k x_k P_{k \mid k-1} \end{array}\biggr)$$

$$\biggl(\begin{array}{cc} 0 & 0 \\ 0 & P_{k \mid k} \end{array}\biggr) = \biggl[ \biggl(\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\biggr) - \frac{1}{S_k} \biggl(\begin{array}{c} 0\\ x_k P_{k \mid k-1} \end{array}\biggr) \bigl(\begin{array}{cc} 1 & x_k \end{array}\bigr) \biggr] \biggl(\begin{array}{cc} 0 & 0 \\ 0 & P_{k \mid k-1} \end{array}\biggr)$$

Considering the beta-component only, the two update equations are:

$$\beta_{k \mid k} = \beta_{k \mid k-1} + \frac{1}{S_k} r_k x_k P_{k \mid k-1} \quad \text{and} \quad P_{k \mid k} = P_{k \mid k-1} \bigl( 1 - K_k x_k \bigr)$$
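A minimal sketch of this scalar recursion in plain NumPy (an illustration, not the FMNM.Kalman_filter implementation; alpha, the initial beta, the initial variance and the two noise variances are assumed to be given):

def kalman_beta(x_series, y_series, alpha, beta0, P0, var_eta, var_eps):
    """Scalar Kalman filter for the beta of y_k = alpha + beta_k x_k + eps_k.
    Returns the filtered betas and their variances P_{k|k}."""
    beta, P = beta0, P0
    betas, Ps = [], []
    for x_k, y_k in zip(x_series, y_series):
        # predict: beta_{k|k-1} = beta_{k-1|k-1}
        P = P + var_eta
        # helper variables
        r = y_k - alpha - beta * x_k     # pre-fit residual
        S = P * x_k**2 + var_eps         # pre-fit variance
        K = P * x_k / S                  # Kalman gain (beta component)
        # update
        beta = beta + K * r
        P = P * (1 - K * x_k)
        betas.append(beta)
        Ps.append(P)
    return np.array(betas), np.array(Ps)

For instance, kalman_beta(ret_test[X].values, ret_test[Y].values, alpha_tr, beta_tr, 0.1, 1e-5, var_eps_ols) should produce a curve comparable to the KR.betas computed below (up to the arbitrary choice of var_eta).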

Calibration

We can calibrate the two parameters $\sigma_{\eta}$ and $\sigma_{\epsilon}$ using the data of the training set.

The idea is to use the MLE method i.e. to maximize the following

$$\log L(\sigma_{\eta},\sigma_{\epsilon} \mid y_k, y_{k-1}, ..., y_1, y_0) = -\frac{1}{2} \sum_{k=1}^N \biggl( \log 2\pi + \log S_k + \frac{r_k^2}{S_k} \biggr)$$

which is the log-likelihood obtained from Normal distributions. For a derivation of this formula see wiki.

Let us recall that all the measurements are independent. When the measurement equation is

$$y_k = \alpha + \beta_k x_k + \epsilon_k$$

the variables $y_k$ are normally distributed with mean

$$\mathbb{E}[ y_k \mid \beta_{k-1 \mid k-1}] = \alpha + \beta_{k \mid k-1}\, x_k$$

and variance

$$\text{Var}[ y_k \mid \beta_{k-1 \mid k-1}] = x_k^2 P_{k \mid k-1} + \sigma_{\epsilon}^2 = S_k .$$
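As an illustration of how this maximization could be wired up with scipy.optimize.minimize (again a sketch, not the code in Kalman_filter.py), reusing the same scalar recursion as in the kalman_beta sketch above and minimizing the negative log-likelihood:

def neg_loglikelihood(params, x_series, y_series, alpha, beta0, P0):
    """Negative log-likelihood of the measurements for given (var_eta, var_eps)."""
    var_eta, var_eps = params
    beta, P = beta0, P0
    nll = 0.0
    for x_k, y_k in zip(x_series, y_series):
        P = P + var_eta                          # predict
        r = y_k - alpha - beta * x_k             # pre-fit residual
        S = P * x_k**2 + var_eps                 # pre-fit variance
        nll += 0.5 * (np.log(2 * np.pi) + np.log(S) + r**2 / S)
        K = P * x_k / S                          # update
        beta = beta + K * r
        P = P * (1 - K * x_k)
    return nll

result = minimize(neg_loglikelihood,
                  x0=[1e-5, var_eps_ols],       # arbitrary starting guesses
                  args=(ret_train[X].values, ret_train[Y].values, alpha_tr, beta_tr, 0.1),
                  method="L-BFGS-B",
                  bounds=[(1e-15, 1), (1e-15, 1)])
print("MLE estimates (var_eta, var_eps):", result.x)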

The algorithms for the beta regression and the calibration based on the MLE method are implemented in the file Kalman_filter.py.

Calibration using MLE
KR = KF.Kalman_regression(ret_train[X], ret_train[Y])
alpha0, beta0 and var_eps initialized by OLS

Log-likelihood plot

Log-likelihood functions can be ugly in general!

It is always better (when possible) to visualize the plot: it is then easier to see whether there is a global maximum. In general the log-likelihood can be flat, or can have many local maxima.

In our case the log-likelihood has the following shape:

dimens = 40
var_eps_x = np.linspace(1e-5, 2e-4, dimens)
var_eta_y = np.linspace(1e-5, 2e-4, dimens)
X_grid, Y_grid = np.meshgrid(var_eps_x, var_eta_y)
V_grid = np.ones((dimens, dimens))

for i, i_eps in enumerate(var_eps_x):
    for j, j_eta in enumerate(var_eta_y):
        KR.var_eps = i_eps
        KR.var_eta = j_eta
        KR.run()
        V_grid[j, i] = KR.loglikelihood

KR.set_OLS_params()
V = np.ones(dimens)
for j, j_eta in enumerate(var_eta_y):
    KR.var_eta = j_eta
    KR.run()
    V[j] = KR.loglikelihood
fig = plt.figure(figsize=(16, 5))
gs = gridspec.GridSpec(1, 2, width_ratios=[3, 2])
ax = fig.add_subplot(gs[0], projection="3d")
ax2 = fig.add_subplot(gs[1])

ax.plot_surface(Y_grid, X_grid, V_grid)  # , cmap=cm.ocean
ax.set_title("Log-likelihood function of $\sigma_{\eta}^2$ and $\sigma_{\epsilon}^2$")
ax.set_xlabel("$\sigma_{\eta}^2$ (var_eta) ", labelpad=9)
ax.set_ylabel("$\sigma_{\epsilon}^2$ (var_eps)", labelpad=13)
ax.set_zlabel("Log-Likelihood", labelpad=7)
ax.view_init(30, 10)  # this function rotates the 3d plot
ax.ticklabel_format(axis="x", style="sci", scilimits=(-4, 2))
ax.ticklabel_format(axis="y", style="sci", scilimits=(-4, 2))

ax2.plot(var_eta_y, V)
ax2.set_title("Log-likelihood function of $\sigma_{\eta}^2$ with fixed $\sigma_{\epsilon}^2$")
ax2.set_xlabel("$\sigma_{\eta}^2$ (var_eta)", labelpad=8)
ax2.ticklabel_format(axis="x", style="sci", scilimits=(0, 0))
plt.show()
Image in a Jupyter notebook
Comments:

  • The log-likelihood is a concave function of $\sigma_{\eta}^2$. Therefore there is a global maximum.

  • Sometimes it happens that the log-likelihood is a decreasing function of $\sigma_{\eta}^2$, so that the maximum is at $\sigma_{\eta}^2 = 0$. In this case the value doesn't make sense: using $\sigma_{\eta}=0$ means that we are considering the process model as a perfect model (with no error), and the consequence is that the filter will hardly be affected by new measurements.

  • The log-likelihood is concave with respect to $\sigma_{\epsilon}^2$ and has a maximum point (for each fixed $\sigma_{\eta}^2$). This is very good! We can see that the MLE estimator is very close to the OLS estimator!!

KR.calibrate_MLE()
print("beta_last = ", KR.betas[-1], "P_last = ", KR.Ps[-1])
print("var_eta_MLE = ", KR.var_eta, "var_eps_MLE = ", KR.var_eps, "var_eps_OLS = ", var_eps_ols)

KR.run(ret_test[X], ret_test[Y])
print("The post-fit R squared is: ", KR.R2_post_fit)
Optimization converged successfully
var_eps = 6.250650685447434e-05, var_eta = 6.250079206505446e-05
beta_last =  1.0132897275906843  P_last =  0.012048131591893373
var_eta_MLE =  6.250079206505446e-05  var_eps_MLE =  6.250650685447434e-05  var_eps_OLS =  6.286775484858256e-05
The post-fit R squared is:  0.4532263803195352
plt.figure(figsize=(16, 6))
plt.plot(ret_test.index, betas, label="Kalman Filter alpha-beta")
plt.plot(ret_test["ols_beta"], label="OLS")
plt.plot(ret_test.index, KR.betas, label="Kalman filter with MLE $\sigma_{\eta}^2$")
plt.legend()
plt.title("Comparison Beta using MLE parameters")
plt.show()
Image in a Jupyter notebook
ret_test = ret_test.assign(Kalman_beta=KR.betas)
ret_test = ret_test.assign(Kalman_err=np.sqrt(KR.Ps))
ret_test.head()
plt.figure(figsize=(16, 6))
plt.plot(ret_test["Kalman_err"], label="Kalman Filter error ($\sqrt{P}$)")
plt.plot(ret_test["ols_std_err"], label="OLS error")
plt.legend()
plt.title("Comparison Beta errors")
plt.show()
Image in a Jupyter notebook

The error comparison above is only indicative, since the OLS error $\sqrt{\sigma^2 / S_{xx}}$ can be made arbitrarily small simply by choosing a bigger time window (the more data we have, the larger $S_{xx}$ will be).

Quite good!

Using the Kalman filter we obtain very smooth values of beta. The estimation error is smaller and smoother as well.

What about the $R^2$?

A possible alternative to the MLE could be to maximize the $R^2$. Let us see if it works.

The following function considers a fixed $\sigma_{\epsilon}$ and searches for the optimal value of $\sigma_{\eta}$ in the range bounds=[[1e-15, 1]].

There are two reasons to calibrate only one parameter.

  • The first reason is that we have already seen that the $\sigma_{\epsilon}$ obtained with OLS is a good value: it is very close to the MLE estimate.

  • The second reason is that the $R^2$ is maximized at $\sigma_{\epsilon} = 0$ and for a high value of $\sigma_{\eta}$. When the measurement error is zero, it means that the measurements are perfect, and the Kalman filter simply selects the values of beta that solve the linear equation $y_k = \alpha + \beta_k x_k$. Of course $\sigma_{\eta}$ should be big enough to allow fast changes of beta. In this case we have an overfitting problem. Using a high $\sigma_{\eta}$ also means that our process model has very low predictive power.

Let us see what happens when calibrating only $\sigma_{\eta}$.

KR.calibrate_R2(mode="pre-fit")
Optimization converged successfully
var_eta = 1e-15
KR.calibrate_R2(mode="post-fit")
Optimization converged successfully
var_eta = 1.0
  • Considering the pre-fit residuals, the optimizer returns $\sigma_{\eta}^2 = 10^{-15}$, which is the lower bound that I imposed. This is not an acceptable value.

  • Considering the post-fit residuals, the output is the upper bound of the parameter range bounds=[[1e-15, 1]]. Overfitting!

By looking at the picture below, we see that the $R^2$ is an increasing function of $\sigma_{\eta}$, i.e. there is no optimal $\sigma_{\eta}$.

xx = np.linspace(0.000001, 10, 100)
RR2 = []
for i in xx:
    KR.run(var_eta=i, var_eps=var_eps_ols)
    RR2.append(KR.R2_post_fit)

plt.plot(xx, RR2, label="$R^2$")
plt.title("R squared of the Kalman filter")
plt.xlabel("var_eta")
plt.legend()
plt.show()
Image in a Jupyter notebook
Well... the $R^2$ maximization is not a good idea! It doesn't work.

Let us see what happens if we run the Kalman filter on the test set using a big value of $\sigma_{\eta}$ and a small value of $\sigma_{\epsilon}$ (here I set $\sigma_{\epsilon}^2$ to var_eps_ols).

var_eta_R2 = 200
KR.run(ret_test[X], ret_test[Y], var_eta=var_eta_R2, var_eps=var_eps_ols)
print("The post-fit R squared is: ", KR.R2_post_fit)
The post-fit R squared is: 0.9572452482415568
plt.figure(figsize=(16, 6))
plt.plot(ret_test.index, betas, label="Kalman Filter alpha-beta")
plt.plot(ret_test["ols_beta"], label="OLS")
plt.plot(ret_test.index, KR.betas, label="Kalman filter with huge $\sigma_{\eta}^2$")
plt.legend()
plt.title("Comparison Beta using $R^2$ parameters")
plt.show()
Image in a Jupyter notebook

The $R^2$ is very close to 1, which means an almost perfect fit. However, the obtained values of $\beta$ are unreasonable.

Conclusions

  • We have seen that using the $R^2$ for the calibration is not a good idea because it produces overfitting.

  • The MLE sometimes works and sometimes does not. Most of the time, the problematic parameter to estimate is $\sigma_{\eta}$.

  • The OLS estimator of $\sigma_{\epsilon}$ is very close to the MLE of $\sigma_{\epsilon}$.

  • The parameter $\sigma_{\eta}$ must be chosen by the filter designer, using reasonable assumptions.

References

[1] Casella, George and Berger, Roger (2001). "Statistical Inference". Duxbury Advanced Series.