Volatility tracking
This is another notebook where I use the Kalman filter to track hidden variables. This time it's the turn of volatility! Well, actually I will track the variance, but the concept is the same.
I will use the values simulated by a Heston process, and implement several methods to track the (hidden) volatility process. We will compare the simple "rolling" method with the very famous GARCH(1,1) method, and we will see that the Kalman filter is slightly superior to both of them.
This notebook also contains a small digression on the Variance Gamma process, which is not related to the main theme; you can skip it if you are not interested.
In this notebook, I will make use of the ARCH library.
Contents
- Heston path generation
- Fit of Variance Gamma and Student t
- A digression on the Variance Gamma (VG) process
- Computing the GARCH with the arch library
- Computing the GARCH from scratch
- Model linearization
- Kalman filter
- Kalman Smoother
- References
Heston path generation
Let us consider a stock price following a Heston process. We introduced the Heston process in the notebook 1.4, but it is better to recall the SDE:

$$dS_t = \mu\, S_t\, dt + \sqrt{v_t}\, S_t\, dW^1_t$$
$$dv_t = \kappa\,(\theta - v_t)\, dt + \sigma\,\sqrt{v_t}\, dW^2_t$$

The process $v_t$ is the variance process and follows a CIR process (see notebook 1.2).
The parameters are:
- $\mu$: drift of the stock process
- $\kappa$: mean reversion coefficient of the variance process
- $\theta$: long term mean of the variance process
- $\sigma$: volatility coefficient of the variance process
- $\rho$: correlation between $W^1_t$ and $W^2_t$, i.e. $dW^1_t\, dW^2_t = \rho\, dt$.
I will simulate the process considering $N=2500$ time points and $T=10$ years. This is quite realistic, as in ten years there are about 2500 trading days.
Let us generate the path using the function `Heston_process.path`.
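For readers who do not have the repository at hand, here is a minimal sketch of an Euler-Maruyama simulation of the Heston SDE above. The parameter values are only illustrative, and the `Heston_process` class in the repository may use a different scheme or interface.

```python
import numpy as np

def heston_paths_euler(S0=100.0, v0=0.04, mu=0.1, kappa=2.0, theta=0.04,
                       sigma=0.3, rho=-0.7, T=10.0, N=2500, seed=42):
    """Euler-Maruyama simulation of the Heston model (full truncation for v)."""
    rng = np.random.default_rng(seed)
    dt = T / N
    S = np.empty(N + 1); v = np.empty(N + 1)
    S[0], v[0] = S0, v0
    # correlated Gaussian increments for W^1 and W^2
    Z1 = rng.standard_normal(N)
    Z2 = rho * Z1 + np.sqrt(1 - rho**2) * rng.standard_normal(N)
    for i in range(N):
        v_pos = max(v[i], 0.0)                                   # full-truncation scheme
        S[i+1] = S[i] * (1 + mu * dt + np.sqrt(v_pos * dt) * Z1[i])
        v[i+1] = v[i] + kappa * (theta - v_pos) * dt + sigma * np.sqrt(v_pos * dt) * Z2[i]
    return S, v

S, v = heston_paths_euler()
log_returns = np.diff(np.log(S))
```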
Plot:
Fit of Variance Gamma and Student t
I will explain in the VG Section why I chose the Variance Gamma distribution among many others.
Let us compute the mean and standard deviation for both log-returns and linear returns. I present the normalized values, i.e. the values rescaled to a unit time interval. Adding the approximated Itô correction $\frac{1}{2}\theta$ to the mean of the log-returns, the two means become almost equal (see below for a better explanation).
Below, you can see the returns histogram with some fitted distributions, the qq-plot and the plot of the log-return process. As expected, the normal distribution underestimates the mass around the mean. From the histogram it is difficult to notice the presence of heavy tails; for this reason I included the qq-plot as well. The plot of the log-return dynamics includes the mean and standard deviation levels. It clearly exhibits volatility clustering.
Plot
From now on, the quantity of interest will be the stock return $R_t$. As we have seen above, there is no big difference between using log-returns or linear returns.
Since we will work with a time series of data, it is better to consider a discrete-time model (with time step $\Delta t = T/N$), which can be derived from the discretization of the SDE we have seen before.
Linear returns, $R_t = \frac{S_t - S_{t-1}}{S_{t-1}}$:
We can discretize the equation of the process $S_t$ in the Heston SDE introduced on top of this notebook (the Euler-Maruyama discretization procedure is presented in the notebook 1.2):

$$R_t = \mu\,\Delta t + \sqrt{v_t\,\Delta t}\;\epsilon_t$$

with $\epsilon_t \sim \mathcal{N}(0,1)$ i.i.d.
Log-returns, $r_t = \log\frac{S_t}{S_{t-1}}$:
You should make the change of variable $X_t = \log S_t$ in the Heston SDE (see what I did in the notebook 1.4) and then discretize:

$$r_t = \Big(\mu - \tfrac{1}{2}\, v_t\Big)\Delta t + \sqrt{v_t\,\Delta t}\;\epsilon_t .$$

In the linear returns equation we have a constant drift $\mu\,\Delta t$, but in the log-return equation the drift $\big(\mu - \tfrac{1}{2} v_t\big)\Delta t$ is stochastic.
Since we are interested in the volatility, there is no need for a drift term. It turns out that it is almost always possible to detrend a time series. If the trend is constant we can simply define $\tilde{R}_t = R_t - \mu\,\Delta t$. If the trend is stochastic, detrending is more complicated but still possible. In particular, in the process above the effect of the stochasticity in the trend is negligible for big values of $t$. Since the trend involves the mean reverting process $v_t$, we can approximate its mean with the long term mean $\theta$. Let us check that this is correct. From the CIR conditional expectation formula, we can use the tower property:

$$\mathbb{E}[v_t] = \mathbb{E}\big[\,\mathbb{E}[v_t \mid v_s]\,\big] = \mathbb{E}\big[\theta + (v_s - \theta)\, e^{-\kappa (t-s)}\big] = \theta + \big(\mathbb{E}[v_s] - \theta\big)\, e^{-\kappa (t-s)} .$$

For $t \gg 0$ ($t$ much bigger than zero) we can write $\mathbb{E}[v_t] \approx \mathbb{E}[v_s]$, since the dependency on the initial information decays as $e^{-\kappa (t-s)}$. Solving the equation we get $\mathbb{E}[v_t] \approx \theta$.
We can detrend the series by redefining $\tilde{r}_t = r_t - \big(\mu - \tfrac{1}{2}\theta\big)\Delta t$. I called the term $\tfrac{1}{2}\theta$ the approximated Itô correction.
We obtain a unified model that works for both linear and log returns (we can suppress the tilde):

$$R_t = \sqrt{v_t\,\Delta t}\;\epsilon_t$$

with $\epsilon_t \sim \mathcal{N}(0,1)$ for each $t$.
Stationarity tests:
- The Augmented Dickey-Fuller (ADF) test: the null hypothesis is that the process has a unit root.
- The Phillips-Perron (PP) test: robust with respect to unspecified autocorrelation and heteroscedasticity; the null hypothesis is that the process has a unit root.
- The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test: the null hypothesis is that the process does NOT have a unit root (i.e. it is stationary).

Let us also consider the Jarque-Bera test for normality and the Ljung-Box test for autocorrelation; a sketch of how to run all of these tests is shown below.
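Here is a minimal sketch of how these tests can be run with statsmodels, arch and scipy, assuming `log_returns` holds the detrended log-returns computed above (the notebook's own cells may organize this differently):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.stats.diagnostic import acorr_ljungbox
from arch.unitroot import PhillipsPerron
from scipy.stats import jarque_bera

r = log_returns                                     # detrended log-returns (assumed)

adf_stat, adf_pval, *_ = adfuller(r)                # H0: unit root
kpss_stat, kpss_pval, *_ = kpss(r, nlags="auto")    # H0: stationary
pp = PhillipsPerron(r)                              # H0: unit root
jb = jarque_bera(r)                                 # H0: normality

print(f"ADF  p-value: {adf_pval:.4f}")
print(f"KPSS p-value: {kpss_pval:.4f}")
print(f"PP   p-value: {pp.pvalue:.4f}")
print(f"JB   p-value: {jb.pvalue:.4f}")

# Ljung-Box on returns and on squared returns (H0: no autocorrelation)
print(acorr_ljungbox(r, lags=[10], return_df=True))
print(acorr_ljungbox(r**2, lags=[10], return_df=True))
```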
Comments:
The plot of the log-return dynamics suggests that the series is weakly stationary (i.e. with constant unconditional mean and variance), but has a non-constant conditional variance. This time series clearly exhibits volatility clustering. The first three tests all confirm the weak stationarity! Since the process has conditional heteroscedasticity, the most appropriate test to use in this case is the Phillips-Perron test, which is very robust with respect to non-constant variances. In [1], it is shown that smooth and periodic variations in the variance have a negligible effect on the ADF test output, which makes it quite reliable as well.
The Jarque-Bera test confirms what we saw in the histogram and in the qq-plot, i.e. that the log-returns distribution is not normal.
The Ljung-Box test on the log-returns shows high p-values, so we cannot reject the null hypothesis of no autocorrelation (by construction, the simulated returns are serially uncorrelated). However, the same test applied to the squared log-returns shows very small p-values, confirming the presence of autocorrelation. Volatility clustering can indeed be detected by checking for correlations between the squared returns.
Parameter choice
When I chose the parameters of the CIR process $v_t$, I had to select $\kappa$, $\theta$ and $\sigma$ such that both the stationarity and the volatility clusters can be detected from the simulated data. The mean reversion parameter $\kappa$ should be big enough to make the process stationary. On the other hand, if $\kappa$ is too big, the variance process reverts too fast towards its long term mean and the log-returns do not exhibit volatility clusters. If the volatility coefficient $\sigma$ is too small, the variance process always stays near the long term mean, and clusters do not form.
The ADF test shows that the process is indeed stationary.
We don't know the solution of the CIR equation, but we know its conditional expectation:

$$\mathbb{E}[v_t \mid v_{t-1}] = \theta\,\big(1 - e^{-\kappa\,\Delta t}\big) + e^{-\kappa\,\Delta t}\, v_{t-1} = \theta\,(1-\phi) + \phi\, v_{t-1}$$

with $\phi = e^{-\kappa\,\Delta t}$. In order to have stationarity we must have $|\phi| < 1$. A small $\kappa$ makes $\phi \approx 1$ and the process becomes similar to a unit root.
From the last expression, we see that the size of $\Delta t$ also matters. When $\Delta t \to 0$ then $\phi = e^{-\kappa\,\Delta t} \to 1$. Choosing a bigger value of $N$ (i.e. a smaller $\Delta t$) may create problems in the detection of stationarity. This is a consequence of the fact that on short time scales a mean-reverting process is indistinguishable from a random walk (remember that in an autoregressive model, when $\phi = 1$ the process is unit root).
A digression on the Variance Gamma (VG) process
Above I plotted the log-returns histogram together with some fitted distributions: the Normal, the Student-t and the VG distributions. You may ask yourself why I decided to include the VG distribution in the plot. If you look carefully, you can see that it fits the data really well!
If you need to review the VG process, have a look at the notebook 1.5. The important thing to know is that the process is obtained by Brownian subordination (i.e. the time in the Brownian motion is assumed to be a gamma process) and is defined as:

$$X_t = \theta\, T_t + \sigma\, W_{T_t}$$

where $T_t$ is a Gamma process with $\mathbb{E}[T_t] = t$ and $\operatorname{Var}[T_t] = \kappa\, t$. I'm using the same letters used for the Heston parameters, but remember that they have a different meaning: $\theta$ and $\sigma$ are the drift and the standard deviation of the Brownian motion, and $\kappa$ is the variance coefficient of the Gamma process.
Alright, let us try to answer the following three questions:
How good is the VG fit?
How can we measure the quality of the fit?
Why the VG process among many others?
In order to answer the first two questions, let us define the Empirical Cumulative Distribution Function (ECDF):

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \leq x\}} .$$

Unlike the histogram, the ECDF has the advantage that it does not depend on the number of bins. We can compare the ECDF with the Normal, Student-t and VG theoretical cumulative distributions.
Plot
Looking at the plot above, we can understand better which distribution fits the data best.
In order to have a quantitative measure of the goodness of fit, we can calculate the Euclidean distance between the theoretical and the empirical CDFs.
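A minimal sketch of this computation for the Normal and Student-t fits; the VG CDF would be evaluated with the notebook's own VG functions, which are not reproduced here, and `log_returns` is assumed from above:

```python
import numpy as np
from scipy import stats
from statsmodels.distributions.empirical_distribution import ECDF

r = log_returns                          # detrended log-returns (assumed)
x = np.sort(r)
ecdf = ECDF(r)(x)                        # empirical CDF evaluated on the sorted data

# fit and evaluate the candidate theoretical CDFs
mu_n, sd_n = stats.norm.fit(r)
cdf_norm = stats.norm.cdf(x, mu_n, sd_n)
df_t, loc_t, sc_t = stats.t.fit(r)
cdf_t = stats.t.cdf(x, df_t, loc_t, sc_t)
# (the VG CDF would be added here with the notebook's own VG functions)

dist_norm = np.linalg.norm(ecdf - cdf_norm)   # Euclidean distance ECDF vs Normal
dist_t = np.linalg.norm(ecdf - cdf_t)         # Euclidean distance ECDF vs Student-t
print(dist_norm, dist_t)
```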
Now we have the confirmation that the VG distribution works very well!!
Ok, but are there any theoretical motivations?
The answer is yes!
The variance of the Heston process follows the CIR process. I introduced the CIR process in the notebook 1.2, where we saw that the conditional distribution of the process is a non-central chi-squared distribution. However, as time becomes large, the asymptotic distribution tends to the Gamma distribution:

$$v_{\infty} \sim \Gamma(a, b)$$

where $a = \frac{2\kappa\theta}{\sigma^2}$ is the shape parameter and $b = \frac{\sigma^2}{2\kappa}$ is the scale parameter (so that $\mathbb{E}[v_\infty] = a\, b = \theta$).
In the VG process $X_t = \theta\, T_t + \sigma\, W_{T_t}$, the variance of the Brownian motion is Gamma distributed. I will use a heuristic Brownian approximation. Since $\mathbb{E}[X_t] = \theta\, t$ and, by the law of total variance, $\operatorname{Var}[X_t] = \sigma^2\,\mathbb{E}[T_t] + \theta^2\operatorname{Var}[T_t] = (\sigma^2 + \theta^2 \kappa)\, t$, we can define $\bar{\sigma}^2 = \sigma^2 + \theta^2 \kappa$ and write:

$$X_t \approx \theta\, t + \bar{\sigma}\, W_{T_t} .$$

The approximated model has the same mean and the same variance of the original VG process.
The process $\bar{\sigma}\, W_{T_t}$ is known as the symmetric Variance Gamma process (the parameter $\theta$ is responsible for the skewness; if $\theta = 0$ the density is symmetric).
The distribution of the approximated process can be obtained from `VG_pdf`, the VG density, by setting the parameters $\theta = 0$ and $\sigma = \bar{\sigma}$.
In the plot above, we plotted both the VG and the approximated VG densities. It is very hard to distinguish them: in this case the approximation is very good!
The conditional variance of the symmetric VG is:

$$\operatorname{Var}\big[\bar{\sigma}\, W_{T_t} \,\big|\, T_t\big] = \bar{\sigma}^2\, T_t$$

which is Gamma distributed, since $T_t$ is.
Therefore, in both cases 1) and 2) above (the asymptotic CIR variance and the conditional variance of the symmetric VG), the variance is described by a Gamma distribution. Let us plot the two Gammas, together with a Gamma distribution fitted from the data.
The first plot shows the two theoretical Gammas obtained from the VG and CIR models. It also shows a Gamma distribution with parameters fitted from the data.
The second plot compares the cumulative density functions.
In the third plot, you can see the simulated VG increments (with parameters fitted from the Heston log-returns).
I decided to plot the VG dynamics because it is important to understand that, although the VG density describes the empirical log-return density very well, the VG process is NOT able to reproduce the statistical features of the Heston process, i.e. the volatility clusters!!
Important: The plots above depend on the initial choice of the parameters $\kappa$, $\theta$ and $\sigma$.
If you try to modify the parameters, you may find that the CIR-asymptotic-Gamma distribution is a much better fit than the VG-Gamma distribution; other parameter choices can imply the opposite.
Conclusive comment:
For any choice of the CIR parameters, the VG distribution is always better than the Normal or Student-t, when fitting the log-returns!
In the previous section we obtained the stochastic model for the returns, $R_t = \sqrt{v_t\,\Delta t}\;\epsilon_t$, as a result of the explicit discretization of the stock process SDE. However, the most common discrete-time models prefer to consider the expression:

$$R_t = \sigma_t\, \epsilon_t$$

where $\sigma_t$ can be deterministic or stochastic. Usually, in deterministic models such as ARCH and GARCH, the variance is a function of variables at previous times.
The difficult task is to estimate the variance value $\sigma_t^2$. Let us try two common methods:
Let us consider the GARCH(1,1) model:

$$\sigma_t^2 = \omega + \alpha\, R_{t-1}^2 + \beta\, \sigma_{t-1}^2$$

where $\omega = \gamma\, V_L$. The parameter $V_L$ is the long term variance (unconditional variance). The weights must satisfy $\gamma + \alpha + \beta = 1$, $\omega > 0$ and $\alpha, \beta \geq 0$. After having fitted the parameters $\omega$, $\alpha$ and $\beta$, it is possible to compute the other parameters as:

$$\gamma = 1 - \alpha - \beta, \qquad V_L = \frac{\omega}{1 - \alpha - \beta} .$$

If $\alpha + \beta \geq 1$ the unconditional variance is not well defined. This means that the GARCH process is not stationary anymore, i.e. it is unit root.
Usually the GARCH(1,1) is sufficient to capture the volatility clustering in the data. There is no need for a higher order GARCH(p,q) with $p, q > 1$.
Another possibility is to compute the rolling variance. Using a time window of $M$ points, we can compute:

$$\hat{\sigma}_t^2 = \frac{1}{M} \sum_{i=1}^{M} R_{t-i}^2$$

(the returns have already been detrended, so there is no need to subtract the sample mean).
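A one-line sketch with pandas; the window length `M = 50` is an illustrative choice, not necessarily the one used in the notebook:

```python
import pandas as pd

M = 50                                          # window length (illustrative choice)
R = pd.Series(log_returns)                      # detrended returns (assumed from above)
rolling_var = (R**2).rolling(window=M).mean()   # zero-mean rolling variance
# or, removing the sample mean inside each window:
# rolling_var = R.rolling(window=M).var()
```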
Computing the GARCH with the arch library
The arch library gives the possibility to estimate the mean as well:

$$R_t = \mu + \sigma_t\, \epsilon_t$$

But we don't care about it, and we detrend the series: $R_t \to R_t - \hat{\mu}$.
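A minimal sketch of the arch call whose output is discussed below, assuming `ret` holds the detrended returns and `dt`, `train_size` are defined as in the text:

```python
from arch import arch_model

scale_factor = 1 / dt                        # manual rescaling, as discussed below
train = ret[:train_size] * scale_factor      # detrended returns on the training window (assumed)

am = arch_model(train, mean="Zero", vol="GARCH", p=1, q=1, dist="normal", rescale=False)
res = am.fit(disp="off")
print(res.summary())

omega, alpha, beta = res.params["omega"], res.params["alpha[1]"], res.params["beta[1]"]
long_run_var = omega / (1 - alpha - beta) / scale_factor**2    # V_L back on the original scale
```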
Comments on the output below:
- I chose a quite big value `train_size = 1500`. If we want to compare this number with realistic data, it corresponds to about 6 years of daily data. The reason for this big number is that, in order to compute the parameter $\omega$ (proportional to the long term variance) with good precision (small standard error), we need a lot of data.
- If you set `mean="constant"`, the parameter $\mu$ will be estimated. You will find a $\mu$ very close to zero, or a big p-value.
- You can try to change the GARCH values of `p` and `q`. However, in this example, $p=1$ and $q=1$ produce the smallest AIC, i.e. they provide the best model.
- The optimizer requires to rescale the data in some cases (see the discussion on stackoverflow). It is possible to set `rescale=True` to automatically rescale the data, but I prefer to do it manually by introducing a `scale_factor = 1/dt`.
- If you try to set `dist="StudentsT"`, you will obtain a quite big value of the estimated degrees of freedom $\nu$, confirming that the data was generated by Gaussian errors.
If $\alpha + \beta$ is close to 1, as is typically the case, then the model is not very sensitive to the long term variance term $V_L$ for short horizons; however, this term becomes important for longer horizons.
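This sensitivity statement can be read off the standard k-step-ahead GARCH(1,1) variance forecast (a textbook identity, quoted here for reference, not derived in this notebook):

$$\mathbb{E}\big[\sigma_{t+k}^2 \,\big|\, \mathcal{F}_t\big] \;=\; V_L \;+\; (\alpha+\beta)^k\,\big(\sigma_t^2 - V_L\big) .$$

For small $k$ the forecast is dominated by the current variance $\sigma_t^2$, while for large $k$ it converges to the long term variance $V_L = \omega/(1-\alpha-\beta)$, and the convergence is slow when $\alpha+\beta$ is close to 1.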
Computing the GARCH from scratch
I implemented the GARCH(1,1) class in Processes.py. The class contains the methods:
- `path`: it uses a Monte Carlo simulation to generate a GARCH(1,1) process.
- `log_likelihood`: it computes the log-likelihood function and, optionally, returns the last variance value.
- `fit_from_data`: it fits the GARCH(1,1) parameters from a time series.
- `generate_var`: it generates the variance process from a series of returns.
Let us create a GARCH process and fit the parameters.
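The following is a minimal sketch of the same approach (SLSQP on the Gaussian negative log-likelihood, as detailed in the remarks below). It is not the Processes.py implementation; the starting point and the choice of the initial variance are assumptions made for illustration, and `train` is assumed from the arch sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def garch_neg_loglik(params, R, var0):
    """Negative Gaussian log-likelihood of a zero-mean GARCH(1,1) on the returns R."""
    omega, alpha, beta = params
    sig2 = np.empty(len(R))
    sig2[0] = var0                           # fixed initial variance (assumption)
    for t in range(1, len(R)):
        sig2[t] = omega + alpha * R[t-1]**2 + beta * sig2[t-1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + R**2 / sig2)

R = np.asarray(train)                        # rescaled training returns (assumed)
var0 = R.var()                               # sample variance as initial value (assumption)
x0 = np.array([0.1 * var0, 0.10, 0.80])      # illustrative starting point
bounds = [(1e-12, None), (0.0, 1.0), (0.0, 1.0)]
constraints = {"type": "ineq", "fun": lambda p: 1.0 - p[1] - p[2]}   # alpha + beta < 1
opt = minimize(garch_neg_loglik, x0, args=(R, var0), method="SLSQP",
               bounds=bounds, constraints=constraints)
print(opt.x)                                 # fitted (omega, alpha, beta)
```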
The output of my algorithm is a little different from that of the arch library:
- I didn't check the settings in the arch library. I only know that we are both using `scipy.optimize.minimize` with SLSQP to find the minimum of the negative log-likelihood.
- I used the formula for the (Gaussian) log-likelihood:

$$\log L(\omega, \alpha, \beta) = -\frac{1}{2} \sum_{t=1}^{N} \left[ \log(2\pi) + \log \sigma_t^2 + \frac{R_t^2}{\sigma_t^2} \right]$$

where the variance $\sigma_t^2$ is calculated recursively, and I assumed a fixed initial value $\sigma_0^2$.
- The arch library uses a robust method for the estimation of the covariance matrix of the parameters; the provided standard errors are thus robust. Instead, I simply implemented the formula:

$$\operatorname{Cov}\big(\hat{\omega}, \hat{\alpha}, \hat{\beta}\big) \approx \mathcal{I}\big(\hat{\omega}, \hat{\alpha}, \hat{\beta}\big)^{-1}$$

where $\mathcal{I}$ is the Fisher information matrix. The observed Fisher information matrix is simply the negative Hessian of the log-likelihood at the optimum, $-\nabla^2 \log L$.
The inverse of the (negative) Hessian of the log-likelihood is an estimator of the asymptotic covariance matrix. Hence, the square roots of the diagonal elements of the covariance matrix are estimators of the standard errors.
I used `statsmodels.tools.numdiff.approx_hess` for the numerical computation of the Hessian.
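A sketch of this standard-error computation, reusing `opt`, `garch_neg_loglik`, `R` and `var0` from the from-scratch sketch above:

```python
import numpy as np
from statsmodels.tools.numdiff import approx_hess

# Hessian of the NEGATIVE log-likelihood at the optimum = observed Fisher information
H = approx_hess(opt.x, garch_neg_loglik, args=(R, var0))
cov = np.linalg.inv(H)             # estimate of the asymptotic covariance matrix
std_err = np.sqrt(np.diag(cov))    # standard errors of (omega, alpha, beta)
print(std_err)
```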
We can check which set of parameters is better:
It seems that the ARCH implementation provides better parameters in terms of maximum likelihood.
I compared my implementation with ARCH on several simulated datasets, and the results are always very similar. Of course, as expected, the ARCH library is more accurate, as we can see from the results above.
We can use the parameters fitted by the ARCH method `fit()` to track the variance, using the GARCH recursion algorithm `generate_var`.
Model linearization:
Let us consider the model for the returns defined before:

$$R_t = \sqrt{v_t\,\Delta t}\;\epsilon_t .$$

It is quite useful to introduce an additional scale parameter $b$, such that the model becomes:

$$R_t = b\,\sqrt{v_t\,\Delta t}\;\epsilon_t .$$

This model is multiplicative. The first thing that comes to mind is to take the log transform to make it linear!! If you remember, already in the notebook 1.4 we found it useful to define the log-variance $x_t = \log v_t$.
In order to remove the square root, we can square the equation on both sides, and then take the logarithm:

$$y_t = \log\big(b^2\,\Delta t\big) + x_t + w_t$$

where we called $y_t = \log R_t^2$ and $w_t = \log \epsilon_t^2$.
We want to create a state space model with a hidden state variable (see notebook 5.1 for a review of the general theory). The hidden variable in this model is the log-variance $x_t$. We can assume it follows an autoregressive AR(1) process:

$$x_{t+1} = c + \phi\, x_t + \eta_t$$

with $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2)$ i.i.d. However, the resulting system (state equation in $x_t$ and measurement equation in $y_t$) is NOT a linear Gaussian state space model. The reason is that the random variable $w_t = \log \epsilon_t^2$ is not Gaussian, but follows the log-chi-squared distribution with 1 degree of freedom. It satisfies $\mathbb{E}[w_t] \approx -1.27$ and $\operatorname{Var}[w_t] = \frac{\pi^2}{2}$.
In order to make the system a linear Gaussian state space model, we can approximate $w_t$ with a Gaussian random variable with the same mean and variance:

$$w_t \approx \tilde{w}_t \sim \mathcal{N}\!\left(-1.27,\; \frac{\pi^2}{2}\right).$$
Let us check if this approximation is good:
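A quick numerical check by simulation (a sketch; the notebook compares the densities graphically):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = np.log(rng.standard_normal(1_000_000)**2)   # samples of log(eps^2), eps ~ N(0,1)

print(w.mean(), w.var())                        # approx -1.27 and pi^2/2 = 4.93
print(-1.27, np.pi**2 / 2)

# compare a few quantiles with the Gaussian approximation N(-1.27, pi^2/2)
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(w, qs))
print(stats.norm.ppf(qs, loc=-1.27, scale=np.sqrt(np.pi**2 / 2)))
```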
Well... it doesn't seem so good at first sight. But it has been shown in many articles (see references in [3] and [4]) that this approximation is good enough to produce a nice output. Indeed, this method is quite common in practice.
Gaussian linear state space model:
State equation: $x_{t+1} = c + \phi\, x_t + \eta_t$, with $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2)$.
Measurement equation: $y_t = d + x_t + \tilde{w}_t$, with $d = \log(b^2\,\Delta t) - 1.27$ and $\tilde{w}_t \sim \mathcal{N}\!\left(0, \frac{\pi^2}{2}\right)$.
Kalman filter:
Let us identify the Kalman filter elements. You can find them in the notebook 5.1 or in the wiki page.
In this model the coefficients of the state and measurement equations ($c$, $\phi$, $d$) and the noise variances ($\sigma_\eta^2$, $\frac{\pi^2}{2}$) are constant: they do not depend on $t$.
Predict step:

$$\hat{x}_{t|t-1} = c + \phi\, \hat{x}_{t-1|t-1}, \qquad P_{t|t-1} = \phi^2\, P_{t-1|t-1} + \sigma_\eta^2$$

Auxiliary variables (innovation, innovation variance and Kalman gain):

$$r_t = y_t - d - \hat{x}_{t|t-1}, \qquad S_t = P_{t|t-1} + \frac{\pi^2}{2}, \qquad K_t = \frac{P_{t|t-1}}{S_t}$$

Update step:

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t\, r_t, \qquad P_{t|t} = (1 - K_t)\, P_{t|t-1}$$

To familiarize ourselves with these formulas, let us compute the conditional expectation and variance of the innovations, i.e. of the new measurements:

$$\mathbb{E}\big[y_t \mid \mathcal{F}_{t-1}\big] = d + \hat{x}_{t|t-1}, \qquad \operatorname{Var}\big[y_t \mid \mathcal{F}_{t-1}\big] = P_{t|t-1} + \frac{\pi^2}{2} = S_t .$$
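A minimal sketch of the resulting scalar Kalman filter, using the parameterization $(c, \phi, \sigma_\eta^2, d)$ written above; the notebook's own function may be parameterized differently (e.g. in terms of the scale). It also accumulates the log-likelihood terms used in the next subsection.

```python
import numpy as np

def kalman_filter_sv(y, c, phi, sig_eta2, d, x0, P0):
    """Scalar Kalman filter for the linearized SV model:
       y[t] = d + x[t] + noise,  Var(noise) = pi^2/2
       x[t+1] = c + phi*x[t] + eta[t],  Var(eta) = sig_eta2
    """
    R = np.pi**2 / 2                    # measurement noise variance (log-chi2 approximation)
    N = len(y)
    x_filt = np.empty(N); P_filt = np.empty(N)
    loglik = 0.0
    x_pred, P_pred = x0, P0
    for t in range(N):
        # auxiliary variables: innovation, innovation variance, Kalman gain
        r = y[t] - d - x_pred
        S = P_pred + R
        K = P_pred / S
        loglik += -0.5 * (np.log(2 * np.pi) + np.log(S) + r**2 / S)
        # update step
        x_filt[t] = x_pred + K * r
        P_filt[t] = (1 - K) * P_pred
        # predict step (for the next observation)
        x_pred = c + phi * x_filt[t]
        P_pred = phi**2 * P_filt[t] + sig_eta2
    return x_filt, P_filt, loglik
```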
Log-likelihood
The parameters can be estimated using the QML method (Quasi-Maximum-Likelihood), where the "Quasi" refers to the Gaussian approximation of $w_t$. The log-likelihood function is:

$$\log L = -\frac{1}{2} \sum_{t=1}^{N} \left[ \log(2\pi) + \log S_t + \frac{r_t^2}{S_t} \right].$$
Above I chose an initial hidden state $\hat{x}_0$ equal to the logarithm of the variance computed in the training set. Since we have no knowledge of the true value of $x_0$, it is better to select a huge value of the initial error $P_0$.
In the function below, I indicate the scale parameter $b$ with `scale`, and the residuals (innovations) $r_t$ with `r`.
The log-likelihood function may have multiple local minima. For this reason, remember to repeat the minimization several times with different initial conditions. (For instance, if you try `x0=[0.1, 0.5, 1]` you will get different locally optimal values.)
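A sketch of such a multi-start minimization, built on the `kalman_filter_sv` sketch above. Here `y`, the initial state `x0_state` and the initial error `P0` are assumed to be defined as described in the text, and the parameterization `(c, phi, sig_eta2)` with a fixed measurement constant is an illustrative simplification (the notebook estimates a scale parameter instead).

```python
import numpy as np
from scipy.optimize import minimize

def neg_qml(params, y, x0_state, P0):
    """Illustrative negative QML objective: params = (c, phi, sig_eta2).
       The measurement constant d is held fixed at -1.27 for simplicity, so the level
       of the log-variance is captured by c (an assumption, not the notebook's choice)."""
    c, phi, sig_eta2 = params
    _, _, loglik = kalman_filter_sv(y, c, phi, sig_eta2, d=-1.27, x0=x0_state, P0=P0)
    return -loglik

best = None
for start in ([0.1, 0.5, 1.0], [0.0, 0.9, 0.1], [-0.5, 0.95, 0.05]):  # several initial guesses
    res = minimize(neg_qml, start, args=(y, x0_state, P0), method="L-BFGS-B",
                   bounds=[(None, None), (-0.999, 0.999), (1e-8, None)])
    if best is None or res.fun < best.fun:
        best = res
print(best.x, -best.fun)      # parameters and log-likelihood of the best run
```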
Parameter estimation:
Kalman Smoother:
Let's implement the RTS (Rauch-Tung-Striebel) smoother:
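For reference, a minimal sketch of the scalar RTS backward recursion, consistent with the filter sketched above (the notebook's implementation may differ in details):

```python
import numpy as np

def rts_smoother(x_filt, P_filt, c, phi, sig_eta2):
    """Scalar Rauch-Tung-Striebel smoother, run backwards over the filtered output."""
    N = len(x_filt)
    x_smooth = np.copy(x_filt)
    P_smooth = np.copy(P_filt)
    for t in range(N - 2, -1, -1):
        x_pred = c + phi * x_filt[t]              # one-step prediction from time t
        P_pred = phi**2 * P_filt[t] + sig_eta2
        C = phi * P_filt[t] / P_pred              # smoother gain
        x_smooth[t] = x_filt[t] + C * (x_smooth[t+1] - x_pred)
        P_smooth[t] = P_filt[t] + C**2 * (P_smooth[t+1] - P_pred)
    return x_smooth, P_smooth

# usage, reusing the filtered output and parameters from the sketches above:
# x_smooth, P_smooth = rts_smoother(x_filt, P_filt, c, phi, sig_eta2)
```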
Plots:
It is not easy to compare the methods by looking at the plot. Let's have a look at the MSE (mean square error):
The Kalman filter approach is the best!
References
[1] Giuseppe Cavaliere (2006) - "Unit root tests under time-varying variances"
[2] Taylor, S. J. (1986) - "Modelling Financial Time Series"
[3] Harvey, A. C., and Shephard, N. (1993) - "Estimation and Testing of Stochastic Variance Models"
[4] Harvey, A. C., Ruiz, E., and Shephard, N. (1994) - "Multivariate Stochastic Variance Models"
[5] Ruiz, E. (1994) - "Quasi-Maximum Likelihood Estimation of Stochastic Volatility Models"