Jupyter notebook Class Notes/Final Project Folder/SPY_final_project.ipynb
Predicting the daily returns and directions of returns of SPY using technical indicators and fundamental indicators.
By Seine Yumnam - 2017
In this project, I will predict the daily returns and the direction of returns of SPY, with the ultimate goal of building a trading strategy. Daily returns are the percentage changes in price from one day to the next. Direction is simply the sign of the percentage change, indicating an up or a down day. SPY is an exchange-traded fund (ETF) that tracks the S&P 500. It is a stock of stocks. We will be using multiple technical and fundamental indicators. The technical indicators are calculated in this notebook; the fundamental indicators are taken from a paper titled "A Practitioner's Defense of Return Predictability" by Blair Hull and Xiao Qiao. The paper can be found at SSRN - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2609814
Fortunately, Hull Investments also provides the data used in the paper. The Excel file can be found here - http://www.hullinvest.com/HI/our-approach/
What are technical indicators?
Technical indicators are mathematical manipulations of the price and volume series of a stock.
What are fundamental indicators?
Fundamental indicators are features of a company such as its revenue, earnings, cost of raw materials, profits, etc.
Trading Strategy
Before we start cleaning data, creating new variables and running the machine learning algorithm, let's quickly go through the trading strategy we are planning to build. The process is as follows -
Design models for predicting the daily returns of SPY or the direction of the returns.
At time t, if the model predicts that the return for t+1 will be positive, then we go long SPY at market close.
At time t, if the model predicts that the return for t+1 will be negative, then we go short SPY at market close.
Predict every day and adjust the position every day.
What does this mean?
We will have two types of models: one that predicts the percentage change in price and one that predicts whether SPY will be up or down. If the model predicts that tomorrow's price will be higher than today's, we buy SPY and make money if the price actually goes up. If the model predicts that the price will be lower, we sell SPY short and make money if the price actually goes down. We then add up the profit or loss we make every day to find the cumulative total, i.e. the amount of money we have made over the course of time. For this project, we will assume that we start with 1 dollar. If at the end of the trading period we have more than 1 dollar, then our strategy has made money. If the ending amount is less than 1 dollar, then we have lost money from trading.
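The long/short rule described above can be sketched in a few lines. This is a minimal illustration and not the notebook's original code: the function name `equity_curve` and the toy numbers are assumptions, and daily profit and loss is summed rather than compounded because the text says the daily results are added up.

```python
import numpy as np

def equity_curve(predicted_returns, actual_returns, start_capital=1.0):
    """Long when the predicted return is positive, short when it is negative."""
    predicted = np.asarray(predicted_returns, dtype=float)
    actual = np.asarray(actual_returns, dtype=float)
    positions = np.sign(predicted)          # +1 long, -1 short, 0 flat
    strategy_returns = positions * actual   # daily profit or loss per dollar traded
    # Daily P&L is summed (not compounded), as described in the text.
    return start_capital + np.cumsum(strategy_returns)

# Toy numbers, made up for illustration only.
predicted = [0.002, -0.001, 0.003, -0.004]
actual = [0.010, 0.005, -0.002, -0.008]
curve = equity_curve(predicted, actual)
print(curve)
```

With these toy inputs the strategy is right on the first and last day and wrong on the two middle days, so the curve ends slightly above the starting dollar, at about 1.011.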
Data and variables
As mentioned before, all the fundamental data is taken from Hull Investments. The data they provide only goes up to 5/4/2015, so my dataset does not go beyond this date. The price series and the volume series are taken from the CBOE website and YCharts.
Technical indicators:
Some of these might be called quantitative indicators as well, especially the 14th and 15th indicators.
SPY_prior_ret – 1-day lagged return of SPY. If we are predicting the return of time t, this will be the return of t-1. Daily returns are calculated from close to close.
VIX_prior_ret – 1-day lagged return of CBOE VIX. VIX is calculated using the implied volatility of 30 days out SPX options.
MSCI_EM_prior_ret – 1-day lagged return for the emerging market.
SPY_trend – this is used to define whether SPY is in an uptrend or a downtrend. The calculation will be shown in Python code later in the notebook.
RSI – Relative strength index.
MACD_signal – crossover of MACD and the 9-ema MACD. MACD is Moving average convergence divergence.
ROC – Rate of change.
S0d – Stochastic oscillator.
IBS – Internal Bar Strength indicator. This is based on an article published by Jonathan Kinlay. The calculation will be shown in Python code.
excess_volume_prior – this is daily volume minus the 30-day average volume expressed as a percentage of the 30-day average volume.
SPY_Z_score – this is the expanding z-score of SPY returns. While calculating this, we must not use the entire dataset to calculate the mean and the standard deviation. If we do, our work will suffer from data snooping, because each day's score would depend on observations from the future. Thus, I calculated the z-score using only the data available at each point in time.
Data from Hull and Qiao:
12. Moving Average (MA_prior): buy and sell rules based on the relative levels of the current price versus the past 10-month simple moving average (Faber 2007).
13. Oil Price Shocks (OIL_prior): OIL is constructed as the log of the current front oil futures (Casassus and Higuera 2011).
14. Implied Correlation (IC_prior) is the average equity options-implied correlation (Driessen, Maenhout, and Vilkov 2013).
15. Variance Risk Premium (VRP_prior) is VIX squared minus the five-minute realized variance (Bollerslev, Tauchen, and Zhou 2009). Hull and Qiao used VIX minus the volatility forecast from a GARCH model following Yang and Zhang’s style (2000).
16. SIM_prior - Sell in May and Go Away (Bouman and Jacobsen 2002 and Doeswijk 2009).
17. PCR - Ratio of Stock Price to Commodity Price (Black et al. 2014).
18. PCA.tech – principal component analysis of technical indicators (Neely et al. 2014).
Fundamental data from Hull and Qiao:
DP_prior is Dividend-Price Ratio (Fama and French 1988).
PE_prior is Price-to-Earnings Ratio (Campbell and Shiller 1988).
BM_prior is Book-to-Market Ratio (Pontiff and Schall 1998).
CAPE_prior is Cyclically Adjusted Price to Earnings Ratio (Shiller 2000).
PCA.price_prior is the Principal Component of Price Ratios, which combines DP, PE, BM, and CAPE.
BY_prior is Bond Yield (Pastor and Stambaugh 2009).
DEF_prior is Default Spread, which is the difference between the Baa and Aaa corporate bond yields (Fama and French 1989).
Term Spread (TERM_prior) is the yield difference between the 10-year Treasury Note and the three-month Treasury Bill (Fama and French 1989).
Cointegrating Residual of Consumption, Assets, and Wealth (CAY_prior) is cointegrating residual of log consumption, assets, and wealth (Lettau and Ludvigson 2001).
BDI_prior is Baltic Dry Index (Bakshi, Panayotov, and Skoulakis 2011).
NOS_prior is New Orders/Shipments (Jones and Tuzel 2012).
CPI_prior is Consumer Price Index (Campbell and Vuolteenaho 2004).
Short Interest (SI_prior): the average of short interest divided by total shares outstanding of individual stocks (Rapach, Ringgenberg, and Zhou 2015). In Hull and Qiao’s paper, they calculated this as the sum of all shares short on the NYSE divided by the average daily trading volume over the past 30 days.
Target variable
We will be looking to solve two different problems. The first one is to predict the daily returns and the other one is to predict the direction of market movements. So, we will be working with both classification and regression models. Daily returns are the percentage change in price on a day to day basis. Direction is simply the sign of the percentage change. So, it will be -1 for negative percentage change and 1 for positive percentage change.
Let’s now work on lagging the features. As mentioned before, we are trying to predict the return/direction of SPY at time t using the data we have at time t-1. So, we need to organize the pandas DataFrame in such a way that the return at time t is matched row-wise with the features from time t-1.
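A minimal sketch of this alignment, assuming a pandas DataFrame indexed by date. The prices below are made up, and the column names `SPY_close`, `SPY_ret` and `direction` are illustrative assumptions; only `SPY_prior_ret` comes from the feature list above. Treating a zero return as a down day is also an assumption, since the text only defines the sign for positive and negative changes.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices; the real notebook uses the SPY series.
prices = pd.DataFrame(
    {"SPY_close": [100.0, 101.0, 99.99, 102.0, 102.0]},
    index=pd.date_range("2015-01-05", periods=5, freq="B"),
)

data = pd.DataFrame(index=prices.index)
data["SPY_ret"] = prices["SPY_close"].pct_change()        # close-to-close return at time t
data["SPY_prior_ret"] = data["SPY_ret"].shift(1)          # feature: the return at time t-1
data["direction"] = np.where(data["SPY_ret"] > 0, 1, -1)  # +1 up day, -1 otherwise (assumption for zero)
data = data.dropna()                                      # drop rows without a full t-1 history
print(data)
```

After `shift(1)`, each row holds the target for day t next to a feature that was already known at the close of day t-1, so nothing from day t leaks into the inputs.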
Let's start calculating technical indicators for the SPY time series. These are just mathematical manipulations of the price series and the volume data. We will create 8 different technical indicators based on the OHLC (open, high, low, close) and volume data series of SPY. The 8 technical indicators are as follows. A brief description of each indicator will be added as we create it.
Technical Indicators for the price series - Trend, Relative Strength Index (RSI), Moving Average Convergence Divergence (MACD), Z-score of returns, Rate of Change (ROC), Stochastic Oscillator (S0K / S0d).
Technical Indicators for the volume series - Excess-volume, On Balance Volume.
For more details on each indicator, please go to stockcharts.com
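Three of the indicators described earlier (IBS, excess volume, and the expanding z-score) are simple enough to sketch directly. This is an illustrative version under stated assumptions, not the notebook's original code: the function names are hypothetical, IBS uses its usual definition of (close - low) / (high - low), and the rolling average is assumed to include the current day.

```python
import pandas as pd

def internal_bar_strength(high, low, close):
    """IBS = (close - low) / (high - low): 0 at the day's low, 1 at the day's high."""
    return (close - low) / (high - low)

def excess_volume(volume, window=30):
    """Volume minus its rolling average, as a fraction of that average (x100 for percent)."""
    average = volume.rolling(window).mean()
    return (volume - average) / average

def expanding_z_score(returns):
    """Z-score that uses only observations up to and including each date."""
    mean = returns.expanding(min_periods=2).mean()
    std = returns.expanding(min_periods=2).std()
    return (returns - mean) / std

# Toy inputs, made up for illustration.
ibs = internal_bar_strength(
    pd.Series([10.0, 12.0]), pd.Series([8.0, 10.0]), pd.Series([9.0, 12.0])
)
ev = excess_volume(pd.Series([100.0, 100.0, 100.0, 200.0]), window=3)
z = expanding_z_score(pd.Series([0.01, -0.01, 0.03]))
print(ibs.tolist(), ev.tolist(), z.tolist())
```

Because `expanding()` only looks backward, the z-score on any date never depends on later returns, which is the property the SPY_Z_score description asks for. Days where the high equals the low would need separate handling, since the IBS denominator is zero.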
Let's start building models
TimeSeriesSplit - Train/test splitting the time series data
In time series, the order of the data is of utmost importance, so a standard k-fold split does not work for time series prediction: it would let the model train on observations that come after the ones it is tested on. Instead, we need a time series split in which the temporal order is preserved. We have 5217 rows in the dataset. We will use the first 4000 rows as the training set and the remaining 1217 rows as the test set.
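A minimal sketch of an order-preserving split with the row counts quoted above. The placeholder frame stands in for the real feature table, and the use of scikit-learn's `TimeSeriesSplit` with 5 folds inside the training set is an assumption about how one might cross-validate, not something the text confirms.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

n_rows, n_train = 5217, 4000   # sizes quoted in the text

# Placeholder frame standing in for the real, date-ordered feature table.
data = pd.DataFrame({"feature": np.arange(n_rows), "target": np.arange(n_rows)})

# Simple chronological hold-out: no shuffling, the test set follows the train set.
train = data.iloc[:n_train]
test = data.iloc[n_train:]

# Optional: expanding-window folds inside the training set (n_splits is an assumption).
tscv = TimeSeriesSplit(n_splits=5)
for fold, (fit_idx, val_idx) in enumerate(tscv.split(train)):
    print(fold, fit_idx.max(), val_idx.min(), val_idx.max())
```

In every fold the validation rows come strictly after the fitting rows, which is what distinguishes this from a shuffled k-fold split.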
How do we compare models?
Most of the research papers I have read evaluate their models purely from a modeling perspective and not from a trading perspective. A paper studying regression models would end by comparing the Root Mean Square Error (RMSE) of the models and ranking them on that basis. For classification models, it would end with a confusion matrix or accuracy and precision scores and rank the models accordingly. I find this approach economically uninformative, and I believe that Hull and Qiao would agree with me. I will still look at RMSE for regression models and the confusion matrix for classification models when optimizing them; however, I won’t place as much emphasis on these metrics as other research papers have done. Instead, I will look at the equity curve of the trading strategy each model produces and compare the models on Annualized Return, Annualized Volatility, and Sharpe Ratio.
The annualized return is the geometric average amount of money earned by an investment each year over a given time period. It is calculated as a geometric average to show what an investor would earn over a period of time if the annual return was compounded. Volatility is the standard deviation of the returns. It measures how stable the profits of our trading strategy are. If we lose a lot on some days and make a lot on others, then the volatility is high. We don't want that; we want stability in our daily profits. The Sharpe ratio is the risk-adjusted return, and here it is calculated by dividing the annualized return by the volatility. The higher it is, the better the trading strategy.
So, we want a high annualized return with low volatility so that we get a high Sharpe ratio.
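The three comparison metrics can be sketched as follows. This is an illustration under assumptions rather than the notebook's own implementation: the function name is hypothetical, a year is taken to have 252 trading days, and the Sharpe ratio ignores the risk-free rate because the text defines it as annualized return divided by volatility.

```python
import numpy as np

TRADING_DAYS = 252  # assumed number of trading days per year

def performance_summary(daily_returns):
    """Annualized geometric return, annualized volatility, and their ratio."""
    r = np.asarray(daily_returns, dtype=float)
    growth = np.prod(1.0 + r)                              # compounded growth of 1 dollar
    ann_return = growth ** (TRADING_DAYS / len(r)) - 1.0   # geometric average per year
    ann_vol = np.std(r, ddof=1) * np.sqrt(TRADING_DAYS)    # scale daily std to a year
    sharpe = ann_return / ann_vol                          # no risk-free rate, as in the text
    return ann_return, ann_vol, sharpe

# One year of made-up daily strategy returns, for illustration only.
ann_return, ann_vol, sharpe = performance_summary([0.01, -0.005] * 126)
print(f"return={ann_return:.2%}  volatility={ann_vol:.2%}  sharpe={sharpe:.2f}")
```

As a check on the definition, the benchmark figures reported later in this notebook (16.16% annualized return and 15.15% volatility) give 0.1616 / 0.1515, which rounds to the quoted Sharpe ratio of 1.07.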
For regression models
If you look carefully at the features I have chosen, you will see that two of the features I described before are missing: 'PCA.price_prior' and 'PCA.tech_prior'. We know that 'PCA.price_prior' is the principal component of four of the price-related fundamental features, i.e. DP, PE, BM, and CAPE. So, while running our models, we can use either the principal component of these features or the features themselves. I found that using the features themselves produces a better trading strategy, and that is the reason I am not using the principal component. 'PCA.tech_prior' is the principal component of technical indicators. Since I already have many technical indicators, I took this off as well. However, you will see later that adding these features makes Random Forest a better model.
We will start with a Linear Regression model.
Let's now run a linear regression model as a starter. Going in, we should know that linear regression is typically used for inference and does a poor job at prediction. So, we should not be surprised if we see poor performance.
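A minimal sketch of such a baseline, assuming scikit-learn's `LinearRegression` and a chronological split. The data here is synthetic noise with a deliberately weak signal, standing in for the real features and returns; the variable names and the hit-rate measure are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for lagged features and next-day returns (weak signal, mostly noise).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = 0.05 * X[:, 0] + rng.normal(scale=1.0, size=1000)

# Chronological split: the first 800 rows train, the last 200 test.
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))          # modeling view
hit_rate = float(np.mean(np.sign(pred) == np.sign(y_test)))   # share of days with the right sign
print(f"RMSE={rmse:.4f}  hit rate={hit_rate:.2%}")
```

RMSE describes how close the predicted returns are to the actual ones, while the sign of each prediction is what decides whether the strategy goes long or short on that day.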
Let's take some time to discuss what the graphs mean. The first graph is the scatter diagram of the actual return vs the predicted return. If the predicted values were the same as the actual values, all the points would lie on the 45 degree line. The greater the mismatch, the further away the points will be from the 45 degree line. The second graph is called the equity curve. It is the cumulative sum of all the money we have made since we started trading using the model we have created. We started with 1 dollar and ended with about 2.2 dollars for this strategy.
Let's first select 'SPY_prior_ret' and choose the features that have low correlation to it. We will then run the new linear regression and evaluate the result.
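A correlation screen of this kind can be sketched with pandas. The data below is synthetic, the names `feature_a` and `feature_b` are placeholders, and the 0.5 cut-off is an arbitrary assumption; the text does not state a threshold.

```python
import numpy as np
import pandas as pd

# Synthetic features: feature_a is built to track SPY_prior_ret, feature_b is independent.
rng = np.random.default_rng(2)
base = rng.normal(size=300)
features = pd.DataFrame({
    "SPY_prior_ret": base,
    "feature_a": 0.9 * base + rng.normal(scale=0.1, size=300),
    "feature_b": rng.normal(size=300),
})

# Absolute correlation of every other feature with the anchor feature.
corr_with_anchor = features.corr()["SPY_prior_ret"].drop("SPY_prior_ret").abs()

threshold = 0.5  # assumed cut-off for "low correlation"
kept = ["SPY_prior_ret"] + corr_with_anchor[corr_with_anchor < threshold].index.tolist()
print(corr_with_anchor.round(3).to_dict(), kept)
```

The idea is to keep the anchor feature plus only those features that carry information not already contained in it.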
The correlation screen didn't improve the model. However, the screening was done manually through observation.
What do I mean by 'didn't improve'? As I mentioned before, I am looking at the annualized return, volatility, Sharpe ratio and the equity curve. This model has a lower annualized return with the same volatility, and hence a lower Sharpe ratio, compared to the previous linear model. We can also see that the 1 dollar we invested in the beginning ends up worth less than it did under the first linear model.
Let's now apply univariate feature selection and see if we can improve the linear model.
Let's now pick 5 features based on the ranking and statistical significance.
The top 5 features are 'SPY_prior_ret', 'SPY_Z_score', 'MSCI_EM_prior_ret', 'IBS', 'MACD_signal'
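Univariate feature selection scores each feature against the target on its own and keeps the best ones. A minimal sketch, assuming scikit-learn's `SelectKBest` with the `f_regression` score; the text does not say which scoring function was used, and the data and feature names here are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only feat_0 and feat_3 actually drive the target.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"feat_{i}" for i in range(8)])
y = 0.5 * X["feat_0"] - 0.3 * X["feat_3"] + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Rank features by F-statistic and keep the p-values for a significance check.
ranking = pd.DataFrame(
    {"feature": X.columns, "F": selector.scores_, "p_value": selector.pvalues_}
).sort_values("F", ascending=False)
top_features = ranking["feature"].head(5).tolist()
print(ranking)
```

Because each feature is scored independently, this method ignores interactions between features, which is one possible reason a univariate screen can hurt rather than help a model.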
The result of this model is clearly bad. So, we will use the first linear model as a benchmark, then develop more advanced models and see whether they can beat the best linear model we have so far.
Benchmark Linear Model Performance:
Annualized return - 16.16%
Volatility - 15.15%
Sharpe Ratio - 1.07
You might have noticed that the Random Forest Regression model includes two of the features I did not use in the best linear model, namely ‘PCA.tech_prior’ and ‘PCA.price_prior’. The reason is simple: adding them improves the trading strategy that comes out of the Random Forest Regression model.
So far, Random Forest is the only model that has beaten the Benchmark Linear Model. With the same volatility, the Random Forest model produces an annualized return that is about 3.60 percent higher. It also has a higher Sharpe ratio than the Benchmark Linear Model.
Model Optimization
Random Forest Regression Hyperparameter Tuning
We will tune the Random Forest model hyperparameters, focusing on the minimum number of samples per leaf for the underlying decision tree regressor.
As we can see, the RMSE decreases exponentially in the beginning but eventually flattens out. Let's now increase the minimum number of samples per leaf and see if we can improve the random forest model. Let’s try min_samples_leaf = 30 instead of 10. I have also increased n_estimators to reduce noise in the trend data.
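A sweep over the minimum number of samples per leaf can be sketched as below. This is an illustration on synthetic data, not the notebook's tuning code: the candidate values, the number of trees, and the choice to measure RMSE on a chronological hold-out are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the feature matrix and daily returns.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.5, size=600)
X_train, X_test = X[:450], X[450:]   # chronological split, no shuffling
y_train, y_test = y[:450], y[450:]

rmse_by_leaf = {}
for leaf in [1, 5, 10, 30, 50]:      # candidate min_samples_leaf values (assumed)
    model = RandomForestRegressor(
        n_estimators=50, min_samples_leaf=leaf, random_state=0
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse_by_leaf[leaf] = float(np.sqrt(mean_squared_error(y_test, pred)))

for leaf, rmse in rmse_by_leaf.items():
    print(f"min_samples_leaf={leaf:>2}  RMSE={rmse:.4f}")
```

Larger leaves force each tree to average over more observations, which smooths the predictions; plotting RMSE against the leaf size is one way to see where the improvement levels off.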
Optimized Random Forest
This optimized Random Forest Regression does better than the original Random Forest Regression model. The annualized return is about 2.5% higher while the volatility stays the same.
Classification Models:
I tried Logistic Regression, Naive Bayes, Random Forest, Decision Tree, AdaBoost, and XGBoost classification models, but none of them offers a better trading strategy than the Optimized Random Forest Model.
Conclusion
We evaluated about 13 models, regression and classification types combined. Of all of them, only one stood out as the champion, namely the Optimized Random Forest Regression.
We saw that optimizing the Random Forest model improved the performance of the trading strategy. Our goal is to design machine learning models that learn well on the training set and generalize well enough that their performance remains intact on data not included in the training. We were able to do this. Below is a tabular ranking of the best trading strategies we got.