AI4Finance-Foundation
GitHub Repository: AI4Finance-Foundation/FinRL
Path: blob/master/finrl/applications/imitation_learning/Stock_Selection.ipynb
Kernel: finrl

Installation Setup

%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
import numpy as np
import pandas as pd
import dask.dataframe as dd
from scipy import stats
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns; sns.set()
from pandas.tseries.offsets import *
from dateutil.relativedelta import *
import datetime as dt
import os
from linearmodels.asset_pricing import TradedFactorModel, LinearFactorModel
from IPython.core.pylabtools import figsize
from IPython.core.interactiveshell import InteractiveShell
from fredapi import Fred
fred = Fred(api_key='b0363f9c9d853b92b27e06c4727bc2ea')
import pandas_datareader.data as web

%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 10)
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
import pickle
from multiprocessing import Pool
import random
import json
import sys
import StockPortfolioEnv
import pytz
import itertools
from datetime import datetime as dt

from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl import config
from finrl import config_tickers
from finrl.config import (
    DATA_SAVE_DIR,
    TRAINED_MODEL_DIR,
    TENSORBOARD_LOG_DIR,
    RESULTS_DIR,
    INDICATORS,
    TRAIN_START_DATE,
    TRAIN_END_DATE,
    TEST_START_DATE,
    TEST_END_DATE,
    TRADE_START_DATE,
    TRADE_END_DATE,
)

if not os.path.exists("./" + config.RESULTS_DIR):
    os.makedirs("./" + config.RESULTS_DIR)
InteractiveShell.ast_node_interactivity = "all"
pd.options.display.float_format = '{:,.3f}'.format
pd.set_option('mode.use_inf_as_na', True)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 500)
idx = pd.IndexSlice

import warnings
warnings.filterwarnings('ignore')

Data

Datasets = os.path.dirname(os.getcwd()) + "\\data\\"
__depends__ = [Datasets + "RetailMarketOrder.sas7bdat",
               Datasets + "InstitutionOrder.sas7bdat",
               Datasets + "FutureReturn.sas7bdat"]
__dest__ = []
__depends__
['D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\RetailMarketOrder.sas7bdat', 'D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\InstitutionOrder.sas7bdat', 'D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\FutureReturn.sas7bdat']

Column Details

  • permno is the permanent stock identifier used in CRSP

  • Retail Market Order Imbalance (moribvol) = (Shares Purchased - Shares Sold) / (Shares Purchased + Shares Sold); a small pandas sketch of this formula, and of the Russell-group labels below, follows this list

  • 5 Russell groups

    • Classification of stocks according to the constituent members of Russell indices

    • Large-Cap (russellgroup = 1): stocks in Russell Top 200, which consists of the largest 200 members in Russell 1000

    • Mid-Cap (russellgroup = 2): stocks in Russell Mid-Cap, which consists of the smallest 800 members in Russell 1000

    • Small-Cap (russellgroup = 3): the largest 1000 members in Russell 2000

    • Micro-Cap (russellgroup = 4): stocks in Russell Micro-Cap, which consists of the smallest 1000 members in Russell 2000 plus the largest 1000 stocks outside Russell 2000

    • Nano-Cap (russellgroup = 5): all remaining stocks

  • 11 sectors
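
For concreteness, here is a minimal pandas sketch of the moribvol formula above and a code-to-label mapping for the five Russell groups. The share counts and column names are made up for illustration and are not taken from the dataset.

import pandas as pd

# Hypothetical daily buy/sell share counts for one stock (illustrative only)
example = pd.DataFrame({
    "shares_purchased": [1200, 800, 950],
    "shares_sold": [1000, 900, 950],
})

# Retail market order imbalance: (buys - sells) / (buys + sells)
example["moribvol"] = (
    (example["shares_purchased"] - example["shares_sold"])
    / (example["shares_purchased"] + example["shares_sold"])
)

# Human-readable labels for the 5 Russell groups described above
russell_labels = {
    1: "Large-Cap",   # Russell Top 200
    2: "Mid-Cap",     # Russell Mid-Cap
    3: "Small-Cap",   # largest 1000 members of Russell 2000
    4: "Micro-Cap",   # smallest 1000 of Russell 2000 plus largest 1000 outside it
    5: "Nano-Cap",    # all remaining stocks
}
example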

retail = pd.read_sas(__depends__[0], encoding='latin-1')
retail.columns = retail.columns.str.lower()
retail[['permno', 'russellgroup']] = retail[['permno', 'russellgroup']].astype(int)
retail.tail()
retail.groupby(['date','sector'])['permno'].count().unstack().plot();
[Figure: number of stocks per sector over time in the retail order data]
institution = pd.read_sas(__depends__[1], encoding='latin-1')
institution.columns = institution.columns.str.lower()
institution[['permno', 'russellgroup']] = institution[['permno', 'russellgroup']].astype(int)
institution.tail()
institution.groupby(['date','sector'])['permno'].count().unstack().plot();
[Figure: number of stocks per sector over time in the institutional order data]
ret = pd.read_sas(__depends__[2], encoding='latin-1')
ret.columns = ret.columns.str.lower()
ret['permno'] = ret['permno'].astype(int)
ret.tail()
  • ret_i, for i = 1, 5, 10, 20, is the return over the next 1, 5, 10, and 20 trading days, respectively

    • I already shifted the return back by one trading day to avoid potential forward-looking bias. For example, the ret_1 of -0.003 for stock 93436 on 2022-06-24 is the return earned by this stock on 2022-06-27. A pandas sketch of this shift follows below.
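
The returns in the dataset already arrive shifted, so no action is needed here; purely as an illustration of the idea, the sketch below shows how such a one-day shift could be done in pandas. The tiny panel and the ret_raw column are hypothetical, invented only for this example.

import pandas as pd

# Tiny illustrative panel: one stock, three trading days, same-day returns
panel = pd.DataFrame({
    "permno": [93436, 93436, 93436],
    "date": pd.to_datetime(["2022-06-23", "2022-06-24", "2022-06-27"]),
    "ret_raw": [0.001, 0.002, -0.003],   # hypothetical same-day returns
})

# Shift each stock's return back by one trading day, so that ret_1 on
# 2022-06-24 is the return realized on 2022-06-27 (here -0.003).
panel = panel.sort_values(["permno", "date"])
panel["ret_1"] = panel.groupby("permno")["ret_raw"].shift(-1)
panel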

More Data Explorations

As an example of hindsight (in-sample) analysis, we present a per-stock correlation analysis between retail order imbalance and 5-day forward returns.

# All data points - merge retail imbalance trades with return rates
retail_ret = pd.merge(retail[['ticker', 'permno', 'date', 'moribvol', "russellgroup", "sector", "price"]],
                      ret, on=['permno', 'date'], how='inner')
retail_ret = retail_ret.rename(columns={'ticker': 'tic'})
retail_ret
# Correlation analysis between imbalance trades and return rates in 5 days
fut_ret = 'ret_5'
# tic, russellgroup, and sector may change for a company over time...
grouped = retail_ret.groupby(['permno', 'tic', "russellgroup", "sector"])
df_correlation = grouped.apply(lambda x: x['moribvol'].corr(x[fut_ret])).dropna()
df_correlation = df_correlation[df_correlation > 0].reset_index()
df_correlation = df_correlation.rename(columns={df_correlation.columns[-1]: 'corr'})
df_correlation.groupby(['russellgroup','sector'])['permno'].count().unstack().plot.bar();
[Figure: bar chart of positively correlated stock counts by Russell group and sector]
# A panel of plots of correlation values
g = sns.FacetGrid(df_correlation, col='russellgroup', row='sector')
g.map(sns.histplot, 'corr', kde=False)
g.fig.tight_layout()
g.fig.show()
<seaborn.axisgrid.FacetGrid at 0x14106f21580>
[Figure: histograms of correlation values, faceted by Russell group (columns) and sector (rows)]
# Plot the distribution of correlation values for each sector
grouped = df_correlation.groupby(['sector'])
grouped['corr'].plot.hist(alpha=0.5, bins=20)
plt.legend()
plt.show()
sector
XLB    AxesSubplot(0.125,0.11;0.775x0.77)
XLC    AxesSubplot(0.125,0.11;0.775x0.77)
XLE    AxesSubplot(0.125,0.11;0.775x0.77)
XLF    AxesSubplot(0.125,0.11;0.775x0.77)
XLI    AxesSubplot(0.125,0.11;0.775x0.77)
XLK    AxesSubplot(0.125,0.11;0.775x0.77)
XLP    AxesSubplot(0.125,0.11;0.775x0.77)
XLR    AxesSubplot(0.125,0.11;0.775x0.77)
XLU    AxesSubplot(0.125,0.11;0.775x0.77)
XLV    AxesSubplot(0.125,0.11;0.775x0.77)
XLY    AxesSubplot(0.125,0.11;0.775x0.77)
Name: corr, dtype: object
<matplotlib.legend.Legend at 0x14106f03790>
[Figure: overlaid histograms of correlation values by sector]
# Max correlated stocks in different cap sizes and sectors
def get_max_row(group):
    return group[group['corr'] == group['corr'].max()].iloc[0, :]

df_correlation.groupby(['russellgroup', 'sector']).apply(get_max_row)

Stock Selection

Pick 11 large-cap tech (XLK) firms whose retail investor order imbalance is significantly correlated with 5-day forward returns. The selected stocks are ["QCOM", "ADSK", "FSLR", "MSFT", "AMD", "ORCL", "INTU", "WU", "LRCX", "TXN", "CSCO"].
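
The cells below rank the raw correlations; as a hedged sketch of how the significance could be checked explicitly, one could compute each stock's Pearson correlation together with its p-value via scipy.stats.pearsonr (imported above as stats). The 0.05 cutoff is an illustrative choice, not the notebook's selection rule.

# Illustrative significance check (not the exact selection rule used below)
xlk_large = retail_ret[(retail_ret['sector'] == 'XLK') & (retail_ret['russellgroup'] == 1)]

def corr_with_pvalue(group):
    valid = group[['moribvol', 'ret_5']].dropna()
    if len(valid) < 3:
        return pd.Series({'corr': np.nan, 'pvalue': np.nan})
    r, p = stats.pearsonr(valid['moribvol'], valid['ret_5'])
    return pd.Series({'corr': r, 'pvalue': p})

# Keep stocks whose correlation is positive and significant at the 5% level
sig = (xlk_large.groupby(['permno', 'tic'])
       .apply(corr_with_pvalue)
       .query('pvalue < 0.05 and corr > 0')
       .sort_values('corr', ascending=False))
sig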

# stock_grp = ['Large-Cap', 'Mid-Cap', 'Small-Cap', 'Micro-Cap', 'Nano-Cap']
# tech_largeCap = retail[(retail["sector"] == 'XLK')]
tech_largeCap = retail[(retail["sector"] == 'XLK') & (retail["russellgroup"] == 1)]
tech_largeCap_ret = pd.merge(tech_largeCap[['ticker', 'permno', 'date', 'moribvol', "price", "russellgroup"]],
                             ret, on=['permno', 'date'], how='inner')
tech_largeCap_ret
tech_largeCap_ret['ticker'].unique()
array(['SUNW', 'JAVA', 'ORCL', 'MSFT', 'EMC', 'FISV', 'DELL', 'MMI', 'IBM', 'NOW', 'WDAY', 'AAPL', 'AMAT', 'CDK', 'PYPL', 'TXN', 'HPE', 'SQ', 'VSM', 'ZM', 'GLW', 'MOT', 'MSI', 'HPQ', 'DXC', 'ADP', 'LRCX', 'MU', 'INTC', 'ADI', 'AMD', 'ADBE', 'SYMC', 'CSCO', 'QCOM', 'FDC', 'INTU', 'NTAP', 'ADSK', 'BRCM', 'CTSH', 'NVDA', 'JNPR', 'GPN', 'FIS', 'CRM', 'MA', 'WU', 'FSLR', 'VMW', 'V', 'AVGO', 'AOL'], dtype=object)
fut_ret = 'ret_5'
grouped = tech_largeCap_ret.groupby('permno')
tech_largeCap_correlation = grouped.apply(lambda x: x['moribvol'].corr(x[fut_ret])).dropna()
tech_largeCap_correlation = tech_largeCap_correlation.sort_values(ascending=False).reset_index()
tech_largeCap_correlation = tech_largeCap_correlation.rename(columns={tech_largeCap_correlation.columns[-1]: 'corr'})
tech_largeCap_correlation = tech_largeCap_correlation.merge(tech_largeCap_ret[['ticker', 'permno', "russellgroup"]], on='permno')
tech_largeCap_correlation = tech_largeCap_correlation.drop_duplicates()
tech_largeCap_correlation
selected_tech_tic = ["QCOM", "ADSK", "FSLR", "MSFT", "AMD", "ORCL", "INTU", "WU", "LRCX", "TXN", "CSCO"]
selected_tech_ret = retail_ret[retail_ret['tic'].isin(selected_tech_tic)]

# Earliest date of public trades
min(selected_tech_ret['date'])
# Latest date of public trades
max(selected_tech_ret['date'])

# Check that the series length is consistent across the selected stocks
assert selected_tech_ret.groupby(["permno", "tic"]).size().nunique() == 1

# Sort by date
selected_tech_ret = selected_tech_ret.sort_values(['date', 'tic'])
Timestamp('2007-01-03 00:00:00')
Timestamp('2021-12-31 00:00:00')

Data Merge with Technical Indicators

from datetime import datetime as dt

tz = pytz.timezone("America/New_York")
start = tz.localize(dt(2007, 1, 3))
end = tz.localize(dt.today())

fe = FeatureEngineer(use_technical_indicator=True,
                     tech_indicator_list=INDICATORS,
                     use_turbulence=True,
                     user_defined_feature=False)

# yf_tech = YahooDownloader(start_date=start,
#                           end_date=end,
#                           ticker_list=selected_tech_tic).fetch_data()
# # Stack
# processed = fe.preprocess_data(yf_tech)

processed = pd.DataFrame()
for symbol in selected_tech_tic:
    yf_symbol = YahooDownloader(start_date=start,
                                end_date=end,
                                ticker_list=[symbol]).fetch_data()
    processed_symbol = fe.preprocess_data(yf_symbol)
    processed = pd.concat([processed, processed_symbol], axis=0)

processed = processed.copy()
processed = processed.fillna(0)
processed = processed.replace(np.inf, 0)
processed['date'] = pd.to_datetime(processed['date'])
processed
[*********************100%***********************]  1 of 1 completed
Shape of DataFrame: (4074, 8)
Successfully added technical indicators
Successfully added turbulence index
(the download and feature-engineering messages above repeat for each of the 11 selected tickers)
processed['tic'].unique()
array(['QCOM', 'ADSK', 'FSLR', 'MSFT', 'AMD', 'ORCL', 'INTU', 'WU', 'LRCX', 'TXN', 'CSCO'], dtype=object)
# Merge technical indicators with returns
df_merged = pd.merge(retail_ret, processed, on=['tic', 'date'])
# Set index=False to exclude the row index from the CSV file
df_merged.to_csv('data/merged.csv', index=False)
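
A possible follow-up (not part of this notebook) is to reload the merged file and split it into training and trading windows with FinRL's data_split, which is already imported above. The boundary dates below are placeholders for illustration, not values chosen in the notebook.

# Illustrative follow-up: reload the merged panel and split it by date range
df_reloaded = pd.read_csv('data/merged.csv')
train = data_split(df_reloaded, '2007-01-03', '2019-12-31')
trade = data_split(df_reloaded, '2020-01-01', '2021-12-31')
train.shape, trade.shape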