AI4Finance-Foundation
GitHub Repository: AI4Finance-Foundation/FinRL
Path: blob/master/finrl/applications/imitation_learning/Stock_Selection.ipynb
Kernel: finrl

Installation Setup

%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
import numpy as np
import pandas as pd
import dask.dataframe as dd
from scipy import stats
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns; sns.set()
from pandas.tseries.offsets import *
from dateutil.relativedelta import *
import datetime as dt
import os
from linearmodels.asset_pricing import TradedFactorModel, LinearFactorModel
from IPython.core.pylabtools import figsize
from IPython.core.interactiveshell import InteractiveShell
from fredapi import Fred
fred = Fred(api_key='b0363f9c9d853b92b27e06c4727bc2ea')
import pandas_datareader.data as web

%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (20, 10)
%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
import pickle
from multiprocessing import Pool
import random
import json
import sys
import StockPortfolioEnv
import pytz
import itertools
from datetime import datetime as dt

from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl import config
from finrl import config_tickers
from finrl.config import (
    DATA_SAVE_DIR,
    TRAINED_MODEL_DIR,
    TENSORBOARD_LOG_DIR,
    RESULTS_DIR,
    INDICATORS,
    TRAIN_START_DATE,
    TRAIN_END_DATE,
    TEST_START_DATE,
    TEST_END_DATE,
    TRADE_START_DATE,
    TRADE_END_DATE,
)

if not os.path.exists("./" + config.RESULTS_DIR):
    os.makedirs("./" + config.RESULTS_DIR)
InteractiveShell.ast_node_interactivity = "all"
pd.options.display.float_format = '{:,.3f}'.format
pd.set_option('mode.use_inf_as_na', True)
pd.set_option('display.max_columns', 300)
pd.set_option('display.max_rows', 500)
idx = pd.IndexSlice

import warnings
warnings.filterwarnings('ignore')

Data

Datasets = os.path.dirname(os.getcwd()) + "\\data\\"
__depends__ = [Datasets + "RetailMarketOrder.sas7bdat",
               Datasets + "InstitutionOrder.sas7bdat",
               Datasets + "FutureReturn.sas7bdat"]
__dest__ = []
__depends__
['D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\RetailMarketOrder.sas7bdat', 'D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\InstitutionOrder.sas7bdat', 'D:\\Google Drive - Columbia University\\AI4Finance\\Imitation Learning\\data\\FutureReturn.sas7bdat']

Column Details

  • permno is the permanent stock identifier used in CRSP

  • Retail Market Order Imbalance (moribvol) = (Shares Purchased - Shares Sold) / (Shares Purchased + Shares Sold); a small pandas sketch of this formula, and of the Russell-group labels below, follows this list

  • 5 Russell groups

    • Classification of stocks according to the constituent members of Russell indices

    • Large-Cap (russellgroup = 1): stocks in Russell Top 200, which consists of the largest 200 members in Russell 1000

    • Mid-Cap (russellgroup = 2): stocks in Russell Mid-Cap, which consists of the smallest 800 members in Russell 1000

    • Small-Cap (russellgroup = 3): the largest 1000 members in Russell 2000

    • Micro-Cap (russellgroup = 4): stocks in Russell Micro-Cap, which consists of the smallest 1000 members in Russell 2000 plus the largest 1000 stocks outside Russell 2000

    • Nano-Cap (russellgroup = 5): all remaining stocks

  • 11 sectors
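
For concreteness, here is a minimal pandas sketch of the moribvol formula above and a code-to-label mapping for the five Russell groups. The share counts and column names are made up for illustration and are not taken from the dataset.

import pandas as pd

# Hypothetical daily buy/sell share counts for one stock (illustrative only)
example = pd.DataFrame({
    "shares_purchased": [1200, 800, 950],
    "shares_sold": [1000, 900, 950],
})

# Retail market order imbalance: (buys - sells) / (buys + sells)
example["moribvol"] = (
    (example["shares_purchased"] - example["shares_sold"])
    / (example["shares_purchased"] + example["shares_sold"])
)

# Human-readable labels for the 5 Russell groups described above
russell_labels = {
    1: "Large-Cap",   # Russell Top 200
    2: "Mid-Cap",     # Russell Mid-Cap
    3: "Small-Cap",   # largest 1000 members of Russell 2000
    4: "Micro-Cap",   # smallest 1000 of Russell 2000 plus largest 1000 outside it
    5: "Nano-Cap",    # all remaining stocks
}
example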

retail = pd.read_sas(__depends__[0], encoding='latin-1')
retail.columns = retail.columns.str.lower()
retail[['permno', 'russellgroup']] = retail[['permno', 'russellgroup']].astype(int)
retail.tail()
retail.groupby(['date','sector'])['permno'].count().unstack().plot();
[Figure: number of stocks per sector over time in the retail order data]
institution = pd.read_sas(__depends__[1], encoding='latin-1')
institution.columns = institution.columns.str.lower()
institution[['permno', 'russellgroup']] = institution[['permno', 'russellgroup']].astype(int)
institution.tail()
institution.groupby(['date','sector'])['permno'].count().unstack().plot();
[Figure: number of stocks per sector over time in the institutional order data]
ret = pd.read_sas(__depends__[2], encoding='latin-1')
ret.columns = ret.columns.str.lower()
ret['permno'] = ret['permno'].astype(int)
ret.tail()
  • ret_i, for i = 1, 5, 10, 20, is the return over the next 1, 5, 10, and 20 trading days, respectively

    • I already shifted the return back by one trading day to avoid potential forward-looking bias. For example, the ret_1 of -0.003 for stock 93436 on 2022-06-24 is the return earned by this stock on 2022-06-27. A pandas sketch of this shift follows below.
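
The returns in the dataset already arrive shifted, so no action is needed here; purely as an illustration of the idea, the sketch below shows how such a one-day shift could be done in pandas. The tiny panel and the ret_raw column are hypothetical, invented only for this example.

import pandas as pd

# Tiny illustrative panel: one stock, three trading days, same-day returns
panel = pd.DataFrame({
    "permno": [93436, 93436, 93436],
    "date": pd.to_datetime(["2022-06-23", "2022-06-24", "2022-06-27"]),
    "ret_raw": [0.001, 0.002, -0.003],   # hypothetical same-day returns
})

# Shift each stock's return back by one trading day, so that ret_1 on
# 2022-06-24 is the return realized on 2022-06-27 (here -0.003).
panel = panel.sort_values(["permno", "date"])
panel["ret_1"] = panel.groupby("permno")["ret_raw"].shift(-1)
panel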

More Data Explorations

As an example of hindsight (in-sample) analysis, we present a per-stock correlation analysis between retail order imbalance and 5-day forward returns.

# All data points - merge retail imbalance trades with return rates
retail_ret = pd.merge(retail[['ticker', 'permno', 'date', 'moribvol', "russellgroup", "sector", "price"]],
                      ret, on=['permno', 'date'], how='inner')
retail_ret = retail_ret.rename(columns={'ticker': 'tic'})
retail_ret
# Correlation analysis between imbalance trades and return rates in 5 days
fut_ret = 'ret_5'
# tic, russellgroup, and sector may change for a company over time...
grouped = retail_ret.groupby(['permno', 'tic', "russellgroup", "sector"])
df_correlation = grouped.apply(lambda x: x['moribvol'].corr(x[fut_ret])).dropna()
df_correlation = df_correlation[df_correlation > 0].reset_index()
df_correlation = df_correlation.rename(columns={df_correlation.columns[-1]: 'corr'})
df_correlation.groupby(['russellgroup','sector'])['permno'].count().unstack().plot.bar();
[Figure: bar chart of positively correlated stock counts by Russell group and sector]
# A panel of plots of correlation values
g = sns.FacetGrid(df_correlation, col='russellgroup', row='sector')
g.map(sns.histplot, 'corr', kde=False)
g.fig.tight_layout()
g.fig.show()
<seaborn.axisgrid.FacetGrid at 0x14106f21580>
[Figure: histograms of correlation values, faceted by Russell group (columns) and sector (rows)]
# Plot the distribution of correlation values for each sector
grouped = df_correlation.groupby(['sector'])
grouped['corr'].plot.hist(alpha=0.5, bins=20)
plt.legend()
plt.show()
sector
XLB    AxesSubplot(0.125,0.11;0.775x0.77)
XLC    AxesSubplot(0.125,0.11;0.775x0.77)
XLE    AxesSubplot(0.125,0.11;0.775x0.77)
XLF    AxesSubplot(0.125,0.11;0.775x0.77)
XLI    AxesSubplot(0.125,0.11;0.775x0.77)
XLK    AxesSubplot(0.125,0.11;0.775x0.77)
XLP    AxesSubplot(0.125,0.11;0.775x0.77)
XLR    AxesSubplot(0.125,0.11;0.775x0.77)
XLU    AxesSubplot(0.125,0.11;0.775x0.77)
XLV    AxesSubplot(0.125,0.11;0.775x0.77)
XLY    AxesSubplot(0.125,0.11;0.775x0.77)
Name: corr, dtype: object
<matplotlib.legend.Legend at 0x14106f03790>
[Figure: overlaid histograms of correlation values by sector]
# Max correlated stocks in different cap sizes and sectors
def get_max_row(group):
    return group[group['corr'] == group['corr'].max()].iloc[0, :]

df_correlation.groupby(['russellgroup', 'sector']).apply(get_max_row)

Stock Selection

Pick 11 large-cap tech (XLK) firms whose retail investor order imbalance is significantly correlated with 5-day forward returns. The selected stocks are ["QCOM", "ADSK", "FSLR", "MSFT", "AMD", "ORCL", "INTU", "WU", "LRCX", "TXN", "CSCO"].
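
The cells below rank the raw correlations; as a hedged sketch of how the significance could be checked explicitly, one could compute each stock's Pearson correlation together with its p-value via scipy.stats.pearsonr (imported above as stats). The 0.05 cutoff is an illustrative choice, not the notebook's selection rule.

# Illustrative significance check (not the exact selection rule used below)
xlk_large = retail_ret[(retail_ret['sector'] == 'XLK') & (retail_ret['russellgroup'] == 1)]

def corr_with_pvalue(group):
    valid = group[['moribvol', 'ret_5']].dropna()
    if len(valid) < 3:
        return pd.Series({'corr': np.nan, 'pvalue': np.nan})
    r, p = stats.pearsonr(valid['moribvol'], valid['ret_5'])
    return pd.Series({'corr': r, 'pvalue': p})

# Keep stocks whose correlation is positive and significant at the 5% level
sig = (xlk_large.groupby(['permno', 'tic'])
       .apply(corr_with_pvalue)
       .query('pvalue < 0.05 and corr > 0')
       .sort_values('corr', ascending=False))
sig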

# stock_grp = ['Large-Cap', 'Mid-Cap', 'Small-Cap', 'Micro-Cap', 'Nano-Cap']
# tech_largeCap = retail[(retail["sector"] == 'XLK')]
tech_largeCap = retail[(retail["sector"] == 'XLK') & (retail["russellgroup"] == 1)]
tech_largeCap_ret = pd.merge(tech_largeCap[['ticker', 'permno', 'date', 'moribvol', "price", "russellgroup"]],
                             ret, on=['permno', 'date'], how='inner')
tech_largeCap_ret
tech_largeCap_ret['ticker'].unique()
array(['SUNW', 'JAVA', 'ORCL', 'MSFT', 'EMC', 'FISV', 'DELL', 'MMI', 'IBM', 'NOW', 'WDAY', 'AAPL', 'AMAT', 'CDK', 'PYPL', 'TXN', 'HPE', 'SQ', 'VSM', 'ZM', 'GLW', 'MOT', 'MSI', 'HPQ', 'DXC', 'ADP', 'LRCX', 'MU', 'INTC', 'ADI', 'AMD', 'ADBE', 'SYMC', 'CSCO', 'QCOM', 'FDC', 'INTU', 'NTAP', 'ADSK', 'BRCM', 'CTSH', 'NVDA', 'JNPR', 'GPN', 'FIS', 'CRM', 'MA', 'WU', 'FSLR', 'VMW', 'V', 'AVGO', 'AOL'], dtype=object)
fut_ret = 'ret_5'
grouped = tech_largeCap_ret.groupby('permno')
tech_largeCap_correlation = grouped.apply(lambda x: x['moribvol'].corr(x[fut_ret])).dropna()
tech_largeCap_correlation = tech_largeCap_correlation.sort_values(ascending=False).reset_index()
tech_largeCap_correlation = tech_largeCap_correlation.rename(columns={tech_largeCap_correlation.columns[-1]: 'corr'})
tech_largeCap_correlation = tech_largeCap_correlation.merge(tech_largeCap_ret[['ticker', 'permno', "russellgroup"]], on='permno')
tech_largeCap_correlation = tech_largeCap_correlation.drop_duplicates()
tech_largeCap_correlation
selected_tech_tic = ["QCOM", "ADSK", "FSLR", "MSFT", "AMD", "ORCL", "INTU", "WU", "LRCX", "TXN", "CSCO"]
selected_tech_ret = retail_ret[retail_ret['tic'].isin(selected_tech_tic)]

# Earliest date of public trades
min(selected_tech_ret['date'])
# Latest date of public trades
max(selected_tech_ret['date'])

# Check that the series length is consistent across the selected stocks
assert selected_tech_ret.groupby(["permno", "tic"]).size().nunique() == 1

# Sort by date
selected_tech_ret = selected_tech_ret.sort_values(['date', 'tic'])
Timestamp('2007-01-03 00:00:00')
Timestamp('2021-12-31 00:00:00')

Data Merge with Technical Indicators

from datetime import datetime as dt

tz = pytz.timezone("America/New_York")
start = tz.localize(dt(2007, 1, 3))
end = tz.localize(dt.today())

fe = FeatureEngineer(use_technical_indicator=True,
                     tech_indicator_list=INDICATORS,
                     use_turbulence=True,
                     user_defined_feature=False)

# yf_tech = YahooDownloader(start_date=start,
#                           end_date=end,
#                           ticker_list=selected_tech_tic).fetch_data()
# # Stack
# processed = fe.preprocess_data(yf_tech)

processed = pd.DataFrame()
for symbol in selected_tech_tic:
    yf_symbol = YahooDownloader(start_date=start,
                                end_date=end,
                                ticker_list=[symbol]).fetch_data()
    processed_symbol = fe.preprocess_data(yf_symbol)
    processed = pd.concat([processed, processed_symbol], axis=0)

processed = processed.copy()
processed = processed.fillna(0)
processed = processed.replace(np.inf, 0)
processed['date'] = pd.to_datetime(processed['date'])
processed
[*********************100%***********************]  1 of 1 completed
Shape of DataFrame: (4074, 8)
Successfully added technical indicators
Successfully added turbulence index
(the download and feature-engineering messages above repeat for each of the 11 selected tickers)
processed['tic'].unique()
array(['QCOM', 'ADSK', 'FSLR', 'MSFT', 'AMD', 'ORCL', 'INTU', 'WU', 'LRCX', 'TXN', 'CSCO'], dtype=object)
# Merge technical indicators with returns
df_merged = pd.merge(retail_ret, processed, on=['tic', 'date'])
# Set index=False to exclude the row index from the CSV file
df_merged.to_csv('data/merged.csv', index=False)
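
A possible follow-up (not part of this notebook) is to reload the merged file and split it into training and trading windows with FinRL's data_split, which is already imported above. The boundary dates below are placeholders for illustration, not values chosen in the notebook.

# Illustrative follow-up: reload the merged panel and split it by date range
df_reloaded = pd.read_csv('data/merged.csv')
train = data_split(df_reloaded, '2007-01-03', '2019-12-31')
trade = data_split(df_reloaded, '2020-01-01', '2021-12-31')
train.shape, trade.shape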