GitHub Repository: packtpublishing/machine-learning-for-algorithmic-trading-second-edition
Path: blob/master/19_recurrent_neural_nets/00_build_dataset.ipynb
Kernel: Python 3

Create a dataset formatted for RNN examples

Imports & Settings

import warnings

warnings.filterwarnings('ignore')
from pathlib import Path

import numpy as np
import pandas as pd
np.random.seed(42)
idx = pd.IndexSlice

Build daily dataset

DATA_DIR = Path('..', 'data')
prices = (pd.read_hdf(DATA_DIR / 'assets.h5', 'quandl/wiki/prices')
          .loc[idx['2010':'2017', :], ['adj_close', 'adj_volume']])
prices.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5698754 entries, (Timestamp('2010-01-04 00:00:00'), 'A') to (Timestamp('2017-12-29 00:00:00'), 'ZUMZ')
Data columns (total 2 columns):
 #   Column      Dtype  
---  ------      -----  
 0   adj_close   float64
 1   adj_volume  float64
dtypes: float64(2)
memory usage: 109.5+ MB

Select most traded stocks

n_dates = len(prices.index.unique('date'))
dollar_vol = (prices.adj_close.mul(prices.adj_volume)
              .unstack('ticker')
              .dropna(thresh=int(.95 * n_dates), axis=1)
              .rank(ascending=False, axis=1)
              .stack('ticker'))
most_traded = dollar_vol.groupby(level='ticker').mean().nsmallest(500).index
returns = (prices.loc[idx[:, most_traded], 'adj_close']
           .unstack('ticker')
           .pct_change()
           .sort_index(ascending=False))
returns.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2013 entries, 2017-12-29 to 2010-01-04
Columns: 500 entries, AAPL to CNC
dtypes: float64(500)
memory usage: 7.7 MB
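
The ranking logic is easy to misread: rank(ascending=False) assigns rank 1 to the largest dollar volume on each date, so the 500 tickers with the smallest average rank are the most traded. A minimal sketch on synthetic data (the tickers below are hypothetical) makes the mechanics explicit:

# toy daily dollar volumes for three hypothetical tickers
toy = pd.DataFrame({'BIG': [9e9, 8e9], 'MID': [1e9, 2e9], 'SML': [1e7, 2e7]},
                   index=pd.to_datetime(['2017-01-02', '2017-01-03']))
ranks = toy.rank(ascending=False, axis=1)   # rank 1 = largest dollar volume on each date
ranks.mean().nsmallest(2).index.tolist()    # ['BIG', 'MID'] -> smallest mean rank = most traded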

Stack 21-day time series

n = len(returns)
T = 21  # days
tcols = list(range(T))
tickers = returns.columns
data = pd.DataFrame()
for i in range(n-T-1):
    df = returns.iloc[i:i+T+1]
    date = df.index.max()
    data = pd.concat([data, df.reset_index(drop=True).T
                      .assign(date=date, ticker=tickers)
                      .set_index(['ticker', 'date'])])
data = data.rename(columns={0: 'label'}).sort_index().dropna()
data.loc[:, tcols[1:]] = (data.loc[:, tcols[1:]]
                          .apply(lambda x: x.clip(lower=x.quantile(.01),
                                                  upper=x.quantile(.99))))
data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 995499 entries, ('A', Timestamp('2010-02-04 00:00:00')) to ('ZION', Timestamp('2017-12-29 00:00:00'))
Data columns (total 22 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   label   995499 non-null  float64
 1   1       995499 non-null  float64
 2   2       995499 non-null  float64
 3   3       995499 non-null  float64
 4   4       995499 non-null  float64
 5   5       995499 non-null  float64
 6   6       995499 non-null  float64
 7   7       995499 non-null  float64
 8   8       995499 non-null  float64
 9   9       995499 non-null  float64
 10  10      995499 non-null  float64
 11  11      995499 non-null  float64
 12  12      995499 non-null  float64
 13  13      995499 non-null  float64
 14  14      995499 non-null  float64
 15  15      995499 non-null  float64
 16  16      995499 non-null  float64
 17  17      995499 non-null  float64
 18  18      995499 non-null  float64
 19  19      995499 non-null  float64
 20  20      995499 non-null  float64
 21  21      995499 non-null  float64
dtypes: float64(22)
memory usage: 171.0+ MB
data.shape
(995499, 22)
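
Growing a DataFrame by repeated pd.concat inside the loop copies the accumulated data on every iteration, which becomes slow as the number of windows grows. The following is a sketch of a vectorized alternative (it reuses returns, T, tcols, and tickers from above and introduces its own names windows, frames, and data_alt); the winsorization of the lag columns then applies unchanged:

from numpy.lib.stride_tricks import sliding_window_view  # numpy >= 1.20

window = T + 1  # the label return plus 21 lagged returns

# shape (n - T, n_tickers, window); column j of windows[i] is the return T-j... wait, j periods before date i
windows = sliding_window_view(returns.to_numpy(), window, axis=0)

frames = []
for i in range(windows.shape[0] - 1):                # mirrors the range(n-T-1) loop above
    frames.append(pd.DataFrame(windows[i],           # (n_tickers, window)
                               index=tickers,
                               columns=list(range(window)))
                  .assign(date=returns.index[i])     # latest date in the window (descending order)
                  .set_index('date', append=True))

data_alt = (pd.concat(frames)                        # single concat instead of one per window
            .rename(columns={0: 'label'})
            .sort_index()
            .dropna())
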
data.to_hdf('data.h5', 'returns_daily')
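
Each stored row holds the label (the most recent daily return) in the first column, followed by the 21 preceding returns in reverse chronological order. A minimal sketch of how a downstream notebook might reload the table and arrange it into the (samples, timesteps, features) layout that RNN layers expect (the variable names here are illustrative):

data = pd.read_hdf('data.h5', 'returns_daily')

lag_cols = list(range(1, 22))                    # lags 1..21, most recent first
# reverse the lag columns so each sequence runs oldest-to-newest,
# then reshape to (samples, timesteps, features)
X = data[lag_cols].to_numpy()[:, ::-1].reshape(-1, 21, 1)
y = data['label'].to_numpy()                     # next-day return to predict
X.shape, y.shape                                 # (995499, 21, 1), (995499,)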

Build weekly dataset

We load the Quandl adjusted stock price data:

prices = (pd.read_hdf(DATA_DIR / 'assets.h5', 'quandl/wiki/prices')
          .adj_close
          .unstack().loc['2007':])
prices.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2896 entries, 2007-01-01 to 2018-03-27
Columns: 3199 entries, A to ZUMZ
dtypes: float64(3199)
memory usage: 70.7 MB

Resample to weekly frequency

We start by generating weekly returns for close to 2,500 stocks without missing data for the 2008-17 period, as follows:

returns = (prices
           .resample('W')
           .last()
           .pct_change()
           .loc['2008': '2017']
           .dropna(axis=1)
           .sort_index(ascending=False))
returns.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 522 entries, 2017-12-31 to 2008-01-06
Freq: -1W-SUN
Columns: 2489 entries, A to ZUMZ
dtypes: float64(2489)
memory usage: 9.9 MB
pd.concat([returns.head(), returns.tail()])
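
A quick check (a sketch reusing the returns frame above) that the resampled index is weekly with no gaps; since the frame is sorted in descending order, consecutive differences are -7 days and the 'W' alias labels each period by its Sunday end date:

returns.index.to_series().diff().dropna().unique()   # all gaps are exactly -7 days
returns.index.dayofweek.unique()                     # [6] -> every label is a Sunday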

Create & stack 52-week sequences

We'll use 52-week sequences, which we'll create in a stacked format:

n = len(returns)
T = 52  # weeks
tcols = list(range(T))
tickers = returns.columns
data = pd.DataFrame()
for i in range(n-T-1):
    df = returns.iloc[i:i+T+1]
    date = df.index.max()
    data = pd.concat([data, (df.reset_index(drop=True).T
                             .assign(date=date, ticker=tickers)
                             .set_index(['ticker', 'date']))])
data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1167341 entries, ('A', Timestamp('2017-12-31 00:00:00')) to ('ZUMZ', Timestamp('2009-01-11 00:00:00'))
Data columns (total 53 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   0       1167341 non-null  float64
 1   1       1167341 non-null  float64
 2   2       1167341 non-null  float64
 3   3       1167341 non-null  float64
 4   4       1167341 non-null  float64
 5   5       1167341 non-null  float64
 6   6       1167341 non-null  float64
 7   7       1167341 non-null  float64
 8   8       1167341 non-null  float64
 9   9       1167341 non-null  float64
 10  10      1167341 non-null  float64
 11  11      1167341 non-null  float64
 12  12      1167341 non-null  float64
 13  13      1167341 non-null  float64
 14  14      1167341 non-null  float64
 15  15      1167341 non-null  float64
 16  16      1167341 non-null  float64
 17  17      1167341 non-null  float64
 18  18      1167341 non-null  float64
 19  19      1167341 non-null  float64
 20  20      1167341 non-null  float64
 21  21      1167341 non-null  float64
 22  22      1167341 non-null  float64
 23  23      1167341 non-null  float64
 24  24      1167341 non-null  float64
 25  25      1167341 non-null  float64
 26  26      1167341 non-null  float64
 27  27      1167341 non-null  float64
 28  28      1167341 non-null  float64
 29  29      1167341 non-null  float64
 30  30      1167341 non-null  float64
 31  31      1167341 non-null  float64
 32  32      1167341 non-null  float64
 33  33      1167341 non-null  float64
 34  34      1167341 non-null  float64
 35  35      1167341 non-null  float64
 36  36      1167341 non-null  float64
 37  37      1167341 non-null  float64
 38  38      1167341 non-null  float64
 39  39      1167341 non-null  float64
 40  40      1167341 non-null  float64
 41  41      1167341 non-null  float64
 42  42      1167341 non-null  float64
 43  43      1167341 non-null  float64
 44  44      1167341 non-null  float64
 45  45      1167341 non-null  float64
 46  46      1167341 non-null  float64
 47  47      1167341 non-null  float64
 48  48      1167341 non-null  float64
 49  49      1167341 non-null  float64
 50  50      1167341 non-null  float64
 51  51      1167341 non-null  float64
 52  52      1167341 non-null  float64
dtypes: float64(53)
memory usage: 476.6+ MB
data[tcols] = (data[tcols].apply(lambda x: x.clip(lower=x.quantile(.01), upper=x.quantile(.99))))
data = data.rename(columns={0: 'fwd_returns'})
data['label'] = (data['fwd_returns'] > 0).astype(int)
data.shape
(1167341, 54)
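
Since the weekly target is a binary up/down label, a quick look at class balance (using the frame built above) helps judge whether accuracy is a meaningful metric later on:

# share of positive vs. negative forward weekly returns
data.label.value_counts(normalize=True)
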
data.sort_index().to_hdf('data.h5', 'returns_weekly')
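
As with the daily table, a downstream notebook can reload this key and split it into sequences and binary labels; a minimal sketch (the first column holds the forward weekly return, columns 1..52 the preceding weekly returns in reverse chronological order; variable names here are illustrative):

data = pd.read_hdf('data.h5', 'returns_weekly')

lag_cols = list(range(1, 53))                    # 52 weekly lags, most recent first
X = data[lag_cols].to_numpy()[:, ::-1].reshape(-1, 52, 1)   # oldest-to-newest sequences
y = data['label'].to_numpy()                     # 1 if the forward weekly return is positive
X.shape, y.shape                                 # (1167341, 52, 1), (1167341,)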