GitHub Repository: packtpublishing/machine-learning-for-algorithmic-trading-second-edition
Path: blob/master/12_gradient_boosting_machines/10_intraday_features.ipynb
Kernel: Python 3

Intraday Strategy, Part 1: Feature Engineering

In this notebook, we load the high-quality NASDAQ100 minute-bar trade-and-quote data generously provided by Algoseek (available here) and engineer a few illustrative features.

The rich set of trade and quote information contained in the Algoseek data offers various opportunities to add information, e.g. about relative spreads and demand/supply imbalances, but since the data is fairly large we limit our efforts to a small number of features.
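As an illustration of the quote-based information mentioned above, the following minimal sketch derives a relative spread and a simple bid/ask size imbalance from the close-of-bar NBBO fields. The toy values and the column names `rel_spread` and `imbalance` are assumptions made for this example; the notebook itself does not compute these features.

```python
import pandas as pd

# Toy quotes using close-of-bar NBBO fields; the values are made up
quotes = pd.DataFrame({
    'closebidprice': [99.98, 100.00, 100.01],
    'closeaskprice': [100.02, 100.01, 100.05],
    'closebidsize': [300, 500, 100],
    'closeasksize': [100, 500, 400],
})

mid = (quotes.closebidprice + quotes.closeaskprice) / 2

# Relative spread: quoted spread as a fraction of the mid price
quotes['rel_spread'] = (quotes.closeaskprice - quotes.closebidprice) / mid

# Size imbalance in [-1, 1]: positive when more size rests on the bid
total_size = quotes.closebidsize + quotes.closeasksize
quotes['imbalance'] = (quotes.closebidsize - quotes.closeasksize) / total_size
```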

Note that we will assume throughout that we can always buy (sell) at the first (last) trade price for a given bar at no cost and without market impact; this is unlikely to be true in reality but simplifies the example.

The next notebook will use this dataset to train a model that predicts 1-minute returns using LightGBM.

Imports & Settings

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

from pathlib import Path
from tqdm import tqdm

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import talib

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns

sns.set_style('whitegrid')
idx = pd.IndexSlice
deciles = np.arange(.1, 1, .1)

Algoseek Trade & Quote Minute Bar Data

Data Dictionary

The Quote fields are based on changes to the NBBO (National Best Bid Offer) from the top-of-book price and size from each of the exchanges.

The enhanced Trade & Quote bar fields include the following fields:

  • Field: Name of Field.

  • Q / T: Field based on Quotes or Trades

  • Type: Field format

  • No Value: Value of field when there is no value or data.

    • Note: “Never” means field should always have a value EXCEPT for the first bar of the day.

  • Description: Description of the field.

See docs for additional detail.

| id | Field | Q / T | Type | No Value | Description |
|---|---|---|---|---|---|
| 1 | Date | | YYYYMMDD | Never | Trade Date |
| 2 | Ticker | | String | Never | Ticker Symbol |
| 3 | TimeBarStart | | HHMM, HHMMSS, HHMMSSMMM | Never | For minute bars: HHMM. For second bars: HHMMSS. Examples: one second bar 130302 is from time greater than 130301 to 130302; one minute bar 1104 is from time greater than 1103 to 1104. |
| 4 | OpenBarTime | Q | HHMMSSMMM | Never | Open Time of the Bar, for example one minute: 11:03:00.000 |
| 5 | OpenBidPrice | Q | Number | Never | NBBO Bid Price as of bar Open |
| 6 | OpenBidSize | Q | Number | Never | Total Size from all Exchanges with OpenBidPrice |
| 7 | OpenAskPrice | Q | Number | Never | NBBO Ask Price as of bar Open |
| 8 | OpenAskSize | Q | Number | Never | Total Size from all Exchanges with OpenAskPrice |
| 9 | FirstTradeTime | T | HHMMSSMMM | Blank | Time of first Trade |
| 10 | FirstTradePrice | T | Number | Blank | Price of first Trade |
| 11 | FirstTradeSize | T | Number | Blank | Number of shares of first trade |
| 12 | HighBidTime | Q | HHMMSSMMM | Never | Time of highest NBBO Bid Price |
| 13 | HighBidPrice | Q | Number | Never | Highest NBBO Bid Price |
| 14 | HighBidSize | Q | Number | Never | Total Size from all Exchanges with HighBidPrice |
| 15 | AskPriceAtHighBidPrice | Q | Number | Never | Ask Price at time of Highest Bid Price |
| 16 | AskSizeAtHighBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtHighBidPrice |
| 17 | HighTradeTime | T | HHMMSSMMM | Blank | Time of Highest Trade |
| 18 | HighTradePrice | T | Number | Blank | Price of highest Trade |
| 19 | HighTradeSize | T | Number | Blank | Number of shares of highest trade |
| 20 | LowBidTime | Q | HHMMSSMMM | Never | Time of lowest Bid |
| 21 | LowBidPrice | Q | Number | Never | Lowest NBBO Bid price of bar |
| 22 | LowBidSize | Q | Number | Never | Total Size from all Exchanges with LowBidPrice |
| 23 | AskPriceAtLowBidPrice | Q | Number | Never | Ask Price at lowest Bid price |
| 24 | AskSizeAtLowBidPrice | Q | Number | Never | Total Size from all Exchanges with AskPriceAtLowBidPrice |
| 25 | LowTradeTime | T | HHMMSSMMM | Blank | Time of lowest Trade |
| 26 | LowTradePrice | T | Number | Blank | Price of lowest Trade |
| 27 | LowTradeSize | T | Number | Blank | Number of shares of lowest trade |
| 28 | CloseBarTime | Q | HHMMSSMMM | Never | Close Time of the Bar, for example one minute: 11:03:59.999 |
| 29 | CloseBidPrice | Q | Number | Never | NBBO Bid Price at bar Close |
| 30 | CloseBidSize | Q | Number | Never | Total Size from all Exchanges with CloseBidPrice |
| 31 | CloseAskPrice | Q | Number | Never | NBBO Ask Price at bar Close |
| 32 | CloseAskSize | Q | Number | Never | Total Size from all Exchanges with CloseAskPrice |
| 33 | LastTradeTime | T | HHMMSSMMM | Blank | Time of last Trade |
| 34 | LastTradePrice | T | Number | Blank | Price of last Trade |
| 35 | LastTradeSize | T | Number | Blank | Number of shares of last trade |
| 36 | MinSpread | Q | Number | Never | Minimum Bid-Ask spread size. This may be 0 if the market was crossed during the bar. If negative spread due to back quote, make it 0. |
| 37 | MaxSpread | Q | Number | Never | Maximum Bid-Ask spread in bar |
| 38 | CancelSize | T | Number | 0 | Total shares canceled. Default=blank |
| 39 | VolumeWeightPrice | T | Number | Blank | Trade Volume weighted average price: Sum((Trade1Shares*Price)+(Trade2Shares*Price)+…)/TotalShares. Note: Blank if no trades. |
| 40 | NBBOQuoteCount | Q | Number | 0 | Number of Bid and Ask NBBO quotes during bar period |
| 41 | TradeAtBid | Q,T | Number | 0 | Sum of trade volume that occurred at or below the bid (a trade reported/printed late can be below current bid) |
| 42 | TradeAtBidMid | Q,T | Number | 0 | Sum of trade volume that occurred between the bid and the mid-point: (Trade Price > NBBO Bid) & (Trade Price < NBBO Mid) |
| 43 | TradeAtMid | Q,T | Number | 0 | Sum of trade volume that occurred at mid: TradePrice = NBBO MidPoint |
| 44 | TradeAtMidAsk | Q,T | Number | 0 | Sum of trade volume that occurred between the mid and ask: (Trade Price > NBBO Mid) & (Trade Price < NBBO Ask) |
| 45 | TradeAtAsk | Q,T | Number | 0 | Sum of trade volume that occurred at or above the Ask |
| 46 | TradeAtCrossOrLocked | Q,T | Number | 0 | Sum of trade volume for bar when national best bid/offer is locked or crossed. Locked is Bid = Ask. Crossed is Bid > Ask. |
| 47 | Volume | T | Number | 0 | Total number of shares traded |
| 48 | TotalTrades | T | Number | 0 | Total number of trades |
| 49 | FinraVolume | T | Number | 0 | Number of shares traded that are reported by FINRA. Trades reported by FINRA are from broker-dealer internalization, dark pools, Over-The-Counter, etc. FINRA trades represent volume that is hidden or not publicly available to trade. |
| 50 | UptickVolume | T | Integer | 0 | Total number of shares traded with upticks during bar. An uptick = (trade price > last trade price) |
| 51 | DowntickVolume | T | Integer | 0 | Total number of shares traded with downticks during bar. A downtick = (trade price < last trade price) |
| 52 | RepeatUptickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was up during bar. Repeat uptick = (trade price == last trade price) & (last tick direction == up) |
| 53 | RepeatDowntickVolume | T | Integer | 0 | Total number of shares where trade price is the same (repeated) and last price change was down during bar. Repeat downtick = (trade price == last trade price) & (last tick direction == down) |
| 54 | UnknownVolume | T | Integer | 0 | When the first trade of the day takes place, the tick direction is “unknown” as there is no previous Trade to compare it to. This field is the volume of the first trade after 4am and acts as an initiation value for the tick volume directions. In future this bar will be renamed to UnkownTickDirectionVolume. |
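To make the Date and TimeBarStart formats in the table concrete, here is a small sketch that combines the two fields into a single timestamp. It assumes both columns were read as strings and follow the YYYYMMDD and HHMM layouts listed above; the sample values are made up and the actual files may encode the time differently.

```python
import pandas as pd

# Hypothetical raw values, read as strings so that leading zeros survive
raw = pd.DataFrame({'Date': ['20150102', '20150102'],
                    'TimeBarStart': ['930', '1104']})

# Left-pad HHMM to four digits before concatenating with YYYYMMDD
stamp = raw['Date'] + raw['TimeBarStart'].str.zfill(4)
raw['date_time'] = pd.to_datetime(stamp, format='%Y%m%d%H%M')
```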

Notes

Empty Fields

An empty field has no value and is “Blank”, for example FirstTradeTime when there are no trades during the bar period. The field Volume, which measures the total number of shares traded in the bar, will be 0 if there are no trades (see the No Value column above for each field).
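The difference between a blank field and a zero-valued field shows up directly when parsing: a short sketch with a made-up two-row CSV (the column subset and values are assumptions for illustration) shows that pandas reads a blank price as NaN, whereas Volume stays a regular 0.

```python
import io
import pandas as pd

csv = io.StringIO(
    "TimeBarStart,FirstTradePrice,Volume\n"
    "0930,100.0,500\n"
    "0931,,0\n"  # bar without trades: blank price, zero volume
)
bars = pd.read_csv(csv, dtype={'TimeBarStart': str})

# Blank trade fields become NaN; the zero volume identifies bars without trades
no_trades = bars['Volume'].eq(0)
```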

No Bid/Ask/Trade OHLC

During a bar timeframe there may not be a change in the NBBO or an actual Trade. For example, there can be a bar with OHLC Bid/Ask but no Trade OHLC.

Single Event

For bars with only one trade, one NBBO bid or one NBBO ask, the Open/High/Low/Close price, size and time will be the same.

AskPriceAtHighBidPrice, AskSizeAtHighBidPrice, AskPriceAtLowBidPrice, AskSizeAtLowBidPrice Fields

To provide consistent Bid/Ask prices at a point in time while showing the low/high Bid/Ask for the bar, AlgoSeek uses the low/high Bid and the corresponding Ask at that price.

FAQ

Why are Trade Prices often inside the Bid Price to Ask Price range?

The Low/High Bid/Ask is the low and high NBBO price for the bar range. Very often a Trade may not occur at these prices as the price may only last a few seconds or executions are being crossed at mid-point due to hidden order types that execute at mid-point or as price improvement over current Bid/Ask.

How to get exchange tradable shares?

To get the exchange tradable volume in a bar, subtract FinraVolume from Volume.

  • Volume is the total number of shares traded.

  • FinraVolume is the total number of shares traded that are reported as executions by FINRA.

When a trade is done that is off the listed exchanges, it must be reported to FINRA by the brokerage firm or dark pool. Examples include:

  • internal crosses by broker dealer

  • over-the-counter block trades, and

  • dark pool executions.
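In code, the exchange-tradable volume is the difference between total and FINRA-reported volume. The sketch below uses made-up numbers and lower-cased column names; `exchange_volume` and `finra_share` are names chosen for this example only.

```python
import pandas as pd

bars = pd.DataFrame({'volume': [1000, 250, 0],
                     'finravolume': [400, 250, 0]})

# Shares that traded on the listed exchanges
bars['exchange_volume'] = bars.volume - bars.finravolume

# Off-exchange share of total volume; NaN for bars without any trades (0 / 0)
bars['finra_share'] = bars.finravolume.div(bars.volume)
```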

Data prep

We use the 'Trade and Quote' dataset - see documentation for details on the definition of the numerous fields.

tcols = ['openbartime', 'firsttradetime', 'highbidtime', 'highasktime', 'hightradetime', 'lowbidtime', 'lowasktime', 'lowtradetime', 'closebartime', 'lasttradetime']
drop_cols = ['unknowntickvolume', 'cancelsize', 'tradeatcrossorlocked']
keep = ['firsttradeprice', 'hightradeprice', 'lowtradeprice', 'lasttradeprice', 'minspread', 'maxspread', 'volumeweightprice', 'nbboquotecount', 'tradeatbid', 'tradeatbidmid', 'tradeatmid', 'tradeatmidask', 'tradeatask', 'volume', 'totaltrades', 'finravolume', 'finravolumeweightprice', 'uptickvolume', 'downtickvolume', 'repeatuptickvolume', 'repeatdowntickvolume', 'tradetomidvolweight', 'tradetomidvolweightrelative']

We will shorten most of the field names to reduce typing:

columns = {'volumeweightprice' : 'price', 'finravolume' : 'fvolume', 'finravolumeweightprice' : 'fprice', 'uptickvolume' : 'up', 'downtickvolume' : 'down', 'repeatuptickvolume' : 'rup', 'repeatdowntickvolume' : 'rdown', 'firsttradeprice' : 'first', 'hightradeprice' : 'high', 'lowtradeprice' : 'low', 'lasttradeprice' : 'last', 'nbboquotecount' : 'nbbo', 'totaltrades' : 'ntrades', 'openbidprice' : 'obprice', 'openbidsize' : 'obsize', 'openaskprice' : 'oaprice', 'openasksize' : 'oasize', 'highbidprice' : 'hbprice', 'highbidsize' : 'hbsize', 'highaskprice' : 'haprice', 'highasksize' : 'hasize', 'lowbidprice' : 'lbprice', 'lowbidsize' : 'lbsize', 'lowaskprice' : 'laprice', 'lowasksize' : 'lasize', 'closebidprice' : 'cbprice', 'closebidsize' : 'cbsize', 'closeaskprice' : 'caprice', 'closeasksize' : 'casize', 'firsttradesize' : 'firstsize', 'hightradesize' : 'highsize', 'lowtradesize' : 'lowsize', 'lasttradesize' : 'lastsize', 'tradetomidvolweight' : 'volweight', 'tradetomidvolweightrelative': 'volweightrel'}

The Algoseek minute-bar data comes in compressed csv files that contain the data for one symbol and day, organized in three directories for each year (2015-17). The function extract_and_combine_data reads the ~80K source files and combines them into a single hdf5 file for faster access.

The data is fairly large (>8GB), and if you run into memory constraints, please modify the code to process the data in smaller chunks. One option is to iterate over the three directories that each contain the data for a single year, and store each year separately.
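One way to process a single year at a time could look like the sketch below. The function `combine_by_year`, its `read_file` argument and the pickle output are assumptions made to keep the example self-contained; in the notebook you would pass the per-file reading logic and could write each year to its own HDF5 key instead.

```python
from pathlib import Path
import pandas as pd

def combine_by_year(root, out_dir, read_file):
    """Combine the files below each year directory separately.

    `read_file` is a hypothetical callable that turns one source
    file into a DataFrame.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for year_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        frames = [read_file(f) for f in sorted(year_dir.glob('**/*.csv.gz'))]
        if not frames:
            continue
        target = out_dir / f'{year_dir.name}.pkl'
        pd.concat(frames).to_pickle(target)
        written.append(target)
        del frames  # release this year's data before reading the next one
    return written
```

Only one year's worth of frames is held in memory at any time, at the cost of having to concatenate or query the yearly files later.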

nasdaq_path = Path('../data/nasdaq100')
def extract_and_combine_data():
    path = nasdaq_path / '1min_taq'

    data = []
    # ~80K files to process
    for f in tqdm(list(path.glob('*/**/*.csv.gz'))):
        data.append(pd.read_csv(f, parse_dates=[['Date', 'TimeBarStart']])
                    .rename(columns=str.lower)
                    .drop(tcols + drop_cols, axis=1)
                    .rename(columns=columns)
                    .set_index('date_timebarstart')
                    .sort_index()
                    .between_time('9:30', '16:00')
                    .set_index('ticker', append=True)
                    .swaplevel()
                    .rename(columns=lambda x: x.replace('tradeat', 'at')))
    data = pd.concat(data).apply(pd.to_numeric, downcast='integer')
    # Index.rename returns a new index, so the result needs to be assigned
    data.index = data.index.rename(['ticker', 'date_time'])
    print(data.info(show_counts=True))
    data.to_hdf(nasdaq_path / 'algoseek.h5', 'min_taq')
# extract_and_combine_data()

Loading Algoseek Data

ohlcv_cols = ['first', 'high', 'low', 'last', 'price', 'volume']
data_cols = ohlcv_cols + ['up', 'down', 'rup', 'rdown', 'atask', 'atbid']
with pd.HDFStore(nasdaq_path / 'algoseek.h5') as store:
    df = store['min_taq'].loc[:, data_cols].sort_index()
df['date'] = pd.to_datetime(df.index.get_level_values('date_time').date)

We persist the reduced dataset:

df.to_hdf('data/algoseek.h5', 'data')
df = pd.read_hdf('data/algoseek.h5', 'data')
df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 31355463 entries, ('AAL', Timestamp('2015-01-02 09:30:00')) to ('YHOO', Timestamp('2017-06-16 16:00:00'))
Data columns (total 13 columns):
 #   Column  Non-Null Count     Dtype
---  ------  --------------     -----
 0   first   30955838 non-null  float64
 1   high    30955838 non-null  float64
 2   low     30955838 non-null  float64
 3   last    30955838 non-null  float64
 4   price   30386944 non-null  float64
 5   volume  31355463 non-null  int32
 6   up      31355463 non-null  int32
 7   down    31355463 non-null  int32
 8   rup     31355463 non-null  int32
 9   rdown   31355463 non-null  int32
 10  atask   31355463 non-null  int32
 11  atbid   31355463 non-null  int32
 12  date    31355463 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(5), int32(7)
memory usage: 2.4+ GB

Feature Engineering

All of the features above were normalized in a standard fashion by subtracting their means, dividing by their standard deviations, and time-averaging over a recent interval. In order to obtain a finite state space, features were discretized into bins in multiples of standard deviation units.

We will compute features per ticker, or per ticker and date:

by_ticker = df.sort_index().groupby('ticker', group_keys=False)
by_ticker_date = df.sort_index().groupby(['ticker', 'date'])

Create empty DataFrame with original ticker/timestamp index to hold our features:

data = pd.DataFrame(index=df.index)
data['date'] = pd.factorize(df['date'], sort=True)[0]
data['minute'] = pd.to_timedelta(data.index.get_level_values('date_time').time.astype(str))
data.minute = (data.minute.dt.seconds.sub(data.minute.dt.seconds.min()).div(60).astype(int))
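The `minute` feature counts minutes since the earliest bar time in the data. A compact way to see the intended result on a toy index, using the hour and minute accessors rather than the timedelta conversion above (the variable names are chosen for this example):

```python
import pandas as pd

ts = pd.DatetimeIndex(['2015-01-02 09:30', '2015-01-02 09:31',
                       '2015-01-02 16:00'])

# Minutes since midnight, then shifted so the earliest bar time maps to 0
minutes = ts.hour * 60 + ts.minute
minute_of_session = minutes - minutes.min()
```

With regular trading hours from 9:30 to 16:00, the feature runs from 0 to 390.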

Lagged Returns

We create lagged returns with respect to the first and last price per bar for each of the past 10 minutes:

data[f'ret1min'] = df['last'].div(df['first']).sub(1)

1-min returns have rather heavy tails:

sns.kdeplot(data.ret1min.sample(n=100000));
data.ret1min.describe(percentiles=np.arange(.1, 1, .1)).iloc[1:].apply(lambda x: f'{x:.3%}')
mean     -0.000%
std       0.086%
min     -12.448%
10%      -0.075%
20%      -0.041%
30%      -0.023%
40%      -0.009%
50%       0.000%
60%       0.009%
70%       0.022%
80%       0.040%
90%       0.074%
max      13.392%
Name: ret1min, dtype: object
print(f'Skew: {data.ret1min.skew():.2f} | Kurtosis: {data.ret1min.kurtosis():.2f}')
Skew: 0.63 | Kurtosis: 399.53

Intra-bar price moves with the highest returns:

data.join(df[['first', 'last']]).nlargest(10, columns=['ret1min'])

We compute similarly for the remaining periods:

for t in tqdm(range(2, 11)):
    data[f'ret{t}min'] = df['last'].div(by_ticker_date['first'].shift(t-1)).sub(1)
100%|██████████| 9/9 [00:20<00:00, 2.24s/it]
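To see what the multi-period returns measure, the following toy example repeats the same computation for a single ticker and four bars (the prices are made up). The t-minute return divides the last price of the current bar by the first price of the bar t-1 minutes earlier on the same day.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['AAA'], pd.date_range('2015-01-02 09:30', periods=4, freq='min')],
    names=['ticker', 'date_time'])
toy = pd.DataFrame({'first': [100., 101., 102., 103.],
                    'last': [101., 102., 103., 104.]}, index=idx)
toy['date'] = toy.index.get_level_values('date_time').normalize()

by_ticker_date = toy.groupby(['ticker', 'date'])
rets = pd.DataFrame(index=toy.index)
rets['ret1min'] = toy['last'].div(toy['first']).sub(1)
for t in (2, 3):
    # first price from t-1 bars earlier; NaN where the day has too few prior bars
    rets[f'ret{t}min'] = toy['last'].div(by_ticker_date['first'].shift(t - 1)).sub(1)
```

The first t-1 bars of each day have no t-minute return, which explains the declining non-null counts for longer lags.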

Forward Returns

We obtain our 1-min forward return target by shifting the one-period return by one minute into the past (which implies the assumption that we always enter and exit a position at those prices, also ignoring trading cost and potential market impact):

data['fwd1min'] = (data .sort_index() .groupby(['ticker', 'date']) .ret1min .shift(-1))
data = data.dropna(subset=['fwd1min'])
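Grouping by ticker and date before shifting matters because it prevents the last bar of one day from receiving the first return of the next day as its target. A toy example with two bars on each of two days (made-up returns) illustrates this:

```python
import pandas as pd

times = pd.to_datetime(['2015-01-02 15:59', '2015-01-02 16:00',
                        '2015-01-05 09:30', '2015-01-05 09:31'])
idx = pd.MultiIndex.from_product([['AAA'], times],
                                 names=['ticker', 'date_time'])
toy = pd.DataFrame({'ret1min': [0.001, -0.002, 0.003, 0.004]}, index=idx)
toy['date'] = toy.index.get_level_values('date_time').normalize()

# The target for each bar is the next bar's return on the same day
toy['fwd1min'] = toy.groupby(['ticker', 'date']).ret1min.shift(-1)
```

The last bar of each day ends up with a missing target, which is why those rows are dropped above.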
data.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 30875649 entries, ('AAL', Timestamp('2015-01-02 09:30:00')) to ('YHOO', Timestamp('2017-06-16 15:59:00'))
Data columns (total 13 columns):
 #   Column    Non-Null Count     Dtype
---  ------    --------------     -----
 0   date      30875649 non-null  int64
 1   minute    30875649 non-null  int64
 2   ret1min   30612848 non-null  float64
 3   ret2min   30302846 non-null  float64
 4   ret3min   30220887 non-null  float64
 5   ret4min   30141503 non-null  float64
 6   ret5min   30063236 non-null  float64
 7   ret6min   29983969 non-null  float64
 8   ret7min   29903822 non-null  float64
 9   ret8min   29824607 non-null  float64
 10  ret9min   29745431 non-null  float64
 11  ret10min  29666821 non-null  float64
 12  fwd1min   30875649 non-null  float64
dtypes: float64(11), int64(2)
memory usage: 3.2+ GB

Normalized up/downtick volume

for f in ['up', 'down', 'rup', 'rdown']:
    data[f] = df.loc[:, f].div(df.volume).replace(np.inf, np.nan)
data.loc[:, ['rup', 'up', 'rdown', 'down']].describe(deciles)
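The normalization expresses each tick-direction volume as a share of total bar volume. A toy example with made-up numbers shows how bars without trades are handled: dividing 0 by 0 already yields NaN, and any infinite ratio is replaced by NaN as well.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'up': [300, 0, 50],
                    'down': [100, 0, 50],
                    'volume': [1000, 0, 100]})

# Share of bar volume traded on upticks and downticks
shares = (toy[['up', 'down']]
          .div(toy.volume, axis=0)
          .replace([np.inf, -np.inf], np.nan))
```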

Balance of Power

data['BOP'] = (by_ticker.apply(lambda x: talib.BOP(x['first'], x.high, x.low, x['last'])))
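Balance of Power relates the move from open to close to the bar's high-low range, here with the first and last trade price standing in for open and close. The pandas sketch below spells out that formula; the function name and the choice to return 0 for bars without a range are assumptions of this example, and it is not guaranteed to match the TA-Lib output in every edge case.

```python
import pandas as pd

def balance_of_power(open_, high, low, close):
    """(close - open) / (high - low), set to 0 where the bar has no range."""
    rng = high - low
    bop = (close - open_) / rng.where(rng > 0)
    return bop.mask(rng <= 0, 0.0)

bars = pd.DataFrame({'first': [10., 10.], 'high': [12., 10.],
                     'low': [8., 10.], 'last': [11., 10.]})
bop = balance_of_power(bars['first'], bars.high, bars.low, bars['last'])
```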

Commodity Channel Index

data['CCI'] = (by_ticker.apply(lambda x: talib.CCI(x.high, x.low, x['last'], timeperiod=14)))

Money Flow Index

data['MFI'] = (by_ticker.apply(lambda x: talib.MFI(x.high, x.low, x['last'], x.volume, timeperiod=14)))
data[['BOP', 'CCI', 'MFI']].describe(deciles)

Stochastic RSI

data['STOCHRSI'] = (by_ticker.apply(lambda x: talib.STOCHRSI(x['last'].ffill(), timeperiod=14, fastk_period=14, fastd_period=3, fastd_matype=0)[0]))

Stochastic Oscillator

def compute_stoch(x, fastk_period=14, slowk_period=3, slowk_matype=0, slowd_period=3, slowd_matype=0):
    slowk, slowd = talib.STOCH(x.high.ffill(), x.low.ffill(), x['last'].ffill(),
                               fastk_period=fastk_period,
                               slowk_period=slowk_period,
                               slowk_matype=slowk_matype,
                               slowd_period=slowd_period,
                               slowd_matype=slowd_matype)
    return pd.DataFrame({'slowd': slowd, 'slowk': slowk}, index=x.index)
data = data.join(by_ticker.apply(compute_stoch))
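For intuition, the slow stochastic oscillator can be written in a few lines of pandas: the fast %K locates the close within the high-low range of the lookback window, and the slow %K and %D are successive simple moving averages of it. This is a sketch under the assumption of simple moving averages (moving-average type 0); the function name is hypothetical and the values may differ slightly from TA-Lib at the edges.

```python
import pandas as pd

def slow_stochastic(high, low, close, fastk_period=14,
                    slowk_period=3, slowd_period=3):
    lowest = low.rolling(fastk_period).min()
    highest = high.rolling(fastk_period).max()
    # Position of the close within the recent range, scaled to 0-100
    fastk = 100 * (close - lowest) / (highest - lowest)
    slowk = fastk.rolling(slowk_period).mean()
    slowd = slowk.rolling(slowd_period).mean()
    return pd.DataFrame({'slowk': slowk, 'slowd': slowd})

# Steadily rising prices that close at the high pin the oscillator at 100
close = pd.Series(range(1, 31), dtype=float)
stoch = slow_stochastic(high=close, low=close - 1, close=close)
```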

Normalized Average True Range

data['NATR'] = by_ticker.apply(lambda x: talib.NATR(x.high.ffill(), x.low.ffill(), x['last'].ffill()))

Transaction Volume by price point

data['trades_bid_ask'] = df.atask.sub(df.atbid).div(df.volume).replace((np.inf, -np.inf), np.nan)
del df
data.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 30875649 entries, ('AAL', Timestamp('2015-01-02 09:30:00')) to ('YHOO', Timestamp('2017-06-16 15:59:00'))
Data columns (total 25 columns):
 #   Column          Non-Null Count     Dtype
---  ------          --------------     -----
 0   date            30875649 non-null  int64
 1   minute          30875649 non-null  int64
 2   ret1min         30612848 non-null  float64
 3   ret2min         30302846 non-null  float64
 4   ret3min         30220887 non-null  float64
 5   ret4min         30141503 non-null  float64
 6   ret5min         30063236 non-null  float64
 7   ret6min         29983969 non-null  float64
 8   ret7min         29903822 non-null  float64
 9   ret8min         29824607 non-null  float64
 10  ret9min         29745431 non-null  float64
 11  ret10min        29666821 non-null  float64
 12  fwd1min         30875649 non-null  float64
 13  up              30083777 non-null  float64
 14  down            30083777 non-null  float64
 15  rup             30083777 non-null  float64
 16  rdown           30083777 non-null  float64
 17  BOP             30612848 non-null  float64
 18  CCI             28517773 non-null  float64
 19  MFI             30873719 non-null  float64
 20  STOCHRSI        30871639 non-null  float64
 21  slowd           30873302 non-null  float64
 22  slowk           30873302 non-null  float64
 23  NATR            30873719 non-null  float64
 24  trades_bid_ask  30083777 non-null  float64
dtypes: float64(23), int64(2)
memory usage: 6.9+ GB

Evaluate features

features = ['ret1min', 'ret2min', 'ret3min', 'ret4min', 'ret5min', 'ret6min', 'ret7min', 'ret8min', 'ret9min', 'ret10min', 'rup', 'up', 'down', 'rdown', 'BOP', 'CCI', 'MFI', 'STOCHRSI', 'slowk', 'slowd', 'trades_bid_ask']
sample = data.sample(n=100000)
fig, axes = plt.subplots(nrows=3, ncols=7, figsize=(30, 12))
axes = axes.flatten()

for i, feature in enumerate(features):
    sns.distplot(sample[feature], ax=axes[i])
    axes[i].set_title(feature.upper())

sns.despine()
fig.tight_layout()
sns.pairplot(sample, y_vars=['fwd1min'], x_vars=features);
corr = sample.loc[:, features].corr()
sns.clustermap(corr, cmap = sns.diverging_palette(20, 230, as_cmap=True), center=0, vmin=-.25);
ic = {}
for feature in tqdm(features):
    df = data[['fwd1min', feature]].dropna()
    by_day = df.groupby(df.index.get_level_values('date_time').date) # calc per min is very time-consuming
    ic[feature] = by_day.apply(lambda x: spearmanr(x.fwd1min, x[feature])[0]).mean()
ic = pd.Series(ic)
100%|██████████| 21/21 [04:44<00:00, 13.54s/it]
ic.sort_values()
STOCHRSI         -0.015177
ret4min          -0.013636
CCI              -0.012663
ret5min          -0.012534
ret3min          -0.012235
ret9min          -0.010546
ret8min          -0.009978
ret2min          -0.009834
ret7min          -0.009678
ret10min         -0.009596
ret1min          -0.009468
ret6min          -0.009047
rup              -0.008965
BOP              -0.006312
slowk            -0.005720
trades_bid_ask   -0.005406
MFI              -0.003847
slowd            -0.001772
up               -0.000832
down              0.000038
rdown             0.010978
dtype: float64
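The information coefficient used here is the average of the daily Spearman rank correlations between a feature and the forward return. The following sketch reproduces that logic on a tiny made-up dataset, computing the Spearman correlation as the Pearson correlation of ranks; the function `daily_ic` and the toy column `signal` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def daily_ic(frame, feature, target='fwd1min', day='date'):
    """Mean of the per-day Spearman correlations of feature and target."""
    clean = frame[[day, feature, target]].dropna()
    # Spearman correlation equals the Pearson correlation of the ranks
    ics = [g[feature].rank().corr(g[target].rank())
           for _, g in clean.groupby(day)]
    return float(np.mean(ics))

toy = pd.DataFrame({
    'date': ['A'] * 3 + ['B'] * 3,
    'signal': [1, 2, 3, 1, 2, 3],
    'fwd1min': [0.1, 0.2, 0.3, 0.3, 0.1, 0.2],
})
toy_ic = daily_ic(toy, 'signal')  # day A: 1.0, day B: -0.5
```

Averaging per-day correlations weights each day equally, regardless of how many bars it contains.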
title = 'Information Coefficient for Intraday Features (1-min forward returns)'
ic.index = ic.index.map(str.upper)
ax = ic.sort_values(ascending=False).plot.bar(figsize=(14, 4), title=title, rot=35)
ax.set_ylabel('Information Coefficient')
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.1%}'.format(y)))
sns.despine()
plt.tight_layout();

Store results

data.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 30875649 entries, ('AAL', Timestamp('2015-01-02 09:30:00')) to ('YHOO', Timestamp('2017-06-16 15:59:00'))
Data columns (total 25 columns):
 #   Column          Non-Null Count     Dtype
---  ------          --------------     -----
 0   date            30875649 non-null  int64
 1   minute          30875649 non-null  int64
 2   ret1min         30612848 non-null  float64
 3   ret2min         30302846 non-null  float64
 4   ret3min         30220887 non-null  float64
 5   ret4min         30141503 non-null  float64
 6   ret5min         30063236 non-null  float64
 7   ret6min         29983969 non-null  float64
 8   ret7min         29903822 non-null  float64
 9   ret8min         29824607 non-null  float64
 10  ret9min         29745431 non-null  float64
 11  ret10min        29666821 non-null  float64
 12  fwd1min         30875649 non-null  float64
 13  up              30083777 non-null  float64
 14  down            30083777 non-null  float64
 15  rup             30083777 non-null  float64
 16  rdown           30083777 non-null  float64
 17  BOP             30612848 non-null  float64
 18  CCI             28517773 non-null  float64
 19  MFI             30873719 non-null  float64
 20  STOCHRSI        30871639 non-null  float64
 21  slowd           30873302 non-null  float64
 22  slowk           30873302 non-null  float64
 23  NATR            30873719 non-null  float64
 24  trades_bid_ask  30083777 non-null  float64
dtypes: float64(23), int64(2)
memory usage: 6.9+ GB
data.drop(['date', 'up', 'down'], axis=1).to_hdf('data/algoseek.h5', 'model_data')