Path: blob/master/19_recurrent_neural_nets/00_build_dataset.ipynb
2923 views
Kernel: Python 3
Create a dataset formatted for RNN examples
Imports & Settings
In [1]:
In [2]:
In [3]:
In [4]:
Build daily dataset
In [5]:
In [6]:
Out[6]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 5698754 entries, (Timestamp('2010-01-04 00:00:00'), 'A') to (Timestamp('2017-12-29 00:00:00'), 'ZUMZ')
Data columns (total 2 columns):
# Column Dtype
--- ------ -----
0 adj_close float64
1 adj_volume float64
dtypes: float64(2)
memory usage: 109.5+ MB
Select most traded stocks
In [7]:
In [8]:
In [9]:
Out[9]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2013 entries, 2017-12-29 to 2010-01-04
Columns: 500 entries, AAPL to CNC
dtypes: float64(500)
memory usage: 7.7 MB
Stack 21-day time series
In [10]:
In [11]:
Out[11]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 995499 entries, ('A', Timestamp('2010-02-04 00:00:00')) to ('ZION', Timestamp('2017-12-29 00:00:00'))
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 995499 non-null float64
1 1 995499 non-null float64
2 2 995499 non-null float64
3 3 995499 non-null float64
4 4 995499 non-null float64
5 5 995499 non-null float64
6 6 995499 non-null float64
7 7 995499 non-null float64
8 8 995499 non-null float64
9 9 995499 non-null float64
10 10 995499 non-null float64
11 11 995499 non-null float64
12 12 995499 non-null float64
13 13 995499 non-null float64
14 14 995499 non-null float64
15 15 995499 non-null float64
16 16 995499 non-null float64
17 17 995499 non-null float64
18 18 995499 non-null float64
19 19 995499 non-null float64
20 20 995499 non-null float64
21 21 995499 non-null float64
dtypes: float64(22)
memory usage: 171.0+ MB
In [12]:
Out[12]:
(995499, 22)
In [13]:
Build weekly dataset
We load the Quandl adjusted stock price data:
In [14]:
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2896 entries, 2007-01-01 to 2018-03-27
Columns: 3199 entries, A to ZUMZ
dtypes: float64(3199)
memory usage: 70.7 MB
Resample to weekly frequency
We start by generating weekly returns for close to 2,500 stocks without missing data for the 2008-17 period, as follows:
In [15]:
Out[15]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 522 entries, 2017-12-31 to 2008-01-06
Freq: -1W-SUN
Columns: 2489 entries, A to ZUMZ
dtypes: float64(2489)
memory usage: 9.9 MB
In [16]:
Out[16]:
Create & stack 52-week sequences
We'll use 52-week sequences, which we'll create in a stacked format:
In [17]:
In [18]:
Out[18]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1167341 entries, ('A', Timestamp('2017-12-31 00:00:00')) to ('ZUMZ', Timestamp('2009-01-11 00:00:00'))
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 1167341 non-null float64
1 1 1167341 non-null float64
2 2 1167341 non-null float64
3 3 1167341 non-null float64
4 4 1167341 non-null float64
5 5 1167341 non-null float64
6 6 1167341 non-null float64
7 7 1167341 non-null float64
8 8 1167341 non-null float64
9 9 1167341 non-null float64
10 10 1167341 non-null float64
11 11 1167341 non-null float64
12 12 1167341 non-null float64
13 13 1167341 non-null float64
14 14 1167341 non-null float64
15 15 1167341 non-null float64
16 16 1167341 non-null float64
17 17 1167341 non-null float64
18 18 1167341 non-null float64
19 19 1167341 non-null float64
20 20 1167341 non-null float64
21 21 1167341 non-null float64
22 22 1167341 non-null float64
23 23 1167341 non-null float64
24 24 1167341 non-null float64
25 25 1167341 non-null float64
26 26 1167341 non-null float64
27 27 1167341 non-null float64
28 28 1167341 non-null float64
29 29 1167341 non-null float64
30 30 1167341 non-null float64
31 31 1167341 non-null float64
32 32 1167341 non-null float64
33 33 1167341 non-null float64
34 34 1167341 non-null float64
35 35 1167341 non-null float64
36 36 1167341 non-null float64
37 37 1167341 non-null float64
38 38 1167341 non-null float64
39 39 1167341 non-null float64
40 40 1167341 non-null float64
41 41 1167341 non-null float64
42 42 1167341 non-null float64
43 43 1167341 non-null float64
44 44 1167341 non-null float64
45 45 1167341 non-null float64
46 46 1167341 non-null float64
47 47 1167341 non-null float64
48 48 1167341 non-null float64
49 49 1167341 non-null float64
50 50 1167341 non-null float64
51 51 1167341 non-null float64
52 52 1167341 non-null float64
dtypes: float64(53)
memory usage: 476.6+ MB
In [19]:
In [20]:
In [21]:
In [22]:
Out[22]:
(1167341, 54)
In [23]: