Pandas and first applications
Camilo A. Garcia Trillos - 2020
Objective
Our objective in this part of the course is to develop a general framework to apply in practice the theory we have studied.
We discuss several important elements to take into account in most applications.
This discussion is then illustrated by means of some simple applications using actual market data and Python routines.
Goal, Method, Model and Data
The first question is always: what do we want to study or understand? That is, we start by fixing our goal.
To answer this question, we choose a methodology (that is, a theoretical framework like the ones we presented in class).
The methodology is usually accompanied by a model, a representation of the system we want to study.
This model is fitted using relevant data.
Our discussion today will be focused on the data.
Data
Getting data
Which data? Relation between data, goal and model
How? Build, buy and/or gather
Public: https://data.gov.uk/
Paid, research-oriented: CRSP, Compustat, Wharton Research
Financial data providers: Bloomberg, Reuters (Refinitiv), Quandl, RavenPack
Free (limited): yahoo.finance
Know your data: tickers, meaning of different variables
What to do with data?
Understand data: Explore and visualise the data.
Clean data
Changes in time
Gaps
Frequencies
Outliers
Correct merging or manipulation errors
Make calculations
Understand and test results.
Using Data
Here, we are going to learn some simple routines to work with data within Python, using the pandas package.
Students should also have a look at tools like SQL.
Dataframes
Importing a database
I downloaded some data from Quandl (www.quandl.com) some time ago [1].
The data is stored in a CSV (comma-separated values) file. This is a popular plain-text format for column-structured data. We will look at five years of stock prices for the Apple share, from 2012 to 2017.
The following command reads a CSV file and creates a dataframe in pandas.
[1] Before, it was possible to use a Google Finance API to download financial information. It seems it is temporarily closed.
```python
import pandas as pd

# read the CSV file (which is in the Data folder) and create a dataframe called AAPL
AAPL = pd.read_csv('~/Data/AAPL_20171201_5y.csv')
```
```python
type(AAPL)  # pandas.core.frame.DataFrame
```
A dataframe is the main type of structure provided by pandas: it implements the notion of a database table.
Let us look at some of the methods associated with a dataframe.
The method describe presents some general statistics on the numerical data in the database.
The methods head and tail show respectively the first (5, if no argument given) and last entries of the database.
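For instance (in the notebook, each of these would go in its own cell):

```python
AAPL.describe()  # summary statistics of the numerical columns
AAPL.head()      # first five rows
AAPL.tail()      # last five rows
```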
These commands give us an idea of what is in the database: it is composed of many rows (1487) with 14 different named fields, plus one column with no name and consecutive numbers. This column is an index, and we will come back to it later.
Note that we have a mix of different types of entries. Indeed, let us look at the types of each one of the columns.
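For instance:

```python
AAPL.dtypes  # the type of each column
```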
The above command shows us a list of all fields in the dataframe and the type of each one of them. Most are floating-point numbers (float64), and two are of 'object' type (more on this in a moment).
Let us discuss the meaning of each one of these entries:
ticker: the identifier of the stock to which this information belongs. In this case, Apple Inc., whose ticker is AAPL.
date: the date at which the remaining information is collected.
The following data are unadjusted, i.e., they are presented as they occurred on the given date.
open: the price of the stock at the start of the trading day.
high: the maximum price the stock achieved during the trading day.
low: the minimum price the stock reached during the trading day.
close: the price of the stock at the end of the trading day.
volume: the number of shares of this stock exchanged during the trading day.
ex-dividend: the ex-dividend date is a financial concept: if you purchase a stock on its ex-dividend date or after, you will not receive the next dividend payment; instead, the seller gets the dividend. If you purchase before the ex-dividend date and hold the position until that date, you get the dividend. This entry shows the dividend paid on any ex-dividend date, and 0.0 otherwise.
split_ratio: shows any split that occurred on the given date, and 1.0 otherwise.
The following data are adjusted. Adjusted data take the information on dividends and splits and modify the historical records to obtain prices as if no dividends were paid and no splits occurred. This is convenient for analyses where we work with total values.
adj_open
adj_high
adj_low
adj_close
adj_volume
Series
Dataframes are composed of series, which are, in turn, collections of individual (or scalar) entries.
We can extract and operate on series by passing the name of the entry to the database.
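For instance, something like:

```python
AAPL['adj_low']   # the series of adjusted daily lows
AAPL['adj_high']  # the series of adjusted daily highs
```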
Above we extracted the series of adjusted low and high. Note that the corresponding indices are retrieved as well.
Several operations can then be performed on these series. For example, we can create a new series containing the range of adjusted values for each date:
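```python
AAPL['adj_high'] - AAPL['adj_low']  # range of adjusted values for each date
```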
We can add the newly created ranges to our database as a new column, as follows:
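```python
# the column name 'adj_range' is a choice made here for illustration
AAPL['adj_range'] = AAPL['adj_high'] - AAPL['adj_low']
```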
This is useful for producing calculated columns.
Checking for data integrity and fixing Indices
Dataframes and series are indexed. The role of the index is to allow us to retrieve a particular row: it must be a unique identifier of each row of data.
Here are some examples:
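```python
AAPL.iloc[0]   # retrieve the first row by its position
AAPL.iloc[10]  # retrieve the eleventh row
```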
Unfortunately, the automatic index that was created is not very informative. We can, instead, recall our description of the data and set the date as the index. Before doing so, let us verify that it is, indeed, a date.
We can see that the date is not treated as a date but as a string. The reason I suspected this is that when we looked at the types of each variable, the date was listed as 'object'. It would be convenient to recast it, that is, to transform it into a date.
We can modify any column by assigning the result of another series, just as we did to create a new column. In this case, we use a function to turn strings into dates.
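A sketch using pandas' to_datetime:

```python
AAPL['date'] = pd.to_datetime(AAPL['date'])  # recast the strings as datetimes
```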
We check again the types after the transformation
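```python
AAPL.dtypes  # 'date' should now appear as datetime64[ns]
```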
Dates are now effectively treated as dates. This is convenient as there are many supported modules that act on dates. Now, as announced before, we can set the index to be the date.
Before setting the date as the index, let us check that there are no repeated date values (of course, we do not expect any).
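For instance, with the duplicated method:

```python
AAPL['date'].duplicated().any()  # True if some date appears more than once
```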
Indeed, there are no duplicated dates. Compare with this code, finding duplicated entries of 'volume':
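```python
AAPL[AAPL.duplicated(subset='volume', keep=False)]  # rows sharing the same unadjusted volume
```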
We see that on two dates in the range we got the exact same amount of unadjusted volume. See the help on duplicated for more about its options.
Now, having verified that there are no duplicates in the date, we can check that there are no missing data. The database we obtained is of high quality and no missing entries should remain, but we should always check for this.
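For instance:

```python
AAPL.isna().sum()  # count of missing entries in each column
```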
As shown above, none of the fields has missing values.
We are then ready to set the date field as the index with the following instruction
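```python
AAPL.set_index('date', inplace=True)
```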
Mind the inplace part of the call: otherwise, the function returns a copy of the dataframe with the modified index.
Now, we can simply consult the data for a given date. For instance
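```python
# any trading day in the sample works; this date is illustrative
AAPL.loc['2014-06-06']
```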
Note that this time we did not use iloc but loc. The reason is that iloc is meant primarily for integers (positions), while loc works with labels, like the date we gave above.
Now, if we want only a subset of the data, we can write
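```python
# an illustrative date range and pair of columns
AAPL.loc['2014-06-02':'2014-06-13', ['open', 'close']]
```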
Sampling a dataframe
One of the advantages of having a date index is that we can resample the data. Indeed, let's say that we are interested in weekly returns, but we only have daily data, as above.
We can then resample the data using the power of pandas. Let us suppose we take the data of each Friday as representative of the week. We would simply write
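```python
AAPL_weekly = AAPL.resample('W-FRI').last()  # weekly bins ending on Fridays, keeping the last value
```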
This takes the data in bins of one week ending on a Friday, and takes the last value as representative (i.e. the value on the Friday). To verify this, let us compare with the value in the original dataframe on the first Friday.
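(Assuming, as in the exercises below, that the sample starts on 3 January 2012, the first Friday is 2012-01-06.)

```python
AAPL_weekly.iloc[0]     # first weekly observation
AAPL.loc['2012-01-06']  # the first Friday in the original data
```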
Of course, some of the data is no longer an accurate summary. We would like to have the opening, high, low and volume of the week (and not the opening of the Friday). We can call different versions of the resample function to adjust for this. Look at the following code
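```python
AAPL_weekly = AAPL.resample('W-FRI').agg({
    'open':      'first',  # opening price of the week
    'high':      'max',    # weekly high
    'low':       'min',    # weekly low
    'close':     'last',   # closing price of the week
    'volume':    'sum',    # total volume traded over the week
    'adj_close': 'last',   # kept for the plots below
})
```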
Warning: resampling should be used carefully, as we would need to account individually for splits and dividends, which might distort the available information. The above example is mainly illustrative, and we need to be careful when working with non-adjusted resampled data.
Visualising data
Plotting a price series
Let's next plot the daily closing prices, and then the weekly returns.
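A minimal sketch, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

AAPL['close'].plot(title='AAPL daily close (unadjusted)')
plt.show()
```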
We see a huge drop in the closing value of Apple shares around mid-2014... it seems like a very bad day.
Actually, the story is different. Let us look at the splits and dividends paid on this stock
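```python
# days on which a split occurred or a dividend was paid
AAPL[(AAPL['split_ratio'] != 1.0) | (AAPL['ex-dividend'] > 0)]
```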
We can see that the hit in price is due to a split_ratio of 7 (i.e. each stock was exchanged for 7 new stocks on that day). This, of course, divided the value of each new stock by 7.
To compensate for this, and the dividend effect, adjusted series are frequently used when analysing stocks historically. Look at this plot now
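```python
AAPL['adj_close'].plot(title='AAPL daily close (adjusted)')
plt.show()
```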
This plot shows a very different story indeed!
Warning: in certain cases, adjusted data introduce the problem of the timing of the adjustment: sometimes, as when backtesting investment strategies, these timings might lead to mistakes.
To check, let us plot the weekly equivalent.
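```python
AAPL_weekly['adj_close'].plot(title='AAPL weekly close (adjusted)')
plt.show()
```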
Another useful plot appears when plotting simultaneously the adjusted high and low of each day. This gives us an idea of the intraday volatility
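```python
AAPL[['adj_high', 'adj_low']].plot(title='Adjusted daily high and low')
plt.show()
```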
A histogram of log-returns
Let's compute the weekly (gross) log returns. We can plot a series as a function of time as before. However, this time we might instead look at the histogram of past log-returns. We have, for each week $t$, the log return $r_t = \log S_t - \log S_{t-1}$, where $S_t$ denotes the (adjusted) price.
A frequent assumption is that the sequence of log-returns is stationary. Note that this is exactly the type of assumption behind the geometric Brownian motion model, under which
$$ S_t = S_0 \exp\left( \left(\mu - \frac{\sigma^2}{2}\right) t + \sigma W_t \right), $$
with $\mu$ and $\sigma$ constant numbers and $W$ a Brownian motion (so that log-returns over disjoint periods of equal length are i.i.d. Gaussian).
In the following, we calculate the log-returns and check if the process seems to be stationary. If so, we look at its associated histogram.
We compute first the log prices and then find the differences.
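```python
import numpy as np

log_prices = np.log(AAPL_weekly['adj_close'])  # weekly log prices
log_returns = log_prices.diff()                # differences give the weekly log returns
log_returns.head()
```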
It worked well... except for the first entry (clearly, we cannot calculate a return for the first observation). We just drop this row as follows
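```python
log_returns = log_returns.dropna()  # remove the initial NaN
```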
Let us look at a time plot of the log returns
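```python
log_returns.plot(title='Weekly log returns')
plt.show()
```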
The plot does not seem to show any trend in mean: points seem to oscillate around a stable value. There seems to be a very slight change in volatility (points seem to be more clustered to the right of the plot).
We can use again the sampling functions with different summarising measures to check for the potential stability of means.
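```python
log_returns.resample('6M').mean()  # 6-month averages of weekly log returns
log_returns.resample('6M').std()   # 6-month realised volatility
```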
We conclude that some oscillation in the realised 6-month volatility is observed. Note that even if the true volatility were constant this might be expected, as a 6-month sample means taking the realised volatility over around 6 × 4 = 24 weeks, which is still a relatively small sample.
A statistical test could be performed to check if there is a first order dependence between two successive values.
Augmented Dickey-Fuller test: this statistical test is based on fitting an autoregressive model to the data as follows:
$$ \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t, $$
where $\varepsilon_t$ is a noise term.
If the process is stationary (up to at most a trend), we expect a tendency to return to the mean: we then expect $\gamma < 0$ in this case. The ADF test has as null hypothesis that $\gamma = 0$ and checks it against the alternative $\gamma < 0$.
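A sketch using the adfuller function from statsmodels:

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *rest = adfuller(log_returns)
print(adf_stat, p_value)  # a small p-value rejects the unit-root null
```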
Since we reject the null hypothesis, we get $\gamma < 0$. All in all, we do not have enough evidence to reject stationarity.
Can we fit a known distribution?
With the stationarity assumption, we can now focus on understanding whether the data fits a known distribution. Let us look at the histogram of log-returns.
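```python
log_returns.hist(bins=50)
plt.show()
```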
It does not look very 'Gaussian'. For good measure, we use also a qqplot. It can be found on the stats submodule of Scipy
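```python
from scipy import stats

stats.probplot(log_returns, dist='norm', plot=plt)  # QQ-plot against a fitted Gaussian
plt.show()
```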
The probplot function produces a qqplot of the empirical weekly data against the best fit of a Gaussian distribution (using maximum likelihood). The result shows that the weekly returns do not look so far from Gaussian, and for many applications they can be assumed so.
Note, however, that the Gaussian distribution largely underestimates negative returns (i.e. losses) that are larger than 2 standard deviations.
Indeed, as in the course, the quantiles of the Gaussian approximation can be found by taking $\hat{\mu} + \hat{\sigma}\,\Phi^{-1}(\alpha)$, where $\Phi^{-1}$ is the standard Gaussian quantile function. We get
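```python
# the level alpha is illustrative
alpha = 0.05
mu_hat, sigma_hat = log_returns.mean(), log_returns.std()
print(mu_hat + sigma_hat * stats.norm.ppf(alpha))  # Gaussian approximation of the quantile
print(log_returns.quantile(alpha))                 # empirical counterpart
```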
Clearly, we underestimate possible losses. This is inconvenient for VaR computations.
Exercises
1. Using the information on the Apple stock in this notebook, calculate a time series with the mean and standard deviation of daily returns within each month.
2. Assume that an investor buys 1000 USD worth of Apple shares on 3 January 2012.
a) What would be the total value of the investment (including dividends) on 1 December 2017?
b) Make a histogram of the daily net returns of this investor, and check for normality.
c) Find the VaR of the daily net returns from the empirical distribution.