
Pandas and first applications

Camilo A. Garcia Trillos - 2020

Objective

Our objective in this part of the course is to develop a general framework to apply the theory we have studied in practice.

We discuss several important elements to take into account in most applications.

This discussion is then illustrated by means of some simple applications using actual market data and Python routines.

Goal, Method, Model and Data

  • The first question is always: what do we want to study or understand? That is, we start by fixing our goal.

  • To answer this question, we choose a methodology (that is, a theoretical framework like the ones we presented in class).

  • The methodology is usually accompanied by a model, a representation of the system we want to study.

  • This model is fitted using relevant data.

Our discussion today will be focused on the data.

Data

  • Getting data

    • Which data? Relation between data, goal and model

    • How? Build, buy and/or gather

      • Public: https://data.gov.uk/

      • Paid, research-oriented: CRSP, Compustat, Wharton Research

      • Financial data providers: Bloomberg, Reuters (Refinitiv), Quandl, RavenPack

      • Free (limited): yahoo.finance

    • Know your data: tickers, meaning of different variables

What to do with data?

  • Understand data: Explore and visualise the data.

  • Clean data

    • Changes in time

    • Gaps

    • Frequencies

    • Outliers

    • Correct merging or manipulation errors

  • Make calculations

  • Understand and test results.

Using Data

Here, we are going to learn some simple routines for working with data within Python, using the pandas package.

Students should have a look at things like SQL.

import numpy as np
import matplotlib.pylab as plt
import pandas as pd
%matplotlib inline

Dataframes

Importing a database

I downloaded some data from Quandl (www.quandl.com) some time ago1.

The data is stored in a CSV (comma separated values) file. This is a popular way of saving in plain text column structured data. We will look at 5 years of stock prices for the Apple share from 2012 to 2017.

The following command reads a CSV file and creates a dataframe in pandas.


1 Before, it was possible to use a Google finance API to download financial information. It seems it is temporarily closed.

AAPL = pd.read_csv('~/Data/AAPL_20171201_5y.csv') # read the csv and create a dataframe called AAPL which is in the folder DATA
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/Data/AAPL_20171201_5y.csv'
type(AAPL)
NameError: name 'AAPL' is not defined
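The CSV file is not present in this environment, which is why the two cells above fail. If you want to reproduce the analysis yourself, a similar (though not identical) dataset can be retrieved from Yahoo Finance; the snippet below is only a sketch, assuming the third-party yfinance package is installed, and its column names (Open, High, Low, Close, Adj Close, Volume) differ from those of the Quandl export used in the rest of this notebook.

import yfinance as yf  # third-party package, assumed installed: pip install yfinance
AAPL_yf = yf.download('AAPL', start='2012-01-01', end='2017-12-02')  # daily bars over roughly the same window
AAPL_yf.head()  # columns: Open, High, Low, Close, Adj Close, Volume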

A dataframe is the main type of structure provided by Pandas: it implements the notion of database.

Let us look at some of the methods associated with a database

Describe: presents some general statistics on numerical data in the database.

AAPL.describe() # This command shows some general statistics on the data

The methods head and tail show, respectively, the first and last entries of the database (5 of them, if no argument is given).

AAPL.head() # Show the first elements
AAPL.tail() # Show the last elements

These commands give us an idea of what is in the database: it is composed of 1488 rows with 14 different named fields, plus one unnamed column with consecutive numbers. This column is an index, and we will come back to it later.

Note that we have a mix of different types of entries. Indeed, let us look at the types of each one of the columns.

AAPL.dtypes
ticker          object
date            object
open           float64
high           float64
low            float64
close          float64
volume         float64
ex-dividend    float64
split_ratio    float64
adj_open       float64
adj_high       float64
adj_low        float64
adj_close      float64
adj_volume     float64
dtype: object

The above command shows us a list of all fields in the dataframe and the type of each one of them. Most are float numbers (float64) and two are of 'object' type (more on this in a moment).

Let us discuss the meaning of each one of these entries:

  • ticker: the identifier of the stock to which this information belongs. In this case, Apple Inc., whose ticker is AAPL.

  • date: the date at which the remaining information is collected.

The following data are unadjusted, i.e., they are presented as they were recorded on the given date.

  • open: the price of the stock at the start of the trading day.

  • high: the maximum price the stock achieved during the trading day.

  • low: the minimum price the stock achieved during the trading day.

  • close: the price of the stock at the end of the trading day.

  • volume: the number of shares of this stock exchanged during the trading day.

  • ex-dividend: The ex-dividend date is a financial concept: if you purchase a stock on its ex-dividend date or after, you will not receive the next dividend payment; instead, the seller gets the dividend. If you purchase before the ex-dividend date, and hold the position until the ex-dividend date, you get the dividend. This entry shows the dividend paid on any ex-dividend date, and 0.0 otherwise.

  • split_ratio: shows any split that occurred on the given date, or 1.0 otherwise.

The following data are adjusted. Adjusted data take the information on dividends and splits and modify the historical records to obtain prices as if no dividends had been paid and no splits had occurred. This is convenient for analyses where we work with total values (a quick check of the adjustment appears after the list below).

  • adj_open

  • adj_high

  • adj_low

  • adj_close

  • adj_volume
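To get a feel for what the adjustment does, a quick check (just a sketch, assuming the dataframe AAPL has been loaded as above) is to compare the adjusted and unadjusted close: their ratio is the cumulative adjustment factor, equal to 1 at the end of the sample and smaller further back in time, as later splits and dividends accumulate.

adj_factor = AAPL['adj_close'] / AAPL['close']  # cumulative adjustment factor for each date
print(adj_factor.head())  # well below 1 in 2012, reflecting later splits and dividends
print(adj_factor.tail())  # equal (or very close) to 1 at the end of the sample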

Series

Dataframes are composed of series, which are, in turn, collections of individual (scalar) entries.

We can extract and operate on a series by passing the name of the column to the dataframe.

adjmax = AAPL['adj_high']
adjmin = AAPL['adj_low']
print(adjmax)
print('type of adjmax', type(adjmax))
0 53.011999 1 53.292160 2 53.789509 3 54.329267 4 54.971837 ... 1483 175.080000 1484 174.870000 1485 172.920000 1486 172.140000 1487 171.670000 Name: adj_high, Length: 1488, dtype: float64 type of adjmax <class 'pandas.core.series.Series'>

Above we extracted the series of adjusted low and high. Note the corresponding indices are retrieved as well.

Several operations can then be performed on these series. For example, we can create a new series containing the range of adjusted values for each date:

adjrange = adjmax - adjmin
print(adjrange)
0 0.449799 1 0.693975 2 0.755662 3 0.453654 4 0.822489 ... 1483 1.740000 1484 3.010000 1485 5.760000 1486 3.700000 1487 3.170000 Length: 1488, dtype: float64

We can add the newly created ranges to our database as a new column, as follows:

AAPL['adj_range'] = adjrange
AAPL.head()

This is useful for producing calculated columns.

Checking for data integrity and fixing Indices

Dataframes and series are indexed. The role of the index is to allow us to retrieve a given particular row: it must be a unique identifier of each row of data.

Here are some examples:

print('The row of the dataframe indexed by the number 3: \n\n', AAPL.iloc[3])
print('\n\n ----- \n\n The row of the adjusted range series indexed by the number 3: \n\n', adjrange.iloc[3])
The row of the dataframe indexed by the number 3: ticker AAPL date 2012-01-06 open 419.77 high 422.75 low 419.22 close 422.4 volume 1.13676e+07 ex-dividend 0 split_ratio 1 adj_open 53.9463 adj_high 54.3293 adj_low 53.8756 adj_close 54.2843 adj_volume 7.95732e+07 adj_range 0.453654 Name: 3, dtype: object ----- The row of the adjusted range series indexed by the number 3: 0.45365419990600486

Unfortunately the automatic index that was created is not very informative. We can, instead recall our description of the data and set the date as the index. Before doing so, let us verify that it is, indeed, a date.

print(AAPL['date'][3], type(AAPL['date'][3]))
2012-01-06 <class 'str'>
AAPL['date']
0 2012-01-03 1 2012-01-04 2 2012-01-05 3 2012-01-06 4 2012-01-09 ... 1483 2017-11-27 1484 2017-11-28 1485 2017-11-29 1486 2017-11-30 1487 2017-12-01 Name: date, Length: 1488, dtype: object

We can see that the date is not treated as a date but as a string. The reason I suspected this is that when we looked at the types of each variable, the date was listed as 'object'. It would be convenient to recast it, that is, to transform it into a date.

We can modify any column by assigning another series to it, just as we did to create a new column. In this case, we use a function to turn strings into dates.

AAPL['date']=pd.to_datetime(AAPL['date'])

We check again the types after the transformation

AAPL.dtypes
ticker                 object
date           datetime64[ns]
open                  float64
high                  float64
low                   float64
close                 float64
volume                float64
ex-dividend           float64
split_ratio           float64
adj_open              float64
adj_high              float64
adj_low               float64
adj_close             float64
adj_volume            float64
adj_range             float64
dtype: object
print(AAPL['date'][3], type(AAPL['date'][3]))
2012-01-06 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

Dates are now effectively treated as dates. This is convenient as there are many supported modules that act on dates. Now, as announced before, we can set the index to be the date.

Before setting the date as the index, let us check that there are no repeated date values (of course, we do not expect any).

print(AAPL[AAPL.duplicated('date', False)]) # Here we first produce a vector of TRUE or FALSE if dates are duplicated. Then show the whole entries if duplicated entries appear
Empty DataFrame Columns: [ticker, date, open, high, low, close, volume, ex-dividend, split_ratio, adj_open, adj_high, adj_low, adj_close, adj_volume, adj_range] Index: []

Indeed, there are no duplicated dates. Compare with this code (finding duplicated entries of 'volume'):

print(AAPL[AAPL.duplicated('volume', False)]) # Here we first produce a vector of TRUE or FALSE if VOLUME data are duplicated. Then show the whole entries if duplicated entries appear
ticker date open high low close volume ex-dividend \ 563 AAPL 2014-04-01 537.76 541.87 536.77 541.65 7170000.0 0.0 599 AAPL 2014-05-22 606.60 609.85 604.10 607.27 7170000.0 0.0 split_ratio adj_open adj_high adj_low adj_close adj_volume \ 563 1.0 71.905792 72.455354 71.773415 72.425937 50190000.0 599 1.0 81.564473 82.001473 81.228318 81.654562 50190000.0 adj_range 563 0.681939 599 0.773155

We see that on two dates in the range the exact same amount of non-adjusted volume was recorded. See the help on duplicated for more about its options.

Now, having verified that there are no duplicated dates, we can check that there are no missing data. The database we obtained is of high quality and no missing entries should remain, but we should always check for this.

AAPL.isnull().any() # This looks for missing values in each field of the database (TRUE if missing, FALSE if not), then checks whether there is at least one TRUE per column
ticker         False
date           False
open           False
high           False
low            False
close          False
volume         False
ex-dividend    False
split_ratio    False
adj_open       False
adj_high       False
adj_low        False
adj_close      False
adj_volume     False
adj_range      False
dtype: bool

As shown above, none of the fields has missing values.

We are then ready to set the date field as the index with the following instruction

AAPL.set_index('date', inplace=True) # Set the index on the same database, not on a copy

Mind the inplace argument in the call: without it, the function returns a copy of the dataframe with the modified index and leaves the original unchanged.
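For reference, here is a sketch of the non-inplace equivalent; note that at this point 'date' is already the index, so we reset it first purely for illustration.

AAPL2 = AAPL.reset_index().set_index('date')  # set_index returns a new dataframe; AAPL itself is left unchanged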

Now, we can simply consult the data for a given date. For instance

AAPL.loc['2017-11-17']
ticker AAPL open 171.04 high 171.39 low 169.64 close 170.15 volume 2.16658e+07 ex-dividend 0 split_ratio 1 adj_open 171.04 adj_high 171.39 adj_low 169.64 adj_close 170.15 adj_volume 2.16658e+07 adj_range 1.75 Name: 2017-11-17 00:00:00, dtype: object

Note that this time we did not use 'iloc' but 'loc'. The reason is that 'iloc' is primarily meant for integers (positions), while 'loc' works with labels, like the date we give above.
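To see the difference side by side, here is a small sketch using the first row of the dataframe (its date label, 2012-01-03, appears in the outputs above):

print(AAPL.iloc[0])            # first row, selected by position
print(AAPL.loc['2012-01-03'])  # same row, selected by its date label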

Now, if we want only a subset of the data, we can write

AAPL.loc['2017-11-17', ['volume','close']]
volume 2.16658e+07 close 170.15 Name: 2017-11-17 00:00:00, dtype: object

Sampling a dataframe

One of the advantages of having a date index is that we can resample the data. Indeed, let's say that we are interested in weekly returns, but we only have daily returns as above.

We can then resample the data using the power of pandas. Let us suppose we now take the data of each Friday as representative of the week. We would simply write

AAPL_week = AAPL.resample('W-FRI').last() # Sample the database on a weekly basis finishing on a Friday, and taking the last data of each interval
AAPL_week.head()

This splits the data into chunks of one week, each ending on a Friday, and takes the last value of each chunk as its representative (i.e. the value on the Friday). To verify this, let us compare with the value in the original dataframe on the first Friday:

AAPL.loc['2012-01-06']
ticker AAPL open 419.77 high 422.75 low 419.22 close 422.4 volume 1.13676e+07 ex-dividend 0 split_ratio 1 adj_open 53.9463 adj_high 54.3293 adj_low 53.8756 adj_close 54.2843 adj_volume 7.95732e+07 adj_range 0.453654 Name: 2012-01-06 00:00:00, dtype: object

Of course, some of the data is no longer an accurate summary. We would like to have the opening, high, low and volume of the week (and not the opening of the Friday). We can call the resample function with different aggregations to adjust for this. Look at the following code:

AAPL_week[['open', 'adj_open']] = AAPL[['open', 'adj_open']].resample('W-FRI').first()
AAPL_week[['high', 'adj_high']] = AAPL[['high', 'adj_high']].resample('W-FRI').max()
AAPL_week[['low', 'adj_low']] = AAPL[['low', 'adj_low']].resample('W-FRI').min()
AAPL_week[['volume', 'adj_volume', 'ex-dividend']] = AAPL[['volume', 'adj_volume', 'ex-dividend']].resample('W-FRI').sum()
AAPL_week['split_ratio'] = AAPL['split_ratio'].resample('W-FRI').prod()
AAPL_week.head()
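As a design note, the same weekly aggregation as in the cell above can be written more compactly with a single agg call; this is a sketch assuming the column names used so far.

agg_rules = {
    'ticker': 'last',
    'open': 'first', 'adj_open': 'first',
    'high': 'max', 'adj_high': 'max',
    'low': 'min', 'adj_low': 'min',
    'close': 'last', 'adj_close': 'last',
    'volume': 'sum', 'adj_volume': 'sum',
    'ex-dividend': 'sum', 'split_ratio': 'prod',
    'adj_range': 'last',
}
AAPL_week_alt = AAPL.resample('W-FRI').agg(agg_rules)  # one pass over the data, same values as above
AAPL_week_alt.head()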

Warning: Resampling should be used carefully, as we would need to account individually for splits and dividends, which might distort the available information. The above example is mainly illustrative, and we need to be careful when working with non-adjusted resampled data.

Visualising data

Plotting a price series

Let's next plot the daily closing prices, and then the weekly returns.

AAPL['close'].plot(title='Close price')
plt.ylabel('Close price per share (USD)')
Image in a Jupyter notebook: daily close price per share (USD)

We see a huge drop in the closing value of Apple shares around mid-2014... it seems like a very bad day.

Actually, the story is different. Let us look at the splits and dividends paid on this stock

AAPL[['ex-dividend','split_ratio']].plot()
Image in a Jupyter notebook: ex-dividend and split_ratio over time

We can see that the drop in price is due to a split_ratio of 7 (i.e. each share was exchanged for 7 new shares on that day). This, of course, divided the value of each new share by 7.
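We can locate the split directly in the data; the following is a quick sketch that filters on the split_ratio column described earlier.

# Show the rows on which a split took place (split_ratio different from 1)
print(AAPL[AAPL['split_ratio'] != 1.0][['close', 'adj_close', 'split_ratio']])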

To compensate for this, and the dividend effect, adjusted series are frequently used when analysing stocks historically. Look at this plot now

AAPL['adj_close'].plot(title='Daily close price (adjusted)')
plt.ylabel('Close price adjusted per share (USD)')
Image in a Jupyter notebook: daily close price per share, adjusted (USD)

This plot shows a very different story indeed!

Warning: In certain cases, adjusted data introduces the problem of when the adjustment was made: sometimes, for example when backtesting investment strategies, these timings might lead to mistakes.

To check, let us plot the weekly equivalent.

AAPL_week['adj_close'].plot(title='Weekly close price (adjusted)')
plt.ylabel('Close price adjusted per share (USD)')
Image in a Jupyter notebook: weekly close price per share, adjusted (USD)

Another useful plot comes from simultaneously plotting the adjusted high and low of each day. This gives us an idea of the intraday volatility.

AAPL[['adj_high','adj_low']].plot(alpha=0.5)
plt.title('High and low for Apple share (adjusted)')
plt.xlabel('date')
plt.ylabel('Adjusted price')
Image in a Jupyter notebook: adjusted high and low for the Apple share

A histogram of log-returns

Let's compute the weekly (gross) log returns. We can plot a series as a function of time as before. However, this time we might instead look at the histogram of past log-returns. We have

$$\log\left( R_{t+1} \right) = \log\left( \frac{S_{t+1}}{S_t}\right) = \log( S_{t+1}) - \log(S_t)$$

A frequent assumption is that the sequence of log-returns is stationary. Note that this is exactly the type of assumption behind the geometric-Brownian motion model under which

$$S_{t+1} = S_t \exp( \sigma Z_{t+1})$$

with $Z_{t+1} \sim \mathcal{N}(0,1)$ and $\sigma$ a constant.

In the following, we calculate the log-returns and check if the process seems to be stationary. If so, we look at its associated histogram.

We compute first the log prices and then find the differences.

logp = np.log(AAPL_week['adj_close']) # Note that we can apply numpy functions directly to series or dataframes
logr = logp.diff() # This calculates the differences between consecutive entries
logr.head()
date
2012-01-06         NaN
2012-01-13   -0.006151
2012-01-20    0.001167
2012-01-27    0.062216
2012-02-03    0.027346
Freq: W-FRI, Name: adj_close, dtype: float64

It worked well... except for the first entry (clearly we cannot calculate a return for the first observation). We just drop this line as follows:

logr.drop(logr.index[0], inplace=True) # Find the first index and drop the row that has it. Note this is another instance where inplace is needed
logr.head()
date
2012-01-13   -0.006151
2012-01-20    0.001167
2012-01-27    0.062216
2012-02-03    0.027346
2012-02-10    0.070830
Freq: W-FRI, Name: adj_close, dtype: float64

Let us look at a time plot of the log returns

logr.plot(style='.')
plt.title('Weekly log returns - Apple share')
plt.ylabel('One-week log-returns')
Image in a Jupyter notebook: weekly log returns for the Apple share

The plot does not seem to show any trend in the mean: points seem to oscillate around a stable value. There seems to be a very slight change in volatility (points seem to be more clustered towards the right of the plot).

We can again use the resampling functions with different summarising measures, for example to check whether the volatility is stable over time.

sd_diff = logr.resample('6M').std()
sd_diff.plot(title='Moving standard deviation of log-returns. Window: 6 months.', style='.') # Note that we chain several operations here: first we resample by 6 months, then we summarise with the standard deviation, and then we plot
Image in a Jupyter notebook: moving standard deviation of log-returns, 6-month window

We observe some oscillation in the realised 6-month volatility. Note that even if the true volatility were constant this might be expected, as a 6-month sample means estimating the realised volatility from around 6*4 = 24 weekly observations, which is still relatively few.
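As a variation (again just a sketch), one can use a genuinely rolling window instead of non-overlapping 6-month blocks; a 26-week window plays roughly the role of the 6 months above.

# Rolling realised volatility of the weekly log-returns over a ~6-month (26-week) window
logr.rolling(window=26).std().plot(title='Rolling 26-week standard deviation of log-returns')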

A statistical test could be performed to check if there is a first order dependence between two successive values.

Augmented Dickey-Fuller test: This statistical test is based on fitting an autoregressive model to the data as follows:

$$\Delta y_{t+1} = \alpha + \beta t + \gamma y_{t} + \sum_{i=0}^p \delta_i \Delta y_{t-i} + \epsilon_t$$

where $\Delta y_s = y_s - y_{s-1}$.

If the process is stationary (up to a trend), we expect a tendency to return to the mean: we then expect $\gamma<0$ in this case. The ADF test has $\gamma=0$ as its null hypothesis and tests it against the alternative $\gamma<0$.

from statsmodels.tsa.stattools import adfuller
adfuller(logr)[1] # We reject the presence of a unit root if this value is very small (say less than 0.01)
5.670416651238719e-05

Since we reject the null hypothesis, we conclude that $\gamma<0$. All in all, we do not have enough evidence to reject stationarity.

Can we fit a known distribution?

With the stationarity assumption, we can now focus on understanding whether the data fits a known distribution. Let us look at the histogram of log-returns.

logr.hist(density=True, bins=35)
plt.title('Histogram of weekly log-returns')
Image in a Jupyter notebook: histogram of weekly log-returns

It does not look very 'Gaussian'. For good measure, we also use a Q-Q plot. It can be found in the stats submodule of SciPy.

from scipy.stats import probplot
probplot(logr, plot=plt); # The semicolon (;) suppresses the printed output. plot=plt asks this function to render the result with matplotlib
Image in a Jupyter notebook

The probplot function produces a Q-Q plot of the empirical weekly data against the best Gaussian fit (obtained by maximum likelihood). The result shows that the weekly returns do not look too far from Gaussian, and for many applications they can be assumed to be so.

Note, however, that the Gaussian distribution largely underestimates negative returns (i.e. losses) that are larger than 2 standard deviations.

Indeed, as in the course, the quantiles of the approximation can be found by taking $E[X] + \mathrm{sd}(X)\,\Phi^{-1}(\alpha)$, where $\alpha$ is the desired quantile level. We get

mean_logr = logr.mean()
sd_logr = logr.std()
from scipy.stats import norm
print('The lowest 1% quantile of returns: ', logr.quantile(0.01), '. The lowest 1% quantile using a Gaussian fit ', mean_logr + sd_logr*norm.ppf(0.01))
The lowest 1% quantile of returns: -0.09302811225797855 . The lowest 1% quantile using a Gaussian fit -0.07892222162000649

Clearly we underestimate possible losses. This is inconvenient for VaR computations.
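One common way to capture the heavier left tail, sketched here purely as an illustration and not as part of the material above, is to fit a Student-t distribution by maximum likelihood and compare its 1% quantile with the empirical one.

from scipy.stats import t
df_t, loc_t, scale_t = t.fit(logr)  # maximum-likelihood fit of a Student-t to the weekly log-returns
print('Fitted degrees of freedom:', df_t)
print('1% quantile, Student-t fit:', t.ppf(0.01, df_t, loc=loc_t, scale=scale_t))
print('1% quantile, empirical:    ', logr.quantile(0.01))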

Exercises

  1. Using the information on the Apple stock in this notebook, calculate a time series with the mean and standard deviation of daily returns within a month.

  2. Assume that an investor buys 1000 USD worth of Apple shares on 3 January 2012.

a) What would be the total value of the investment (including dividends) on December 1 2017?

b) Make a histogram with the daily net returns of this investor, and check for normality.

c) Find the 90% VaR of the daily net returns from the empirical distribution.