Pandas and first applications
Camilo A. Garcia Trillos - 2020
Objective
Our objective in this part of the course is to develop a general framework to apply in practice the theory we have studied.
We discuss several important elements to take into account in most applications.
This discussion is then illustrated by means of some simple applications using actual market data and Python routines.
Goal, Method, Model and Data
The first question is always: what do we want to study or understand? That is, we start by fixing our goal.
To answer this question, we choose a methodology (that is, a theoretical framework like the ones we presented in class).
The methodology is usually accompanied by a model, a representation of the system we want to study.
This model is fitted using relevant data.
Our discussion today will be focused on the data.
Data
Getting data
Which data? Relation between data, goal and model
How? Build, buy and/or gather
Public: https://data.gov.uk/
Paid, research-oriented: CRSP, Compustat, Wharton Research
Financial data providers: Bloomberg, Reuters (Refinitiv), Quandl, RavenPack
Free (limited): yahoo.finance
Know your data: tickers, meaning of different variables
What to do with data?
Understand data: Explore and visualise the data.
Clean data
Changes in time
Gaps
Frequencies
Outliers
Correct merging or manipulation errors
Make calculations
Understand and test results.
Using Data
Here, we are going to learn some simple routines to work with data within Python, using the pandas package.
Students should also have a look at tools like SQL.
Dataframes
Importing a database
I downloaded some data from Quandl (www.quandl.com) some time ago [1].
The data is stored in a CSV (comma-separated values) file. This is a popular plain-text format for column-structured data. We will look at five years of stock prices for the Apple share, from 2012 to 2017.
The following command reads a CSV file and creates a dataframe in pandas.
[1] Before, it was possible to use a Google Finance API to download financial information. It seems it is temporarily closed.
```python
import pandas as pd

# read the CSV file (which is in the Data folder) and create a dataframe called AAPL
AAPL = pd.read_csv('~/Data/AAPL_20171201_5y.csv')
```
```python
type(AAPL)  # pandas.core.frame.DataFrame
```
A dataframe is the main type of structure provided by pandas: it implements the notion of a database table.
Let us look at some of the methods associated with a dataframe.
The method describe presents some general statistics on the numerical data in the database.
The methods head and tail show respectively the first (5, if no argument given) and last entries of the database.
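For instance (in the notebook, each of these would go in its own cell):

```python
AAPL.describe()  # summary statistics of the numerical columns
AAPL.head()      # first five rows
AAPL.tail()      # last five rows
```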
These commands give us an idea of what is in the database: it is composed of many rows (1487) with 14 different named fields, plus one column with no name and consecutive numbers. This column is an index, and we will come back to it later.
Note that we have a mix of different types of entries. Indeed, let us look at the types of each one of the columns.
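For instance:

```python
AAPL.dtypes  # the type of each column
```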
The above command shows us a list of all fields in the dataframe and the type of each one of them. Most are floating-point numbers (float64), and two are of 'object' type (more on this in a moment).
Let us discuss the meaning of each one of these entries:
ticker: the identifier of the stock to which this information belongs. In this case, Apple Inc., whose ticker is AAPL.
date: the date at which the remaining information is collected.
The following data are unadjusted, i.e., they are presented as they occurred on the given date.
open: the price of the stock at the start of the trading day.
high: the maximum price the stock achieved during the trading day.
low: the minimum price the stock reached during the trading day.
close: the price of the stock at the end of the trading day.
volume: the number of shares of this stock exchanged during the trading day.
ex-dividend: the ex-dividend date is a financial concept: if you purchase a stock on its ex-dividend date or after, you will not receive the next dividend payment; instead, the seller gets the dividend. If you purchase before the ex-dividend date and hold the position until that date, you get the dividend. This entry shows the dividend paid on any ex-dividend date, and 0.0 otherwise.
split_ratio: shows any split that occurred on the given date, and 1.0 otherwise.
The following data are adjusted. Adjusted data take the information on dividends and splits and modify the historical records to obtain prices as if no dividends were paid and no splits occurred. This is convenient for analyses where we work with total values.
adj_open
adj_high
adj_low
adj_close
adj_volume
Series
Dataframes are composed of series, which are, in turn, collections of individual (or scalar) entries.
We can extract and operate on series by passing the name of the entry to the database.
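For instance, something like:

```python
AAPL['adj_low']   # the series of adjusted daily lows
AAPL['adj_high']  # the series of adjusted daily highs
```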
Above we extracted the series of adjusted low and high. Note that the corresponding indices are retrieved as well.
Several operations can then be performed on these series. For example, we can create a new series containing the range of adjusted values for each date:
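```python
AAPL['adj_high'] - AAPL['adj_low']  # range of adjusted values for each date
```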
We can add the newly created ranges to our database as a new column, as follows:
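```python
# the column name 'adj_range' is a choice made here for illustration
AAPL['adj_range'] = AAPL['adj_high'] - AAPL['adj_low']
```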
This is useful for producing calculated columns.
Checking for data integrity and fixing Indices
Dataframes and series are indexed. The role of the index is to allow us to retrieve a particular row: it must be a unique identifier of each row of data.
Here are some examples:
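```python
AAPL.iloc[0]   # retrieve the first row by its position
AAPL.iloc[10]  # retrieve the eleventh row
```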
Unfortunately, the automatic index that was created is not very informative. We can, instead, recall our description of the data and set the date as the index. Before doing so, let us verify that it is, indeed, a date.
We can see that the date is not treated as a date but as a string. The reason I suspected this is that when we looked at the types of each variable, the date was listed as 'object'. It would be convenient to recast it, that is, to transform it into a date.
We can modify any column by assigning the result of another series, just as we did to create a new column. In this case, we use a function to turn strings into dates.
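A sketch using pandas' to_datetime:

```python
AAPL['date'] = pd.to_datetime(AAPL['date'])  # recast the strings as datetimes
```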
We check again the types after the transformation
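```python
AAPL.dtypes  # 'date' should now appear as datetime64[ns]
```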
Dates are now effectively treated as dates. This is convenient as there are many supported modules that act on dates. Now, as announced before, we can set the index to be the date.
Before setting the date as the index, let us check that there are no repeated date values (of course, we do not expect any).
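For instance, with the duplicated method:

```python
AAPL['date'].duplicated().any()  # True if some date appears more than once
```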
Indeed, there are no duplicated dates. Compare with this code, finding duplicated entries of 'volume':
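```python
AAPL[AAPL.duplicated(subset='volume', keep=False)]  # rows sharing the same unadjusted volume
```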
We see that on two dates in the range we got the exact same amount of unadjusted volume. See the help on duplicated for more about its options.
Now, having verified that there are no duplicates in the date, we can check that there are no missing data. The database we obtained is of high quality and no missing entries should remain, but we should always check for this.
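For instance:

```python
AAPL.isna().sum()  # count of missing entries in each column
```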
As shown above, none of the fields has missing values.
We are then ready to set the date field as the index with the following instruction
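```python
AAPL.set_index('date', inplace=True)
```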
Mind the inplace part of the call: otherwise, the function returns a copy of the dataframe with the modified index.
Now, we can simply consult the data for a given date. For instance
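```python
# any trading day in the sample works; this date is illustrative
AAPL.loc['2014-06-06']
```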
Note that this time we did not use iloc but loc. The reason is that iloc is meant primarily for integers (positions), while loc works with labels, like the date we gave above.
Now, if we want only a subset of the data, we can write
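```python
# an illustrative date range and pair of columns
AAPL.loc['2014-06-02':'2014-06-13', ['open', 'close']]
```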
Sampling a dataframe
One of the advantages of having a date index is that we can resample the data. Indeed, let's say that we are interested in weekly returns, but we only have daily data, as above.
We can then resample the data using the power of pandas. Let us suppose we take the data of each Friday as representative of the week. We would simply write
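```python
AAPL_weekly = AAPL.resample('W-FRI').last()  # weekly bins ending on Fridays, keeping the last value
```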
This takes the data in bins of one week ending on a Friday, and takes the last value as representative (i.e. the value on the Friday). To verify this, let us compare with the value in the original dataframe on the first Friday.
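(Assuming, as in the exercises below, that the sample starts on 3 January 2012, the first Friday is 2012-01-06.)

```python
AAPL_weekly.iloc[0]     # first weekly observation
AAPL.loc['2012-01-06']  # the first Friday in the original data
```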
Of course, some of the data is no longer an accurate summary. We would like to have the opening, high, low and volume of the week (and not the opening of the Friday). We can call different versions of the resample function to adjust for this. Look at the following code
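```python
AAPL_weekly = AAPL.resample('W-FRI').agg({
    'open':      'first',  # opening price of the week
    'high':      'max',    # weekly high
    'low':       'min',    # weekly low
    'close':     'last',   # closing price of the week
    'volume':    'sum',    # total volume traded over the week
    'adj_close': 'last',   # kept for the plots below
})
```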
Warning: resampling should be used carefully, as we would need to account individually for splits and dividends, which might distort the available information. The above example is mainly illustrative, and we need to be careful when working with non-adjusted resampled data.
Visualising data
Plotting a price series
Let's next plot the daily closing prices, and then the weekly returns.
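A minimal sketch, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

AAPL['close'].plot(title='AAPL daily close (unadjusted)')
plt.show()
```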
We see a huge drop in the closing value of Apple shares around mid-2014... it seems like a very bad day.
Actually, the story is different. Let us look at the splits and dividends paid on this stock
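```python
# days on which a split occurred or a dividend was paid
AAPL[(AAPL['split_ratio'] != 1.0) | (AAPL['ex-dividend'] > 0)]
```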
We can see that the hit in price is due to a split_ratio of 7 (i.e. each stock was exchanged for 7 new stocks on that day). This, of course, divided the value of each new stock by 7.
To compensate for this, and the dividend effect, adjusted series are frequently used when analysing stocks historically. Look at this plot now
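```python
AAPL['adj_close'].plot(title='AAPL daily close (adjusted)')
plt.show()
```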
This plot shows a very different story indeed!
Warning: in certain cases, adjusted data introduce the problem of the timing of the adjustment: sometimes, as when backtesting investment strategies, these timings might lead to mistakes.
To check, let us plot the weekly equivalent.
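```python
AAPL_weekly['adj_close'].plot(title='AAPL weekly close (adjusted)')
plt.show()
```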
Another useful plot appears when plotting simultaneously the adjusted high and low of each day. This gives us an idea of the intraday volatility
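```python
AAPL[['adj_high', 'adj_low']].plot(title='Adjusted daily high and low')
plt.show()
```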
A histogram of log-returns
Let's compute the weekly (gross) log returns. We can plot a series as a function of time as before. However, this time we might instead look at the histogram of past log-returns. We have, for each week $t$, the log return $r_t = \log S_t - \log S_{t-1}$, where $S_t$ denotes the (adjusted) price.
A frequent assumption is that the sequence of log-returns is stationary. Note that this is exactly the type of assumption behind the geometric Brownian motion model, under which
$$ S_t = S_0 \exp\left( \left(\mu - \frac{\sigma^2}{2}\right) t + \sigma W_t \right), $$
with $\mu$ and $\sigma$ constant numbers and $W$ a Brownian motion (so that log-returns over disjoint periods of equal length are i.i.d. Gaussian).
In the following, we calculate the log-returns and check if the process seems to be stationary. If so, we look at its associated histogram.
We compute first the log prices and then find the differences.
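```python
import numpy as np

log_prices = np.log(AAPL_weekly['adj_close'])  # weekly log prices
log_returns = log_prices.diff()                # differences give the weekly log returns
log_returns.head()
```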
It worked well... except for the first entry (clearly, we cannot calculate a return for the first observation). We just drop this row as follows
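```python
log_returns = log_returns.dropna()  # remove the initial NaN
```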
Let us look at a time plot of the log returns
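```python
log_returns.plot(title='Weekly log returns')
plt.show()
```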
The plot does not seem to show any trend in mean: points seem to oscillate around a stable value. There seems to be a very slight change in volatility (points seem to be more clustered to the right of the plot).
We can use again the sampling functions with different summarising measures to check for the potential stability of means.
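```python
log_returns.resample('6M').mean()  # 6-month averages of weekly log returns
log_returns.resample('6M').std()   # 6-month realised volatility
```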
We conclude that some oscillation in the realised 6-month volatility is observed. Note that even if the true volatility were constant this might be expected, as a 6-month sample means taking the realised volatility over around 6 × 4 = 24 weeks, which is still a relatively small sample.
A statistical test could be performed to check if there is a first order dependence between two successive values.
Augmented Dickey-Fuller test: this statistical test is based on fitting an autoregressive model to the data as follows:
$$ \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t, $$
where $\varepsilon_t$ is a noise term.
If the process is stationary (up to at most a trend), we expect a tendency to return to the mean: we then expect $\gamma < 0$ in this case. The ADF test has as null hypothesis that $\gamma = 0$ and checks it against the alternative $\gamma < 0$.
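A sketch using the adfuller function from statsmodels:

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *rest = adfuller(log_returns)
print(adf_stat, p_value)  # a small p-value rejects the unit-root null
```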
Since we reject the null hypothesis, we get $\gamma < 0$. All in all, we do not have enough evidence to reject stationarity.
Can we fit a known distribution?
With the stationarity assumption, we can now focus on understanding whether the data fits a known distribution. Let us look at the histogram of log-returns.
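```python
log_returns.hist(bins=50)
plt.show()
```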
It does not look very 'Gaussian'. For good measure, we use also a qqplot. It can be found on the stats submodule of Scipy
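```python
from scipy import stats

stats.probplot(log_returns, dist='norm', plot=plt)  # QQ-plot against a fitted Gaussian
plt.show()
```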
The probplot function produces a qqplot of the empirical weekly data against the best fit of a Gaussian distribution (using maximum likelihood). The result shows that the weekly returns do not look so far from Gaussian, and for many applications they can be assumed so.
Note, however, that the Gaussian distribution largely underestimates negative returns (i.e. losses) that are larger than 2 standard deviations.
Indeed, as in the course, the quantiles of the Gaussian approximation can be found by taking $\hat{\mu} + \hat{\sigma}\,\Phi^{-1}(\alpha)$, where $\Phi^{-1}$ is the standard Gaussian quantile function. We get
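```python
# the level alpha is illustrative
alpha = 0.05
mu_hat, sigma_hat = log_returns.mean(), log_returns.std()
print(mu_hat + sigma_hat * stats.norm.ppf(alpha))  # Gaussian approximation of the quantile
print(log_returns.quantile(alpha))                 # empirical counterpart
```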
Clearly, we underestimate possible losses. This is inconvenient for VaR computations.
Exercises
1. Using the information on the Apple stock in this notebook, calculate a time series with the mean and standard deviation of daily returns within each month.
2. Assume that an investor buys 1000 USD worth of Apple shares on 3 January 2012.
a) What would be the total value of the investment (including dividends) on 1 December 2017?
b) Make a histogram of the daily net returns of this investor, and check for normality.
c) Find the VaR of the daily net returns from the empirical distribution.