Notebook Instructions
You can run the notebook document sequentially (one cell at a time) by pressing Shift + Enter. While a cell is running, [*] is displayed on its left. Once the cell has run, a number replaces the asterisk to indicate the order in which it was run in the notebook, for example [8].
Enter edit mode by pressing Enter or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.
Data Preprocessing
In this IPython notebook, we will cover some useful data preprocessing methods, such as data cleaning and data resampling.
1. Data Cleaning
When you work with raw data, you may encounter duplicate, missing, or inconsistent data. If the data is not cleaned, the trading strategy can show misleading performance results during the backtest, or it can give incorrect buy/sell signals while trading.
Dealing with Duplicate Data
Duplicate data means that the same data values are repeated for certain timestamps. In Python, using pandas, we can detect duplicate timestamps with the duplicated method and remove them with the drop_duplicates method.
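A minimal sketch of the two calls, assuming the timestamps are held in a column named timestamp. The column names and the sample prices are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical 1-minute close prices; the 09:16 row appears twice.
data = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 09:15:00",
        "2019-01-01 09:16:00",
        "2019-01-01 09:16:00",
        "2019-01-01 09:17:00",
    ]),
    "close": [100.0, 101.0, 101.0, 102.0],
})

# Mark every row whose timestamp has already appeared in an earlier row.
duplicate_mask = data.duplicated(subset="timestamp", keep="first")
num_duplicates = int(duplicate_mask.sum())
print(data[duplicate_mask])

# Keep the first row for each timestamp, then use the timestamp as the index.
cleaned = data.drop_duplicates(subset="timestamp", keep="first").set_index("timestamp")
print(cleaned)
```

Passing subset="timestamp" makes both methods compare timestamps only. Without it, a row counts as a duplicate only when every column matches.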
Dealing with Missing Data
Missing data is another data quality issue. It can be hard to detect, especially on shorter time frames such as tick-by-tick data. On time frames such as minute or daily data, one can build a check into the code to detect missing data.
In Python, we can use the pandas reindex method to detect missing data for a specific frequency: reindexing against the complete set of expected timestamps leaves NaN values in the rows that were absent. The NaN values can then be filled using the previous values or by any other method.
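The sketch below illustrates this check on 1-minute data. The sample prices, the close column name, and the choice of forward fill are assumptions for illustration; any other fill method could be substituted.

```python
import pandas as pd

# Hypothetical 1-minute close prices; the 09:17 bar is missing.
index = pd.to_datetime([
    "2019-01-01 09:15:00",
    "2019-01-01 09:16:00",
    "2019-01-01 09:18:00",
    "2019-01-01 09:19:00",
])
data = pd.DataFrame({"close": [100.0, 101.0, 103.0, 104.0]}, index=index)

# Build the complete set of timestamps expected at a 1-minute frequency.
full_index = pd.date_range(start=data.index.min(), end=data.index.max(), freq="1min")

# Reindexing inserts a NaN row for every expected timestamp that was absent.
reindexed = data.reindex(full_index)
missing_timestamps = reindexed.index[reindexed["close"].isna()]
print(missing_timestamps)

# Fill each gap with the previous available value (forward fill).
filled = reindexed.ffill()
print(filled)
```

For real market data, the expected index would also need to exclude periods when the market is closed, which this sketch does not handle.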
Dealing with Inconsistent Data
Inconsistent data can occur in different forms, for example, a spike in the price series or volume. Checks and balances in the code can help detect such data inconsistency.
One can observe in the above table that there is a spike in the close price at the 09:18:00 timestamp. To detect such spikes, we can apply the diff() method to the close price of the dataframe and compare the result against a threshold. In the example below, we have specified a threshold value of 25 to classify a price change as a spike.
A threshold of 25 is too low a value to use here.
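A minimal sketch of the diff-based check, with the threshold kept as a variable so it can be tuned. The prices are hypothetical; only the 09:18:00 spike and the threshold of 25 come from the text above.

```python
import pandas as pd

# Hypothetical 1-minute close prices with an artificial spike at 09:18.
index = pd.date_range("2019-01-01 09:15:00", periods=6, freq="1min")
data = pd.DataFrame(
    {"close": [100.0, 100.5, 101.0, 160.0, 101.5, 102.0]}, index=index
)

threshold = 25  # value from the text; tune it to the instrument's price scale

# Bar-to-bar change in the close price (the first value is NaN).
change = data["close"].diff()

# Flag bars whose absolute change from the previous bar exceeds the threshold.
spike_mask = change.abs() > threshold
print(data[spike_mask])
```

Note that a plain diff flags two bars for one bad print: the jump into the spike (09:18) and the move back out of it (09:19). The right threshold depends on the price level and volatility of the instrument, so a fixed value such as 25 should be treated as a starting point.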
2. Data Resampling
The pandas 'resample' function can be used to convert a given time series into a desired time frame such as minutely, hourly, daily, or weekly. In this example, we will illustrate how to convert a 1-minute time series into a 3-minute time series.
We use the resample function, which here takes two arguments to create a resampled series. The first argument, '3Min', specifies the time period to resample to, and the second argument, 'label', determines which bin edge (left or right) is used to label each bucket.
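To make the effect of the label argument concrete, the sketch below resamples a hypothetical 1-minute close series twice, once per bin edge. The sample values are assumptions, and the frequency is written as "3min" here.

```python
import pandas as pd

# Hypothetical 1-minute close prices from 09:15 to 09:20.
index = pd.date_range("2019-01-01 09:15:00", periods=6, freq="1min")
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0, 105.0], index=index)

# Each 3-minute bucket is labelled with its left (starting) edge.
left_labelled = close.resample("3min", label="left").last()

# The same buckets, labelled with their right (ending) edge instead.
right_labelled = close.resample("3min", label="right").last()

print(left_labelled)
print(right_labelled)
```

Both results contain the same values. Only the timestamps attached to the buckets differ: 09:15 and 09:18 with label="left", 09:18 and 09:21 with label="right".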
Since we are dealing with OHLC data, we would want to have the same format and meaning of OHLC in the resampled series. This means that the 'OPEN' price in the resampled series needs to correspond to the open price of the first 1-minute bar out of the three 1-minute bars used for resampling. The 'HIGH' in the resampled series needs to be equal to the highest price of the three 1-minute bars. Similarly, the 'LOW' in the resampled series needs to be equal to the lowest price of the three 1-minute bars. The 'CLOSE' price in the resampled series needs to correspond to the close price of the last 1-minute bar of the three 1-minute bars used for resampling.
This is achieved by passing the requirements explained above to the aggregate method in the form of a Python dictionary.
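The OHLC rules described above can be sketched as a dictionary that maps each column to an aggregation. The lowercase column names and the sample prices are assumptions for illustration, and agg is used as the short form of aggregate.

```python
import pandas as pd

# Hypothetical 1-minute OHLC bars from 09:15 to 09:20.
index = pd.date_range("2019-01-01 09:15:00", periods=6, freq="1min")
data = pd.DataFrame(
    {
        "open": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0],
        "high": [101.5, 102.5, 103.5, 104.5, 105.5, 106.5],
        "low": [99.5, 100.5, 101.5, 102.5, 103.5, 104.5],
        "close": [101.0, 102.0, 103.0, 104.0, 105.0, 106.0],
    },
    index=index,
)

# One aggregation rule per column, matching the meaning of OHLC.
ohlc_rules = {
    "open": "first",  # open of the first 1-minute bar in the bucket
    "high": "max",    # highest high across the bucket
    "low": "min",     # lowest low across the bucket
    "close": "last",  # close of the last 1-minute bar in the bucket
}

resampled = data.resample("3min", label="left").agg(ohlc_rules)
print(resampled)
```

If the data also carried a volume column, a "sum" rule would be the natural counterpart, although that case is not covered in this notebook.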