
Data Preprocessing

Kernel: Python 3 (Ubuntu Linux)

Notebook Instructions

You can run the notebook sequentially (one cell at a time) by pressing Shift + Enter. While a cell is running, a [*] is displayed to its left. Once the cell has run, a number appears indicating the order in which it was executed in the notebook, e.g. [8].

Enter edit mode by pressing Enter or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.


In this IPython notebook, we will cover some of the useful data preprocessing methods like data cleaning and data resampling.

1. Data Cleaning

When you are working with raw data, instances of duplicate, missing, or inconsistent data can occur. If the data is not cleaned, the trading strategy can produce misleading performance results during the backtest, or generate incorrect buy/sell signals while trading.

Dealing with Duplicate Data

Duplicate data means the same data values being repeated for certain timestamps. In Python, we can detect the occurrence of duplicate timestamps using the duplicated method and remove them using the drop_duplicates method.

# Read the simulated 1-minute OHLC data using the pandas read_csv method.
# We specify the 'DATE' column as the index using the index_col argument.
import pandas as pd

df = pd.read_csv("Simulated_Duplicate Values.csv", index_col='DATE',
                 usecols=['DATE', 'OPEN', 'HIGH', 'LOW', 'CLOSE'],
                 parse_dates=True)
df
# Use the duplicated method to highlight the repeated timestamps
df1 = df[df.duplicated(keep=False)]
df1
# Use the drop_duplicates method to remove the duplicate timestamps.
# In this example, it will remove the duplicate values for 09:16:00 and 09:20:00
df.drop_duplicates(inplace=True)
df
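Note that duplicated and drop_duplicates compare entire rows, so they only remove timestamps whose OHLC values also repeat. When a timestamp repeats with different values, you can inspect the index directly. A minimal sketch on synthetic data (the timestamps and prices below are made up for illustration):

```python
import pandas as pd

# Synthetic 1-minute series with a repeated 09:16:00 timestamp
idx = pd.to_datetime([
    "2018-07-23 09:15:00",
    "2018-07-23 09:16:00",
    "2018-07-23 09:16:00",  # duplicate timestamp, different price
    "2018-07-23 09:17:00",
])
df = pd.DataFrame({"CLOSE": [100.0, 101.0, 101.5, 102.0]}, index=idx)

# Boolean mask marking repeated index entries (first occurrence kept)
dup_mask = df.index.duplicated(keep="first")

# Keep only the first row seen for each timestamp
df_clean = df[~dup_mask]
```

Which row to keep (first, last, or an aggregate) is a judgment call that depends on how the duplicates arose in your data feed.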

Dealing with Missing Data

Missing data is another data quality issue; it can be hard to detect, especially on shorter time frames such as tick-by-tick data. On time frames like minute or daily data, one can build a check into the code to detect missing data.

# Read the simulated 1-minute OHLC data using the pandas read_csv method.
import pandas as pd

df = pd.read_csv("Simulated_Missing Values.csv", index_col='DATE',
                 usecols=['DATE', 'OPEN', 'HIGH', 'LOW', 'CLOSE'],
                 parse_dates=True)
df

In Python, we can use the reindex method to detect missing data at a given frequency. The resulting NaN values can then be filled with the previous values (forward fill) or by any other suitable method.

df.reindex(pd.date_range(start=df.index[0], end=df.index[-1], freq='1min'))
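The reindex-then-fill step can be sketched end to end on synthetic data (the timestamps and prices below are made up; forward fill is one common choice, not the only one):

```python
import pandas as pd

# Synthetic series where the 09:16:00 bar is missing
idx = pd.to_datetime([
    "2018-07-23 09:15:00",
    "2018-07-23 09:17:00",
    "2018-07-23 09:18:00",
])
df = pd.DataFrame({"CLOSE": [100.0, 102.0, 103.0]}, index=idx)

# Reindex to a complete 1-minute grid; the missing bar appears as NaN
full = df.reindex(pd.date_range(start=df.index[0], end=df.index[-1],
                                freq="1min"))

# Fill the gap with the previous observation (forward fill)
filled = full.ffill()
```

Forward fill is reasonable for prices, since the last traded price is the best available estimate until a new bar arrives; interpolation would instead use future information.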

Dealing with Inconsistent Data

Inconsistent data can occur in different forms, for example, a spike in the price series or volume. Checks and balances in the code can help detect such data inconsistency.

# Read the simulated 1-minute OHLC data using the pandas read_csv method.
import pandas as pd

df = pd.read_csv("Simulated_Inconsistent Data.csv", index_col='DATE',
                 usecols=['DATE', 'OPEN', 'HIGH', 'LOW', 'CLOSE'],
                 parse_dates=True)
df

One can observe in the table above that there is a spike in the close price at the 09:18:00 timestamp. To detect such spikes, we can apply the diff() method to the close price of the dataframe and compare the result against a threshold. In the example below, we use a threshold value of 25 to classify a change as a spike.

Note that 25 is too low a value to use on real data; it is chosen here only so that the spike in this small simulated series is flagged.

df['CLOSE'].diff() > 25
DATE
2018-07-23 09:15:00    False
2018-07-23 09:16:00    False
2018-07-23 09:17:00    False
2018-07-23 09:18:00     True
2018-07-23 09:20:00    False
2018-07-23 09:21:00    False
2018-07-23 09:23:00    False
2018-07-23 09:24:00    False
2018-07-23 09:25:00    False
Name: CLOSE, dtype: bool
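One caveat: diff() > 25 flags only upward jumps. A downward spike, and the reversal back to normal after any spike, can be caught by taking the absolute difference instead. A minimal sketch on synthetic prices (values made up; the threshold of 25 is kept from the example above):

```python
import pandas as pd

# Synthetic close prices with an artificial spike at 09:17:00
close = pd.Series(
    [100.0, 101.0, 150.0, 101.5, 102.0],
    index=pd.date_range("2018-07-23 09:15:00", periods=5, freq="1min"),
)

# Absolute one-step change flags both the jump into and out of the spike
spikes = close.diff().abs() > 25
```

In practice the threshold would be set relative to the instrument's typical bar-to-bar move, for example a multiple of the rolling standard deviation of the differences.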

2. Data Resampling

The pandas resample function can be used to convert a given time series into a desired time frame, such as minute, hourly, daily, or weekly. In this example, we will illustrate how to convert a 1-minute time series into a 3-minute time series.

# Read the simulated 1-minute OHLC data using the pandas read_csv method.
# We specify the 'DATE' column as the index using the index_col argument.
import pandas as pd

df1 = pd.read_csv("1 min - MARUTI.csv", index_col='DATE',
                  usecols=['DATE', 'OPEN', 'HIGH', 'LOW', 'CLOSE'],
                  parse_dates=True)
df1.head(8)

We use the resample function, which takes two arguments here, to create a resampled series. The first argument, '3Min', specifies the new time period, and the second argument, label, determines which bin edge is used to label each bucket.

Since we are dealing with OHLC data, we want the resampled series to preserve the meaning of OHLC. This means that the 'OPEN' price in the resampled series needs to correspond to the open price of the first of the three 1-minute bars used for resampling. The 'HIGH' needs to be the highest price of the three 1-minute bars, and similarly the 'LOW' needs to be the lowest price of the three. The 'CLOSE' price in the resampled series needs to correspond to the close price of the last of the three 1-minute bars.

This is achieved by using the aggregate method and specifying the requirements explained above to the aggregate method in the form of a Python dictionary.

from IPython.display import display_html

df2 = df1.resample('3Min', label='right').agg(
    {'OPEN': 'first', 'HIGH': 'max', 'LOW': 'min', 'CLOSE': 'last'})
df2.head(8)

df1_styler = df1.style.set_table_attributes("style='display:inline'").set_caption('Table 1')
df2_styler = df2.style.set_table_attributes("style='display:inline'").set_caption('Table 2')
display_html(df1_styler._repr_html_() + df2_styler._repr_html_(), raw=True)

## Table 1 shows the original 1-minute price series and Table 2 on the right
## shows the 3-minute resampled series.
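Since the CSV above may not be available, the same OHLC aggregation can be checked on a small synthetic frame (all values below are made up; with label='right', each 3-minute bar is stamped with its end time):

```python
import pandas as pd

# Six synthetic 1-minute OHLC bars
idx = pd.date_range("2018-07-23 09:15:00", periods=6, freq="1min")
df = pd.DataFrame({
    "OPEN":  [100, 101, 102, 103, 104, 105],
    "HIGH":  [101, 103, 102, 104, 106, 105],
    "LOW":   [ 99, 100, 101, 102, 103, 104],
    "CLOSE": [101, 102, 101, 104, 105, 104],
}, index=idx, dtype=float)

# Each 3-minute bar takes the first OPEN, highest HIGH,
# lowest LOW, and last CLOSE of its three 1-minute bars
df3 = df.resample("3min", label="right").agg(
    {"OPEN": "first", "HIGH": "max", "LOW": "min", "CLOSE": "last"})
```

Here the first 3-minute bar covers 09:15:00 to 09:17:00 and is labelled 09:18:00; its HIGH is 103 (the maximum of 101, 103, 102) and its LOW is 99, which is easy to verify by hand.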

These were some of the important methods used in data preprocessing. In the next unit, you can attempt the MCQs and try the interactive exercises.