from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme

themes = get_themes()
set_nb_theme(themes[3])
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time
import numba
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,numba
Ethen 2019-08-09 12:21:33

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
numba 0.37.0

Rossmann Data Preparation

Individual Data Source

In addition to the data provided by the competition, we will be using external datasets put together by participants in the Kaggle competition. We can download all of them here, and then untar them into the directory that data_dir points to.
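As a rough sketch of that extraction step (the archive name rossmann.tgz below is just a placeholder for whatever file was actually downloaded), Python's tarfile module can be used:

import tarfile

# hypothetical archive name; substitute the file actually downloaded
with tarfile.open('rossmann.tgz', 'r:gz') as archive:
    archive.extractall(path='rossmann')  # extract into the directory data_dir points to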

data_dir = 'rossmann'
print('available files: ', os.listdir(data_dir))

file_names = ['train', 'store', 'store_states', 'state_names', 'googletrend', 'weather', 'test']
path_names = {file_name: os.path.join(data_dir, file_name + '.csv') for file_name in file_names}

df_train = pd.read_csv(path_names['train'], low_memory=False)
df_test = pd.read_csv(path_names['test'], low_memory=False)
print('training data dimension: ', df_train.shape)
print('testing data dimension: ', df_test.shape)
df_train.head()
available files:  ['state_names.csv', 'googletrend.csv', 'test.csv', 'weather.csv', 'train.csv', 'store_states.csv', 'store.csv', 'sample_submission.csv']
training data dimension:  (1017209, 9)
testing data dimension:  (41088, 8)

We turn state holidays into booleans to make them more convenient for modeling.

df_train['StateHoliday'] = df_train['StateHoliday'] != '0'
df_test['StateHoliday'] = df_test['StateHoliday'] != '0'

For the weather and state names data, we perform a join on a state name field and create a single dataframe.

df_weather = pd.read_csv(path_names['weather'])
print('weather data dimension: ', df_weather.shape)
df_weather.head()
weather data dimension: (15840, 24)
df_state_names = pd.read_csv(path_names['state_names'])
print('state names data dimension: ', df_state_names.shape)
df_state_names.head()
state names data dimension: (16, 2)
df_weather = df_weather.rename(columns={'file': 'StateName'})
df_weather = df_weather.merge(df_state_names, on="StateName", how='left')
df_weather.head()

For the Google trend data, we extract the state and date information from the raw dataset and replace all instances of the state name 'NI' with 'HB,NI' to match the usage in the rest of the data.

df_googletrend = pd.read_csv(path_names['googletrend'])
print('google trend data dimension: ', df_googletrend.shape)
df_googletrend.head()
google trend data dimension: (2072, 3)
df_googletrend['Date'] = df_googletrend['week'].str.split(' - ', expand=True)[0]
df_googletrend['State'] = df_googletrend['file'].str.split('_', expand=True)[2]
df_googletrend.loc[df_googletrend['State'] == 'NI', 'State'] = 'HB,NI'
df_googletrend.head()

The following code chunk extracts particular date fields from a complete datetime for the purpose of constructing categorical features.

We should always consider this feature extraction step when working with datetime data. Without expanding the datetime into these additional fields, we can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add these fields to every table that has a date field.

DEFAULT_DT_ATTRIBUTES = [
    'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
    'Is_month_end', 'Is_month_start',
    'Is_quarter_end', 'Is_quarter_start',
    'Is_year_end', 'Is_year_start'
]


def add_datepart(df, colname, drop_original_col=False,
                 dt_attributes=DEFAULT_DT_ATTRIBUTES, add_elapse_col=True):
    """
    Extract various datetime components out of a date column. New columns are
    added to the dataframe in place and the dataframe is returned.

    References
    ----------
    - https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components
    """
    df[colname] = pd.to_datetime(df[colname], infer_datetime_format=True)
    if dt_attributes:
        for attr in dt_attributes:
            df[attr] = getattr(df[colname].dt, attr.lower())

    # number of seconds elapsed since 1970-01-01 00:00:00
    # https://stackoverflow.com/questions/15203623/convert-pandas-datetimeindex-to-unix-time
    if add_elapse_col:
        df['Elapsed'] = df[colname].astype(np.int64) // 10 ** 9

    if drop_original_col:
        df = df.drop(colname, axis=1)
    return df
df_weather.head()
df_weather = add_datepart(
    df_weather, 'Date', dt_attributes=None, add_elapse_col=False)

df_googletrend = add_datepart(
    df_googletrend, 'Date', drop_original_col=True,
    dt_attributes=['Year', 'Week'], add_elapse_col=False)

df_train = add_datepart(df_train, 'Date')
df_test = add_datepart(df_test, 'Date')
print('training data dimension: ', df_train.shape)
df_train.head()
training data dimension: (1017209, 22)

The Google trend data has a special category for the whole of Germany; we'll pull that out so we can use it explicitly.

df_trend_de = df_googletrend.loc[df_googletrend['file'] == 'Rossmann_DE', ['Year', 'Week', 'trend']]
df_trend_de.head()

Merging Various Data Sources

Now we can outer join all of our data into a single dataframe. Recall that in outer joins, every time a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has null values for all right table fields. One way to check that all records are consistent and complete is to check for null values post-join, as we do here.

Aside: why not just do an inner join? If we are assuming that all records are complete and match on the field we desire, an inner join will do the same thing as an outer join. However, in the event we are not sure, an outer join followed by a null check will catch any mismatch. (Comparing the number of rows before and after an inner join is an equivalent approach.)
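As a minimal illustration with made-up toy frames (not the Rossmann data): a left/outer join keeps every left-hand row and leaves NaN where the right table has no match, which the post-join null count exposes, while an inner join would silently drop those rows.

left = pd.DataFrame({'Store': [1, 2, 3]})
right = pd.DataFrame({'Store': [1, 2], 'State': ['HE', 'BY']})

joined = left.merge(right, on='Store', how='left')
print(joined['State'].isnull().sum())                    # 1 -> store 3 has no match
print(len(left.merge(right, on='Store', how='inner')))   # 2 -> the unmatched row is silently dropped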

During the merging process, we'll print out the first few rows of the dataframe and the column names so we can keep track of how the dataframe evolves as we join with a new data source.

df_store = pd.read_csv(path_names['store'])
print('store data dimension: ', df_store.shape)
df_store.head()
store data dimension: (1115, 10)
df_store_states = pd.read_csv(path_names['store_states'])
print('store states data dimension: ', df_store_states.shape)
df_store_states.head()
store states data dimension: (1115, 2)
df_store = df_store.merge(df_store_states, on='Store', how='left')
print('null count: ', len(df_store[df_store['State'].isnull()]))
df_store.head()
null count: 0
df_joined_train = df_train.merge(df_store, on='Store', how='left')
df_joined_test = df_test.merge(df_store, on='Store', how='left')

null_count_train = len(df_joined_train[df_joined_train['StoreType'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['StoreType'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
null count:  0 0
dimension:  (1017209, 32)
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State'], dtype='object')
df_joined_train = df_joined_train.merge(df_weather, on=['State', 'Date'], how='left')
df_joined_test = df_joined_test.merge(df_weather, on=['State', 'Date'], how='left')

null_count_train = len(df_joined_train[df_joined_train['Mean_TemperatureC'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['Mean_TemperatureC'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
null count:  0 0
dimension:  (1017209, 55)
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'StateName', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees'], dtype='object')
df_joined_train = df_joined_train.merge(df_googletrend, on=['State', 'Year', 'Week'], how='left')
df_joined_test = df_joined_test.merge(df_googletrend, on=['State', 'Year', 'Week'], how='left')

null_count_train = len(df_joined_train[df_joined_train['trend'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['trend'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
null count:  0 0
dimension:  (1017209, 58)
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'StateName', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees', 'file', 'week', 'trend'], dtype='object')
df_joined_train = df_joined_train.merge(df_trend_de, on=['Year', 'Week'], suffixes=('', '_DE'), how='left')
df_joined_test = df_joined_test.merge(df_trend_de, on=['Year', 'Week'], suffixes=('', '_DE'), how='left')

null_count_train = len(df_joined_train[df_joined_train['trend_DE'].isnull()])
null_count_test = len(df_joined_test[df_joined_test['trend_DE'].isnull()])
print('null count: ', null_count_train, null_count_test)
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
null count:  0 0
dimension:  (1017209, 59)
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'StateName', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees', 'file', 'week', 'trend', 'trend_DE'], dtype='object')

Final Data

After merging the various data sources to create our master dataframe, we'll perform some additional feature engineering steps, including:

  • Some of the rows contain missing values in some columns, so we impute them here. What value to impute is fairly subjective when we don't know the root cause of why the value is missing, so we won't spend too much time on it. One common strategy for imputing missing categorical features is to pick an arbitrary sentinel value that otherwise doesn't appear in the data, e.g. -1 or -999. Another is to impute the mean or majority value and create an additional binary column indicating whether the value was missing in the first place (see the sketch after this list).

  • Create some duration features from the Competition and Promo columns.
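A minimal sketch of the mean-plus-indicator strategy mentioned above, applied for illustration to the CompetitionDistance column (the CompetitionDistanceIsNull column name is only illustrative; the notebook itself fills the competition/promo date fields with sentinel values in the cells below):

for df in (df_joined_train, df_joined_test):
    # binary flag recording whether the value was missing in the first place
    df['CompetitionDistanceIsNull'] = df['CompetitionDistance'].isnull().astype(np.int8)
    # fill missing entries with the training-set median (mean/majority work the same way)
    df['CompetitionDistance'] = df['CompetitionDistance'].fillna(
        df_joined_train['CompetitionDistance'].median())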

for df in (df_joined_train, df_joined_test):
    df['CompetitionOpenSinceYear'] = (df['CompetitionOpenSinceYear']
                                      .fillna(1900)
                                      .astype(np.int32))
    df['CompetitionOpenSinceMonth'] = (df['CompetitionOpenSinceMonth']
                                       .fillna(1)
                                       .astype(np.int32))
    df['Promo2SinceYear'] = df['Promo2SinceYear'].fillna(1900).astype(np.int32)
    df['Promo2SinceWeek'] = df['Promo2SinceWeek'].fillna(1).astype(np.int32)
for df in (df_joined_train, df_joined_test):
    df['CompetitionOpenSince'] = pd.to_datetime(dict(
        year=df['CompetitionOpenSinceYear'],
        month=df['CompetitionOpenSinceMonth'],
        day=15
    ))
    df['CompetitionDaysOpen'] = df['Date'].subtract(df['CompetitionOpenSince']).dt.days

For the CompetitionMonthsOpen field, we cap the value at 2 years (24 months) to limit the number of unique categories.

for df in (df_joined_train, df_joined_test):
    df['CompetitionMonthsOpen'] = df['CompetitionDaysOpen'] // 30
    df.loc[df['CompetitionMonthsOpen'] > 24, 'CompetitionMonthsOpen'] = 24
    df.loc[df['CompetitionMonthsOpen'] < -24, 'CompetitionMonthsOpen'] = -24

df_joined_train['CompetitionMonthsOpen'].unique()
array([ 24, 3, 19, 9, 16, 17, 7, 15, 22, 11, 13, 2, 23, 0, 12, 4, 10, 1, 14, 20, 8, 18, -1, 6, 21, 5, -2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24])
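As a side note, pandas' Series.clip expresses the same capping more compactly; the two .loc assignments above are equivalent to:

for df in (df_joined_train, df_joined_test):
    # bound the values to [-24, 24], same effect as the .loc assignments above
    df['CompetitionMonthsOpen'] = df['CompetitionMonthsOpen'].clip(-24, 24)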

We repeat the same process for the Promo2 fields.

from isoweek import Week

for df in (df_joined_train, df_joined_test):
    df['Promo2Since'] = pd.to_datetime(df.apply(lambda x: Week(
        x.Promo2SinceYear, x.Promo2SinceWeek).monday(), axis=1))
    df['Promo2Days'] = df['Date'].subtract(df['Promo2Since']).dt.days
for df in (df_joined_train, df_joined_test):
    df['Promo2Weeks'] = df['Promo2Days'] // 7
    df.loc[df['Promo2Weeks'] < 0, 'Promo2Weeks'] = 0
    df.loc[df['Promo2Weeks'] > 25, 'Promo2Weeks'] = 25

df_joined_train['Promo2Weeks'].unique()
array([25, 17, 8, 13, 24, 16, 7, 12, 23, 15, 6, 11, 22, 14, 5, 10, 21, 4, 9, 20, 3, 19, 2, 18, 1, 0])
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'StateName', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees', 'file', 'week', 'trend', 'trend_DE', 'CompetitionOpenSince', 'CompetitionDaysOpen', 'CompetitionMonthsOpen', 'Promo2Since', 'Promo2Days', 'Promo2Weeks'], dtype='object')

Durations

It is common when working with time series data to extract features that capture relationships across rows rather than between columns, e.g. time until the next event or time since the last event.

Here, we would like to compute features such as days since the last promotion and days until the next promotion. The same process can be repeated for state/school holidays.

columns = ['Date', 'Store', 'Promo', 'StateHoliday', 'SchoolHoliday']
df = df_joined_train[columns].append(df_joined_test[columns])
df['DateUnixSeconds'] = df['Date'].astype(np.int64) // 10 ** 9
df.head()
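One compatibility note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0 (this notebook runs on pandas 0.25, where it is fine); on newer versions the concatenation above would be written with pd.concat instead:

df = pd.concat([df_joined_train[columns], df_joined_test[columns]])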
@numba.njit
def compute_duration(store_arr, date_unix_seconds_arr, field_arr):
    """
    For each store, track the number of days since/before the occurrence of a field.
    The store and date are assumed to be already sorted.

    Parameters
    ----------
    store_arr : 1d ndarray[int]

    date_unix_seconds_arr : 1d ndarray[int]
        The date should be represented as a unix timestamp (seconds).

    field_arr : 1d ndarray[bool]/ndarray[int]
        The field that we're interested in. If int, it should take a value
        of 1/0 indicating whether the field/event occurred or not.

    Returns
    -------
    result : list[int]
        Days since/before the occurrence of a field.
    """
    result = []
    last_store = 0

    zipped = zip(store_arr, date_unix_seconds_arr, field_arr)
    for store, date_unix_seconds, field in zipped:
        # a new store resets the reference date
        if store != last_store:
            last_store = store
            last_date = date_unix_seconds

        # the event itself also resets the reference date
        if field:
            last_date = date_unix_seconds

        # 86400 seconds per day
        diff_day = (date_unix_seconds - last_date) // 86400
        result.append(diff_day)

    return result
df = df.sort_values(['Store', 'Date'])

start = time.time()
for col in ('SchoolHoliday', 'StateHoliday', 'Promo'):
    result = compute_duration(df['Store'].values,
                              df['DateUnixSeconds'].values,
                              df[col].values)
    df['After' + col] = result

end = time.time()
print('elapsed: ', end - start)
df.head(10)
elapsed: 0.97281813621521

If we look at the values in the AfterStateHoliday column, we can see that the first row of the StateHoliday column is True, so the corresponding AfterStateHoliday is 0, indicating that day is a state holiday. After encountering a state holiday, the AfterStateHoliday column keeps incrementing until it sees the next StateHoliday, which resets the counter.

Note that for Promo, the data starts out with a 0, yet AfterPromo starts accumulating until it sees the next Promo. We don't know exactly when the last promo before 2013-01-01 was, since we don't have data for it; nonetheless we still start incrementing the counter from the first available date. Another approach would be to fill these leading values with 0.
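To make this behaviour concrete, here is a tiny hand-made example (toy values, a single store, five consecutive days) run through compute_duration: the counter starts counting from the first available date, resets to 0 on each row where the event occurs, and then increments by one day per row.

store_arr = np.array([1, 1, 1, 1, 1])
date_arr = np.arange(5) * 86400          # unix timestamps, one day apart
field_arr = np.array([0, 0, 1, 0, 0])    # the event (e.g. a promo) occurs on the third day

compute_duration(store_arr, date_arr, field_arr)
# -> [0, 1, 0, 1, 2]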

df = df.sort_values(['Store', 'Date'], ascending=[True, False])

start = time.time()
for col in ('SchoolHoliday', 'StateHoliday', 'Promo'):
    result = compute_duration(df['Store'].values,
                              df['DateUnixSeconds'].values,
                              df[col].values)
    df['Before' + col] = result

end = time.time()
print('elapsed: ', end - start)
df.head(10)
elapsed: 0.7040410041809082

After creating these new features, we join them back to the original dataframes.

df = df.drop(['Promo', 'StateHoliday', 'SchoolHoliday', 'DateUnixSeconds'], axis=1)
df_joined_train = df_joined_train.merge(df, on=['Date', 'Store'], how='inner')
df_joined_test = df_joined_test.merge(df, on=['Date', 'Store'], how='inner')
print('dimension: ', df_joined_train.shape)
df_joined_train.head()
dimension: (1017209, 71)
df_joined_train.columns
Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start', 'Elapsed', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'State', 'StateName', 'Max_TemperatureC', 'Mean_TemperatureC', 'Min_TemperatureC', 'Dew_PointC', 'MeanDew_PointC', 'Min_DewpointC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressurehPa', 'Mean_Sea_Level_PressurehPa', 'Min_Sea_Level_PressurehPa', 'Max_VisibilityKm', 'Mean_VisibilityKm', 'Min_VisibilitykM', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'Max_Gust_SpeedKm_h', 'Precipitationmm', 'CloudCover', 'Events', 'WindDirDegrees', 'file', 'week', 'trend', 'trend_DE', 'CompetitionOpenSince', 'CompetitionDaysOpen', 'CompetitionMonthsOpen', 'Promo2Since', 'Promo2Days', 'Promo2Weeks', 'AfterSchoolHoliday', 'AfterStateHoliday', 'AfterPromo', 'BeforeSchoolHoliday', 'BeforeStateHoliday', 'BeforePromo'], dtype='object')

We save the cleaned data so we won't have to repeat this data preparation step again.

output_dir = 'cleaned_data'
if not os.path.isdir(output_dir):
    os.makedirs(output_dir, exist_ok=True)

engine = 'pyarrow'
output_path_train = os.path.join(output_dir, 'train_clean.parquet')
output_path_test = os.path.join(output_dir, 'test_clean.parquet')
df_joined_train.to_parquet(output_path_train, engine=engine)
df_joined_test.to_parquet(output_path_test, engine=engine)
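In a downstream notebook, the cleaned data can be loaded back with pandas' parquet reader (pd.read_parquet picks up the pyarrow engine automatically when it is installed):

df_joined_train = pd.read_parquet(os.path.join(output_dir, 'train_clean.parquet'))
df_joined_test = pd.read_parquet(os.path.join(output_dir, 'test_clean.parquet'))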

Reference