Path: blob/master/projects/kaggle_rossman_store_sales/rossman_data_prep.ipynb
Rossman Data Preparation
Individual Data Sources
In addition to the data provided by the competition, we will be using external datasets put together by participants in the Kaggle competition. We can download all of them here, then untar them in the directory that data_dir points to.
We turn state holidays into booleans to make them more convenient for modeling.
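As a minimal sketch, assuming the raw StateHoliday column uses '0' for regular days and 'a'/'b'/'c' for the different holiday types (the file path is illustrative):

```python
import pandas as pd

train = pd.read_csv('train.csv', low_memory=False)  # path is an assumption

# '0' marks a regular day; 'a', 'b', 'c' mark the holiday types.
# For modeling we only need to know whether the day is a holiday at all.
train['StateHoliday'] = train['StateHoliday'] != '0'
```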
For the weather and state names data, we perform a join on the state name field to create a single dataframe.
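A sketch of that join, assuming the weather table carries the full state name in a file column and the lookup table maps StateName to the state abbreviation (the column and file names are assumptions):

```python
import pandas as pd

weather = pd.read_csv('weather.csv')          # one row per state name per day
state_names = pd.read_csv('state_names.csv')  # maps full state name to abbreviation

# Left join on the full state name so every weather record carries
# the state abbreviation used by the rest of the tables.
weather = weather.merge(state_names, how='left',
                        left_on='file', right_on='StateName')
```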
For the google trend data, we extract the state and date information from the raw dataset and replace all instances of the state name 'NI' with 'HB,NI' to match the usage in the rest of the data.
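A sketch of the extraction, assuming file values like 'Rossmann_DE_SN' and week values like '2012-12-02 - 2012-12-08' (the column names are taken from the raw download and are assumptions):

```python
import pandas as pd

googletrend = pd.read_csv('googletrend.csv')  # file name is illustrative

# The state is the third underscore-separated token of 'file',
# and the date is the first half of the 'week' range.
googletrend['Date'] = googletrend.week.str.split(' - ', expand=True)[0]
googletrend['State'] = googletrend.file.str.split('_', expand=True)[2]

# 'NI' in the trend data corresponds to 'HB,NI' everywhere else.
googletrend.loc[googletrend.State == 'NI', 'State'] = 'HB,NI'
```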
The following code chunk extracts particular date fields from a complete datetime for the purpose of constructing categoricals.
We should always consider this feature extraction step when working with date-times. Without expanding a date-time into these additional fields, we can't capture any trend or cyclical behavior as a function of time at any of these granularities. We'll add these fields to every table that has a date field.
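A stripped-down sketch of such a helper (libraries like fastai ship a fuller add_datepart that also emits flags such as start/end of month):

```python
import pandas as pd

def add_datepart(df, colname):
    """Expand a datetime column into categorical date components."""
    col = pd.to_datetime(df[colname])
    df['Year'] = col.dt.year
    df['Month'] = col.dt.month
    df['Week'] = col.dt.isocalendar().week.astype(int)
    df['Day'] = col.dt.day
    df['Dayofweek'] = col.dt.dayofweek
    return df
```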
The Google trends data has a special category for the whole of Germany - we'll pull that out so we can use it explicitly.
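Continuing the sketch above, the country-wide rows can be pulled out by their file label (the exact label 'Rossmann_DE' is an assumption about the raw file names):

```python
# Trend rows without a state suffix cover all of Germany.
trend_de = googletrend[googletrend.file == 'Rossmann_DE']
```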
Merging Various Data Sources
Now we can outer join all of our data into a single dataframe. Recall that in an outer join, whenever a value in the joining field of the left table has no corresponding value in the right table, the resulting row contains null values for all of the right table's fields. One way to check that all records are consistent and complete is to check for null values post-join, as we do here.
Aside: why not just do an inner join? If we are sure that all records are complete and match on the desired field, an inner join will do the same thing as an outer join. But when we are not sure, an outer join followed by a null check will catch any mismatch. (Comparing the number of rows before and after an inner join is an equivalent approach.)
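A minimal sketch of the join-then-check pattern, assuming the sales records live in train and the store metadata in store (paths and column names are illustrative):

```python
import pandas as pd

train = pd.read_csv('train.csv', low_memory=False)
store = pd.read_csv('store.csv')

# Left outer join: train rows with no matching store get NaN
# in every store column.
joined = train.merge(store, how='left', on='Store')

# A count of zero confirms every record found a match.
print(joined['StoreType'].isnull().sum())
```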
During the merging process, we'll print out the first few rows of the dataframe and the column names so we can keep track of how the dataframe evolves as we join each new data source.
Final Data
After merging all the various data sources to create our master dataframe, we'll perform some additional feature engineering steps, including:
Some of the rows contain missing values in some columns, so we'll impute them here. Which values to impute is pretty subjective when we don't know the root cause of why the data is missing, so we won't spend too much time on it. One common strategy for imputing missing categorical features is to pick an arbitrary signal value that doesn't otherwise appear in the data, e.g. -1 or -999. Another is to impute the mean or majority value and create an additional binary column indicating whether the value was missing in the first place.
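A minimal sketch of both strategies, assuming the merged dataframe is named joined and using columns from the store table:

```python
# Numeric column: impute the median and keep a missing-value indicator.
joined['CompetitionDistance_missing'] = joined['CompetitionDistance'].isnull()
joined['CompetitionDistance'] = joined['CompetitionDistance'].fillna(
    joined['CompetitionDistance'].median())

# Categorical-ish column: use a signal value that never occurs naturally.
joined['CompetitionOpenSinceYear'] = (
    joined['CompetitionOpenSinceYear'].fillna(1900).astype(int))
```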
Create some duration features with the Competition and Promo columns.
For the CompetitionMonthsOpen field, we cap the value at 2 years to limit the number of unique categories.
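A sketch of this step, assuming Date is already a datetime column, the CompetitionOpenSince* columns were imputed above, and using the 15th as an arbitrary mid-month anchor:

```python
import pandas as pd

# Build the competition-open date from its year/month parts, convert to a
# rough month count, and cap at 24 months (2 years).
joined['CompetitionOpenSince'] = pd.to_datetime(dict(
    year=joined.CompetitionOpenSinceYear,
    month=joined.CompetitionOpenSinceMonth,
    day=15))
joined['CompetitionDaysOpen'] = (joined.Date - joined.CompetitionOpenSince).dt.days
joined['CompetitionMonthsOpen'] = joined['CompetitionDaysOpen'] // 30
joined.loc[joined.CompetitionMonthsOpen > 24, 'CompetitionMonthsOpen'] = 24
```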
Repeat the same process for Promo.
Durations
It is common when working with time series data to extract features that capture relationships across rows instead of between columns, e.g. time until the next event or time since the last event.
Here, we would like to compute features such as days since the last promotion and days until the next promotion. The same process can then be repeated for the state and school holidays.
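A simplified single-store sketch of the elapsed-time computation (the actual version also resets the counter whenever the store id changes; all names here are illustrative):

```python
import numpy as np
import pandas as pd

def days_since_event(dates, flags):
    """Days since the most recent True flag; NaN before the first event."""
    out, last = [], None
    for d, flag in zip(dates, flags):
        if flag:
            last = d
        out.append(np.nan if last is None else (d - last).days)
    return out

df = pd.DataFrame({'Date': pd.date_range('2013-01-01', periods=5),
                   'Promo': [True, False, False, True, False]})
df['AfterPromo'] = days_since_event(df.Date, df.Promo)

# "Days until the next event" is the same pass over the reversed rows,
# with the sign flipped.
df['BeforePromo'] = [-d for d in
                     days_since_event(df.Date[::-1], df.Promo[::-1])[::-1]]
# AfterPromo  -> [0, 1, 2, 0, 1]
# BeforePromo -> [0, 2, 1, 0, nan]
```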
If we look at the values in the AfterStateHoliday column, we can see that the first row of the StateHoliday column is True, so the corresponding AfterStateHoliday is 0, indicating that day is itself a state holiday. After encountering a state holiday, the AfterStateHoliday column starts incrementing until it sees the next StateHoliday, which resets the counter.
Note that Promo starts out as 0, yet AfterPromo begins accumulating right away until it sees the next Promo. We're not exactly sure when the last promo before 2013-01-01 occurred, since we don't have data for that period; nonetheless, we still start incrementing the counter. Another approach would be to fill these leading values with 0.
After creating these new features, we join them back to the original dataframe.
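For instance, if those features live in a hypothetical durations dataframe keyed by Store and Date:

```python
# Left merge so every original row is kept even if a key is missing.
joined = joined.merge(durations, how='left', on=['Store', 'Date'])
```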
We save the cleaned data so we won't have to repeat this data preparation step again.
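For example (the format and file name are illustrative):

```python
# Pickle preserves dtypes, unlike a round-trip through CSV.
joined.to_pickle('joined_prepared.pkl')
```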