GitHub Repository: ethen8181/machine-learning
Path: blob/master/projects/kaggle_rossman_store_sales/rossman_gbt.ipynb
Kernel: Python 3
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme

themes = get_themes()
set_nb_theme(themes[3])
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import json
import time
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,sklearn
Ethen 2019-08-09 13:25:04

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
sklearn 0.21.2

Rossmann GBT Modeling

Data Preparation

We've done most of our data preparation and feature engineering in the previous notebook. We'll still perform a few additional steps here, but this notebook focuses on getting the data ready for fitting a Gradient Boosted Tree model. For the model, we will be leveraging lightgbm.

data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')

engine = 'pyarrow'
df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
train dimension:  (1017209, 71)
test dimension:  (41088, 70)

We've pulled most of our configurable parameters out into a json configuration file. Ideally, we could move all of our code into a python script and only change the configuration file to experiment with different types of settings and see which one leads to the best overall performance.

config_path = os.path.join('config', 'gbt_training_template.json')
with open(config_path) as f:
    config_file = json.load(f)

config_file
{'columns': {'num_cols_pattern': ['CloudCover', 'CompetitionDistance', 'Max_Humidity',
    'Max_TemperatureC', 'Max_Wind_SpeedKm_h', 'Mean_Humidity', 'Mean_TemperatureC',
    'Mean_Wind_SpeedKm_h', 'Min_Humidity', 'Min_TemperatureC', 'Promo', 'SchoolHoliday',
    'trend', 'trend_DE', 'AfterSchoolHoliday', 'AfterStateHoliday', 'AfterPromo',
    'BeforeSchoolHoliday', 'BeforeStateHoliday', 'BeforePromo'],
  'cat_cols_pattern': ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear',
    'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks',
    'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year'],
  'id_cols': ['Id'],
  'label_col': 'Sales',
  'weights_col': None},
 'model_task': 'regression',
 'model_type': 'lgb',
 'model_parameters': {'lgb': {'n_jobs': -1,
   'learning_rate': 0.01,
   'n_estimators': 3000,
   'min_data_in_leaf': 100}},
 'model_hyper_parameters': {'lgb': {'max_depth': [3, 5, 8, 10, 12],
   'colsampl_bytree': [0.7, 0.8, 0.9],
   'subsample': [0.7, 0.8, 0.9]}},
 'model_fit_parameters': {'lgb': {'eval_metric': 'l2',
   'early_stopping_rounds': 5,
   'verbose': 100}},
 'search_parameters': {'n_iter': 3,
  'n_jobs': -1,
  'verbose': 1,
  'scoring': 'neg_mean_squared_error',
  'random_state': 1234,
  'return_train_score': True}}
# extract settings from the configuration file into local variables
columns = config_file['columns']
num_cols = columns['num_cols_pattern']
cat_cols = columns['cat_cols_pattern']
id_cols = columns['id_cols']
label_col = columns['label_col']
weights_col = columns['weights_col']

model_task = config_file['model_task']
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
model_hyper_parameters = config_file['model_hyper_parameters'][model_type]
model_fit_parameters = config_file['model_fit_parameters'][model_type]
search_parameters = config_file['search_parameters']

Here, we will remove all records where the store had zero sales / was closed (feel free to experiment with keeping the zero-sales records and see if it improves performance).

We also perform a train/validation split. The validation split will be used in our hyperparameter tuning process and for early stopping. Notice that because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on time/date.

Our training data is already sorted by date in decreasing order, hence we can create the validation set by checking how big our test set is and selecting the top-N observations to form a validation set of similar size to the test set. We say similar size rather than exact size because we make sure that all records from the same date fall entirely within either the training set or the validation set.

df_train = df_train[df_train[label_col] != 0].reset_index(drop=True)

mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
val_index = df_train.loc[mask, 'Date'].index.max()
val_index
41395

The validation fold we're creating is used by sklearn's PredefinedSplit, where we set the entry to 0 for all samples that belong to the validation set, and to -1 for all samples that should always stay in the training set.

val_fold = np.full(df_train.shape[0], fill_value=-1)
val_fold[:(val_index + 1)] = 0
val_fold
array([ 0, 0, 0, ..., -1, -1, -1])
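
For reference, here is a minimal sketch of how such a fold array plugs into sklearn; the helper pipeline used later in this notebook presumably wires this up internally, so the snippet is purely illustrative.

from sklearn.model_selection import PredefinedSplit

# samples marked -1 always stay in the training split, samples marked 0 form
# the single validation fold used for hyperparameter tuning
cv = PredefinedSplit(test_fold=val_fold)
cv.get_n_splits()  # 1, i.e. a single train/validation split instead of k-fold CV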

Here, we assign the validation fold back to the original dataframe to illustrate the point; this is technically not required for the rest of the pipeline. Notice in the dataframe that we've printed out, the last record's date, 2015-06-18, is different from the rest, and that record's val_fold takes on a value of -1. This means that all records on or after 2015-06-19 will become our validation set.

df_train['val_fold'] = val_fold
df_train[(val_index - 2):(val_index + 2)]

We proceed to extract the necessary columns, both numerical and categorical, that we'll use for modeling.

# the model id is used as the indicator when saving the model
model_id = 'gbt'

input_cols = num_cols + cat_cols
df_train = df_train[input_cols + [label_col]]

# we will perform the modeling at the log-scale
df_train[label_col] = np.log(df_train[label_col])

df_test = df_test[input_cols + id_cols]
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
train dimension:  (844338, 37)
test dimension:  (41088, 37)
for cat_col in cat_cols:
    df_train[cat_col] = df_train[cat_col].astype('category')
    df_test[cat_col] = df_test[cat_col].astype('category')

df_train.head()

Model Training

We use a helper class to train a boosted tree model, generate predictions on our test set, create the submission file, check the feature importance of the tree-based model, and make sure we can save and re-load the model.

from gbt_module.model import GBTPipeline

model = GBTPipeline(input_cols, cat_cols, label_col, weights_col,
                    model_task, model_id, model_type,
                    model_parameters, model_hyper_parameters, search_parameters)
model
GBTPipeline(cat_cols=['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year'], input_cols=['CloudCover', 'CompetitionDistance', 'Max_Humidity', 'Max_TemperatureC', 'Max_Wind_SpeedKm_h', 'Mean_Humidity', 'Mean... 'max_depth': [3, 5, 8, 10, 12], 'subsample': [0.7, 0.8, 0.9]}, model_id='gbt', model_parameters={'learning_rate': 0.01, 'min_data_in_leaf': 100, 'n_estimators': 3000, 'n_jobs': -1}, model_task='regression', model_type='lgb', search_parameters={'n_iter': 3, 'n_jobs': -1, 'random_state': 1234, 'return_train_score': True, 'scoring': 'neg_mean_squared_error', 'verbose': 1}, weights_col=None)
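
To make the moving pieces concrete, a wrapper with roughly the following shape would behave like the object above: a LightGBM estimator tuned with a randomized search over the single predefined validation fold, plus convenience methods for prediction and checkpointing. This is an illustrative sketch under those assumptions; the actual GBTPipeline implementation lives in gbt_module.model and may differ in its details.

# illustrative sketch of a GBTPipeline-style wrapper (assumed structure)
import joblib
from lightgbm import LGBMRegressor
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV


class SimpleGBTPipeline:

    def __init__(self, input_cols, label_col,
                 model_parameters, model_hyper_parameters, search_parameters):
        self.input_cols = input_cols
        self.label_col = label_col
        self.model_parameters = model_parameters
        self.model_hyper_parameters = model_hyper_parameters
        self.search_parameters = search_parameters

    def fit(self, df, val_fold, fit_parameters=None):
        # a single predefined train/validation split instead of k-fold cv;
        # for early stopping the real helper would also need to pass an
        # eval_set built from the validation fold (omitted here for brevity)
        cv = PredefinedSplit(test_fold=val_fold)
        estimator = LGBMRegressor(**self.model_parameters)
        self.model_tuned_ = RandomizedSearchCV(
            estimator, self.model_hyper_parameters,
            cv=cv, **self.search_parameters)
        self.model_tuned_.fit(df[self.input_cols], df[self.label_col],
                              **(fit_parameters or {}))
        return self

    def predict(self, df):
        return self.model_tuned_.predict(df[self.input_cols])

    def save(self, path):
        joblib.dump(self, path)

    @classmethod
    def load(cls, path):
        return joblib.load(path)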
start = time.time()
model.fit(df_train, val_fold, model_fit_parameters)
elapsed = time.time() - start
print('elapsed minutes: ', elapsed / 60)
Fitting 1 folds for each of 3 candidates, totalling 3 fits
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/__init__.py:46: UserWarning: Starting from version 2.2.1, the library file in distribution wheels for macOS is built by the Apple Clang (Xcode_8.3.3) compiler. This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore. Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler. You can install the OpenMP library by the following command: ``brew install libomp``.
  "You can install the OpenMP library by the following command: ``brew install libomp``.", UserWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 20.0min finished
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py:1209: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
Training until validation scores don't improve for 5 rounds.
[100]    valid_0's l2: 0.0718742    valid_1's l2: 0.0654301
[200]    valid_0's l2: 0.0450121    valid_1's l2: 0.0404465
[300]    valid_0's l2: 0.0336353    valid_1's l2: 0.0314702
[400]    valid_0's l2: 0.027701     valid_1's l2: 0.0269131
[500]    valid_0's l2: 0.0240796    valid_1's l2: 0.0240597
[600]    valid_0's l2: 0.0218724    valid_1's l2: 0.0218025
[700]    valid_0's l2: 0.0201603    valid_1's l2: 0.0201185
[800]    valid_0's l2: 0.0184851    valid_1's l2: 0.0184089
[900]    valid_0's l2: 0.0168288    valid_1's l2: 0.0166787
[1000]   valid_0's l2: 0.015335     valid_1's l2: 0.0151688
[1100]   valid_0's l2: 0.0142658    valid_1's l2: 0.0141122
[1200]   valid_0's l2: 0.0135508    valid_1's l2: 0.013411
[1300]   valid_0's l2: 0.012954     valid_1's l2: 0.0128038
[1400]   valid_0's l2: 0.0124204    valid_1's l2: 0.0122977
[1500]   valid_0's l2: 0.0119961    valid_1's l2: 0.011941
[1600]   valid_0's l2: 0.01163      valid_1's l2: 0.0115995
[1700]   valid_0's l2: 0.0113139    valid_1's l2: 0.0112803
[1800]   valid_0's l2: 0.0110184    valid_1's l2: 0.0109884
[1900]   valid_0's l2: 0.0107764    valid_1's l2: 0.0107527
[2000]   valid_0's l2: 0.0105697    valid_1's l2: 0.0105339
[2100]   valid_0's l2: 0.0103475    valid_1's l2: 0.0103047
[2200]   valid_0's l2: 0.0101752    valid_1's l2: 0.0101483
[2300]   valid_0's l2: 0.0100167    valid_1's l2: 0.00999843
[2400]   valid_0's l2: 0.00986898   valid_1's l2: 0.00983577
[2500]   valid_0's l2: 0.00972949   valid_1's l2: 0.00969056
[2600]   valid_0's l2: 0.00960752   valid_1's l2: 0.00955977
[2700]   valid_0's l2: 0.00948489   valid_1's l2: 0.00944049
[2800]   valid_0's l2: 0.00936427   valid_1's l2: 0.00928775
[2900]   valid_0's l2: 0.00924005   valid_1's l2: 0.00915616
[3000]   valid_0's l2: 0.00912308   valid_1's l2: 0.00904488
Did not meet early stopping. Best iteration is:
[3000]   valid_0's l2: 0.00912308   valid_1's l2: 0.00904488
elapsed minutes:  22.69019781748454
# inspect the hyperparameter search results for each of the candidates
pd.DataFrame(model.model_tuned_.cv_results_)
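
Beyond the full cv_results_ table, the standard RandomizedSearchCV attributes give a quick summary of the winning configuration (available here assuming the search was run with the default refit=True):

# best hyperparameter combination found by the search and its validation
# score (negative mean squared error, per the scoring set in the config file)
print(model.model_tuned_.best_params_)
print(model.model_tuned_.best_score_)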
# we logged our label, remember to exponentiate it back to the original scale
prediction_test = model.predict(df_test[input_cols])
df_test[label_col] = np.exp(prediction_test)

submission_cols = id_cols + [label_col]
df_test[submission_cols] = df_test[submission_cols].astype('int')

submission_dir = 'submission'
if not os.path.isdir(submission_dir):
    os.makedirs(submission_dir, exist_ok=True)

submission_file = 'rossmann_submission_{}.csv'.format(model_id)
submission_path = os.path.join(submission_dir, submission_file)
df_test[submission_cols].to_csv(submission_path, index=False)
df_test[submission_cols].head()
model.get_feature_importance()
[('Store', 0.6707),
 ('Promo', 0.1081),
 ('BeforePromo', 0.0653),
 ('DayOfWeek', 0.0574),
 ('Week', 0.0463),
 ('Day', 0.019),
 ('AfterStateHoliday', 0.0049),
 ('BeforeStateHoliday', 0.0045),
 ('CompetitionDistance', 0.0041),
 ('StoreType', 0.0039),
 ('Month', 0.003),
 ('Year', 0.0022),
 ('State', 0.0016),
 ('CompetitionOpenSinceYear', 0.0015),
 ('AfterPromo', 0.001)]
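
The importances reported above are normalized to sum to one. For reference, a comparable gain-based ranking can be pulled directly from the underlying booster along the following lines; this is a sketch only, assuming the tuned search object exposes a best_estimator_, and the helper's own get_feature_importance may compute the numbers differently.

# sketch: normalized gain-based feature importance from the best estimator
booster = model.model_tuned_.best_estimator_.booster_
gain = booster.feature_importance(importance_type='gain')
gain = gain / gain.sum()
sorted(zip(booster.feature_name(), np.round(gain, 4)),
       key=lambda pair: pair[1], reverse=True)[:15]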
model_checkpoint = os.path.join('models', model_id + '.pkl')
model.save(model_checkpoint)

loaded_model = GBTPipeline.load(model_checkpoint)

# print the cv_results_ again to ensure the checkpointing works
pd.DataFrame(loaded_model.model_tuned_.cv_results_)