GitHub Repository: ethen8181/machine-learning
Path: blob/master/projects/kaggle_rossman_store_sales/rossman_gbt.ipynb
Kernel: Python 3
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme

themes = get_themes()
set_nb_theme(themes[3])
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import json
import time
import numpy as np
import pandas as pd

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,pyarrow,sklearn
Ethen 2019-08-09 13:25:04

CPython 3.6.4
IPython 7.7.0

numpy 1.17.0
pandas 0.25.0
pyarrow 0.14.1
sklearn 0.21.2

Rossmann GBT Modeling

Data Preparation

We've done most of our data preparation and feature engineering in the previous notebook. We'll still perform a few additional steps here, but this notebook focuses on getting the data ready for fitting a Gradient Boosted Tree model. For the model, we will be leveraging lightgbm.

data_dir = 'cleaned_data'
path_train = os.path.join(data_dir, 'train_clean.parquet')
path_test = os.path.join(data_dir, 'test_clean.parquet')

engine = 'pyarrow'
df_train = pd.read_parquet(path_train, engine)
df_test = pd.read_parquet(path_test, engine)
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
train dimension:  (1017209, 71)
test dimension:  (41088, 70)

We've pulled most of our configurable parameters out into a json configuration file. Ideally, we could move all of our code into a python script and only change the configuration file to experiment with different types of settings and see which one leads to the best overall performance.

config_path = os.path.join('config', 'gbt_training_template.json')
with open(config_path) as f:
    config_file = json.load(f)

config_file
{'columns': {'num_cols_pattern': ['CloudCover', 'CompetitionDistance', 'Max_Humidity',
    'Max_TemperatureC', 'Max_Wind_SpeedKm_h', 'Mean_Humidity', 'Mean_TemperatureC',
    'Mean_Wind_SpeedKm_h', 'Min_Humidity', 'Min_TemperatureC', 'Promo', 'SchoolHoliday',
    'trend', 'trend_DE', 'AfterSchoolHoliday', 'AfterStateHoliday', 'AfterPromo',
    'BeforeSchoolHoliday', 'BeforeStateHoliday', 'BeforePromo'],
  'cat_cols_pattern': ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear',
    'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks',
    'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year'],
  'id_cols': ['Id'],
  'label_col': 'Sales',
  'weights_col': None},
 'model_task': 'regression',
 'model_type': 'lgb',
 'model_parameters': {'lgb': {'n_jobs': -1,
   'learning_rate': 0.01,
   'n_estimators': 3000,
   'min_data_in_leaf': 100}},
 'model_hyper_parameters': {'lgb': {'max_depth': [3, 5, 8, 10, 12],
   'colsampl_bytree': [0.7, 0.8, 0.9],
   'subsample': [0.7, 0.8, 0.9]}},
 'model_fit_parameters': {'lgb': {'eval_metric': 'l2',
   'early_stopping_rounds': 5,
   'verbose': 100}},
 'search_parameters': {'n_iter': 3,
  'n_jobs': -1,
  'verbose': 1,
  'scoring': 'neg_mean_squared_error',
  'random_state': 1234,
  'return_train_score': True}}
# extract settings from the configuration file into local variables
columns = config_file['columns']
num_cols = columns['num_cols_pattern']
cat_cols = columns['cat_cols_pattern']
id_cols = columns['id_cols']
label_col = columns['label_col']
weights_col = columns['weights_col']

model_task = config_file['model_task']
model_type = config_file['model_type']
model_parameters = config_file['model_parameters'][model_type]
model_hyper_parameters = config_file['model_hyper_parameters'][model_type]
model_fit_parameters = config_file['model_fit_parameters'][model_type]
search_parameters = config_file['search_parameters']

Here, we will remove all records where the store had zero sales / was closed (feel free to experiment with keeping the zero-sales records and see if it improves performance).

We also perform a train/validation split. The validation split will be used in our hyperparameter tuning process and for early stopping. Notice that because this is a time series application, where we are trying to predict different stores' daily sales, it's important not to perform a random train/test split, but instead to divide the training and validation sets based on time/date.

Our training data is already sorted by date in decreasing order, hence we can create the validation set by checking how big our test set is and selecting the top-N observations to form a validation set of similar size to the test set. We say similar size rather than exact size because we make sure that all records from the same date fall entirely within either the training set or the validation set.

df_train = df_train[df_train[label_col] != 0].reset_index(drop=True)

mask = df_train['Date'] == df_train['Date'].iloc[len(df_test)]
val_index = df_train.loc[mask, 'Date'].index.max()
val_index
41395

The validation fold we're creating is used by sklearn's PredefinedSplit, where we set the entry to 0 for all samples that belong to the validation set, and to -1 for all samples that should always stay in the training set.

val_fold = np.full(df_train.shape[0], fill_value=-1)
val_fold[:(val_index + 1)] = 0
val_fold
array([ 0, 0, 0, ..., -1, -1, -1])
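
For reference, here is a minimal sketch of how such a fold array plugs into sklearn; the helper pipeline used later in this notebook presumably wires this up internally, so the snippet is purely illustrative.

from sklearn.model_selection import PredefinedSplit

# samples marked -1 always stay in the training split, samples marked 0 form
# the single validation fold used for hyperparameter tuning
cv = PredefinedSplit(test_fold=val_fold)
cv.get_n_splits()  # 1, i.e. a single train/validation split instead of k-fold CV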

Here, we assign the validation fold back to the original dataframe to illustrate the point; this is technically not required for the rest of the pipeline. Notice in the dataframe that we've printed out, the last record's date, 2015-06-18, is different from the rest, and that record's val_fold takes on a value of -1. This means that all records on or after 2015-06-19 will become our validation set.

df_train['val_fold'] = val_fold
df_train[(val_index - 2):(val_index + 2)]

We proceed to extract the necessary columns, both numerical and categorical, that we'll use for modeling.

# the model id is used as the indicator when saving the model
model_id = 'gbt'

input_cols = num_cols + cat_cols
df_train = df_train[input_cols + [label_col]]

# we will perform the modeling at the log-scale
df_train[label_col] = np.log(df_train[label_col])

df_test = df_test[input_cols + id_cols]
print('train dimension: ', df_train.shape)
print('test dimension: ', df_test.shape)
df_train.head()
train dimension:  (844338, 37)
test dimension:  (41088, 37)
for cat_col in cat_cols:
    df_train[cat_col] = df_train[cat_col].astype('category')
    df_test[cat_col] = df_test[cat_col].astype('category')

df_train.head()

Model Training

We use a helper class to train a boosted tree model, generate predictions on our test set, create the submission file, check the feature importance of the tree-based model, and make sure we can save and re-load the model.

from gbt_module.model import GBTPipeline

model = GBTPipeline(input_cols, cat_cols, label_col, weights_col,
                    model_task, model_id, model_type,
                    model_parameters, model_hyper_parameters, search_parameters)
model
GBTPipeline(cat_cols=['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year'], input_cols=['CloudCover', 'CompetitionDistance', 'Max_Humidity', 'Max_TemperatureC', 'Max_Wind_SpeedKm_h', 'Mean_Humidity', 'Mean... 'max_depth': [3, 5, 8, 10, 12], 'subsample': [0.7, 0.8, 0.9]}, model_id='gbt', model_parameters={'learning_rate': 0.01, 'min_data_in_leaf': 100, 'n_estimators': 3000, 'n_jobs': -1}, model_task='regression', model_type='lgb', search_parameters={'n_iter': 3, 'n_jobs': -1, 'random_state': 1234, 'return_train_score': True, 'scoring': 'neg_mean_squared_error', 'verbose': 1}, weights_col=None)
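
To make the moving pieces concrete, a wrapper with roughly the following shape would behave like the object above: a LightGBM estimator tuned with a randomized search over the single predefined validation fold, plus convenience methods for prediction and checkpointing. This is an illustrative sketch under those assumptions; the actual GBTPipeline implementation lives in gbt_module.model and may differ in its details.

# illustrative sketch of a GBTPipeline-style wrapper (assumed structure)
import joblib
from lightgbm import LGBMRegressor
from sklearn.model_selection import PredefinedSplit, RandomizedSearchCV


class SimpleGBTPipeline:

    def __init__(self, input_cols, label_col,
                 model_parameters, model_hyper_parameters, search_parameters):
        self.input_cols = input_cols
        self.label_col = label_col
        self.model_parameters = model_parameters
        self.model_hyper_parameters = model_hyper_parameters
        self.search_parameters = search_parameters

    def fit(self, df, val_fold, fit_parameters=None):
        # a single predefined train/validation split instead of k-fold cv;
        # for early stopping the real helper would also need to pass an
        # eval_set built from the validation fold (omitted here for brevity)
        cv = PredefinedSplit(test_fold=val_fold)
        estimator = LGBMRegressor(**self.model_parameters)
        self.model_tuned_ = RandomizedSearchCV(
            estimator, self.model_hyper_parameters,
            cv=cv, **self.search_parameters)
        self.model_tuned_.fit(df[self.input_cols], df[self.label_col],
                              **(fit_parameters or {}))
        return self

    def predict(self, df):
        return self.model_tuned_.predict(df[self.input_cols])

    def save(self, path):
        joblib.dump(self, path)

    @classmethod
    def load(cls, path):
        return joblib.load(path)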
start = time.time()
model.fit(df_train, val_fold, model_fit_parameters)
elapsed = time.time() - start
print('elapsed minutes: ', elapsed / 60)
Fitting 1 folds for each of 3 candidates, totalling 3 fits
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/__init__.py:46: UserWarning: Starting from version 2.2.1, the library file in distribution wheels for macOS is built by the Apple Clang (Xcode_8.3.3) compiler. This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore. Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler. You can install the OpenMP library by the following command: ``brew install libomp``.
  "You can install the OpenMP library by the following command: ``brew install libomp``.", UserWarning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 20.0min finished
/Users/mingyuliu/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py:1209: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['Assortment', 'CompetitionMonthsOpen', 'CompetitionOpenSinceYear', 'Day', 'DayOfWeek', 'Events', 'Month', 'Promo2SinceYear', 'Promo2Weeks', 'PromoInterval', 'State', 'StateHoliday', 'Store', 'StoreType', 'Week', 'Year']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
Training until validation scores don't improve for 5 rounds.
[100]    valid_0's l2: 0.0718742    valid_1's l2: 0.0654301
[200]    valid_0's l2: 0.0450121    valid_1's l2: 0.0404465
[300]    valid_0's l2: 0.0336353    valid_1's l2: 0.0314702
[400]    valid_0's l2: 0.027701     valid_1's l2: 0.0269131
[500]    valid_0's l2: 0.0240796    valid_1's l2: 0.0240597
[600]    valid_0's l2: 0.0218724    valid_1's l2: 0.0218025
[700]    valid_0's l2: 0.0201603    valid_1's l2: 0.0201185
[800]    valid_0's l2: 0.0184851    valid_1's l2: 0.0184089
[900]    valid_0's l2: 0.0168288    valid_1's l2: 0.0166787
[1000]   valid_0's l2: 0.015335     valid_1's l2: 0.0151688
[1100]   valid_0's l2: 0.0142658    valid_1's l2: 0.0141122
[1200]   valid_0's l2: 0.0135508    valid_1's l2: 0.013411
[1300]   valid_0's l2: 0.012954     valid_1's l2: 0.0128038
[1400]   valid_0's l2: 0.0124204    valid_1's l2: 0.0122977
[1500]   valid_0's l2: 0.0119961    valid_1's l2: 0.011941
[1600]   valid_0's l2: 0.01163      valid_1's l2: 0.0115995
[1700]   valid_0's l2: 0.0113139    valid_1's l2: 0.0112803
[1800]   valid_0's l2: 0.0110184    valid_1's l2: 0.0109884
[1900]   valid_0's l2: 0.0107764    valid_1's l2: 0.0107527
[2000]   valid_0's l2: 0.0105697    valid_1's l2: 0.0105339
[2100]   valid_0's l2: 0.0103475    valid_1's l2: 0.0103047
[2200]   valid_0's l2: 0.0101752    valid_1's l2: 0.0101483
[2300]   valid_0's l2: 0.0100167    valid_1's l2: 0.00999843
[2400]   valid_0's l2: 0.00986898   valid_1's l2: 0.00983577
[2500]   valid_0's l2: 0.00972949   valid_1's l2: 0.00969056
[2600]   valid_0's l2: 0.00960752   valid_1's l2: 0.00955977
[2700]   valid_0's l2: 0.00948489   valid_1's l2: 0.00944049
[2800]   valid_0's l2: 0.00936427   valid_1's l2: 0.00928775
[2900]   valid_0's l2: 0.00924005   valid_1's l2: 0.00915616
[3000]   valid_0's l2: 0.00912308   valid_1's l2: 0.00904488
Did not meet early stopping. Best iteration is:
[3000]   valid_0's l2: 0.00912308   valid_1's l2: 0.00904488
elapsed minutes:  22.69019781748454
# inspect the hyperparameter search results for each of the candidates
pd.DataFrame(model.model_tuned_.cv_results_)
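
Beyond the full cv_results_ table, the standard RandomizedSearchCV attributes give a quick summary of the winning configuration (available here assuming the search was run with the default refit=True):

# best hyperparameter combination found by the search and its validation
# score (negative mean squared error, per the scoring set in the config file)
print(model.model_tuned_.best_params_)
print(model.model_tuned_.best_score_)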
# we logged our label, remember to exponentiate it back to the original scale
prediction_test = model.predict(df_test[input_cols])
df_test[label_col] = np.exp(prediction_test)

submission_cols = id_cols + [label_col]
df_test[submission_cols] = df_test[submission_cols].astype('int')

submission_dir = 'submission'
if not os.path.isdir(submission_dir):
    os.makedirs(submission_dir, exist_ok=True)

submission_file = 'rossmann_submission_{}.csv'.format(model_id)
submission_path = os.path.join(submission_dir, submission_file)
df_test[submission_cols].to_csv(submission_path, index=False)
df_test[submission_cols].head()
model.get_feature_importance()
[('Store', 0.6707),
 ('Promo', 0.1081),
 ('BeforePromo', 0.0653),
 ('DayOfWeek', 0.0574),
 ('Week', 0.0463),
 ('Day', 0.019),
 ('AfterStateHoliday', 0.0049),
 ('BeforeStateHoliday', 0.0045),
 ('CompetitionDistance', 0.0041),
 ('StoreType', 0.0039),
 ('Month', 0.003),
 ('Year', 0.0022),
 ('State', 0.0016),
 ('CompetitionOpenSinceYear', 0.0015),
 ('AfterPromo', 0.001)]
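
The importances reported above are normalized to sum to one. For reference, a comparable gain-based ranking can be pulled directly from the underlying booster along the following lines; this is a sketch only, assuming the tuned search object exposes a best_estimator_, and the helper's own get_feature_importance may compute the numbers differently.

# sketch: normalized gain-based feature importance from the best estimator
booster = model.model_tuned_.best_estimator_.booster_
gain = booster.feature_importance(importance_type='gain')
gain = gain / gain.sum()
sorted(zip(booster.feature_name(), np.round(gain, 4)),
       key=lambda pair: pair[1], reverse=True)[:15]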
model_checkpoint = os.path.join('models', model_id + '.pkl')
model.save(model_checkpoint)

loaded_model = GBTPipeline.load(model_checkpoint)

# print the cv_results_ again to ensure the checkpointing works
pd.DataFrame(loaded_model.model_tuned_.cv_results_)