Path: blob/master/big_data/h2o/h2o_api_walkthrough.ipynb
H2O API Walkthrough
General Setup
The first question you might be asking is why H2O instead of scikit-learn or Spark MLlib.
One reason to prefer H2O over scikit-learn is that it is much more straightforward to integrate H2O models into an existing non-Python system, e.g., a Java-based product.
Performance-wise, H2O is extremely fast and can outperform scikit-learn by a significant margin when working with large datasets. As for Spark, while it is a decent tool for ETL on raw data (which often is indeed "big"), its ML libraries are often outperformed (in training time, memory footprint, and even accuracy) by better tools, sometimes by orders of magnitude. Please refer to the benchmark at the following link for detailed numbers. Github: A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations
In this example, we will be working with a cleaned up version of the Lending Club Bad Loans dataset. The purpose here is to predict whether a loan will be bad (i.e. not repaid to the lender). The response column, bad_loan, is 1 if the loan was bad, and 0 otherwise.
Since the task at hand is a binary classification problem, we must ensure that our response variable is encoded as a factor type. If the response is represented as the numerical values 0/1, H2O will assume we want to train a regression model.
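A minimal setup sketch (the file path is a placeholder; point it at your own copy of the cleaned dataset):

```python
import h2o

# start (or connect to) a local H2O cluster
h2o.init()

# hypothetical path to the cleaned Lending Club dataset
loans = h2o.import_file('loan.csv')

# encode the 0/1 response as a factor so H2O treats this as
# classification rather than regression
loans['bad_loan'] = loans['bad_loan'].asfactor()
```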
Next, we perform a three-way split:
70% for training
15% for validation
15% for final testing
We will train the model on one set and use the others to test its validity, ensuring that it can predict accurately on data the model has not been shown, i.e., that our model is generalizable.
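A sketch of the split (the seed value is arbitrary, used only for reproducibility):

```python
# split_frame takes the first two ratios; the remaining 15% becomes the test set
train, valid, test = loans.split_frame(ratios=[0.70, 0.15], seed=42)
```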
Here, we extract the column names that will serve as our response and predictors. This information will be used during the model training phase.
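One way to do this, assuming every column other than the response is used as a predictor:

```python
response = 'bad_loan'

# use all remaining columns as predictors
predictors = [col for col in loans.columns if col != response]
```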
Modeling
We'll now jump into the model training part; here Gradient Boosted Machine (GBM) is used as an example. We will set the number of trees to 500. Increasing the number of trees in an ensemble tree-based model like GBM is one way to improve its performance; however, we have to be careful not to overfit our model to the training data by using too many trees. To automatically find the optimal number of trees, we can leverage H2O's early stopping functionality.
There are several parameters that could be used to control early stopping. The three that are generic to all the algorithms are: stopping_rounds, stopping_metric and stopping_tolerance.
stopping_metric is the metric by which we'd like to measure performance, and we will choose AUC here. score_tree_interval is a parameter specific to tree-based models, e.g., setting score_tree_interval=5 will score the model after every five trees.
The parameters we specify below say that our model will stop training after three scoring intervals in which the AUC has not improved by more than 0.0005. Since we have specified a validation frame, the stopping tolerance will be computed on validation AUC rather than training AUC.
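A sketch of the model definition and training call, using the parameter values described above:

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    ntrees=500,
    score_tree_interval=5,      # score the model after every 5 trees
    stopping_rounds=3,          # stop after 3 scoring events without improvement
    stopping_metric='AUC',
    stopping_tolerance=0.0005,  # minimum AUC improvement to keep training
    seed=42)

# supplying a validation frame makes early stopping track validation AUC
gbm.train(x=predictors, y=response,
          training_frame=train, validation_frame=valid)
```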
To examine the scoring history, use the scoring_history method on a trained model. When early stopping kicks in, we see that training stopped before the full 500 trees. Since we also used a validation set, both training and validation performance metrics are stored in the scoring history object. We can take a look at the validation AUC to observe that the correct stopping tolerance was enforced.
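For example (the exact column names in the scoring history can vary slightly across H2O versions):

```python
# scoring_history() returns a pandas DataFrame
history = gbm.scoring_history()
print(history[['number_of_trees', 'training_auc', 'validation_auc']].tail())
```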
Hyperparameter Tuning
When training machine learning algorithms, we often wish to perform a hyperparameter search. Rather than manually training our model with different parameters one by one, we can make use of H2O's Grid Search functionality. H2O offers two types of grid search -- Cartesian and RandomDiscrete. Cartesian is the traditional, exhaustive grid search over all combinations of model parameters in the grid, whereas Random Grid Search samples sets of model parameters randomly for a specified amount of time (or maximum number of models).
If we wish to specify model parameters that are not part of our grid, we pass them along to the grid via the H2OGridSearch.train() method. See example below.
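A sketch of a Cartesian grid search (the hyperparameter values below are illustrative, not tuned):

```python
from h2o.grid.grid_search import H2OGridSearch

# grid of hyperparameters to search over
hyper_params = {'max_depth': [3, 5, 7],
                'learn_rate': [0.01, 0.05, 0.1]}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                     grid_id='gbm_grid',
                     hyper_params=hyper_params)

# parameters that are not part of the grid (e.g. ntrees and the early
# stopping settings) are passed along via train()
grid.train(x=predictors, y=response,
           training_frame=train, validation_frame=valid,
           ntrees=500, score_tree_interval=5,
           stopping_rounds=3, stopping_metric='AUC',
           stopping_tolerance=0.0005, seed=42)
```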
To compare the model performance among all the models in a grid, sorted by a particular metric (e.g. AUC), we can use the get_grid method.
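For example:

```python
# sort the models in the grid by validation AUC, best model first
sorted_grid = grid.get_grid(sort_by='auc', decreasing=True)
print(sorted_grid)
```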
Instead of running an exhaustive Cartesian grid search, the example below shows the code modification needed to run a random search.
In addition to the hyperparameter dictionary, we need to specify search_criteria with the strategy 'RandomDiscrete' and a max_models value, which is equivalent to the number of iterations to run the random search for. This example is set to run fairly quickly; we can increase max_models to cover more of the hyperparameter space. We can also expand the hyperparameter space of each algorithm by modifying the hyperparameter list below.
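A sketch of the modification (max_models=10 is an arbitrary small number chosen so the example runs quickly):

```python
# the hyperparameter dictionary stays the same; only search_criteria is added
search_criteria = {'strategy': 'RandomDiscrete',
                   'max_models': 10,  # number of random combinations to try
                   'seed': 42}

random_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                            grid_id='gbm_random_grid',
                            hyper_params=hyper_params,
                            search_criteria=search_criteria)

random_grid.train(x=predictors, y=response,
                  training_frame=train, validation_frame=valid,
                  ntrees=500, score_tree_interval=5,
                  stopping_rounds=3, stopping_metric='AUC',
                  stopping_tolerance=0.0005, seed=42)
```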
Lastly, let's extract the top model from the grid, as determined by the validation data's AUC score, and use it to evaluate performance on the test set, so we get an honest estimate of the top model's performance.
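A sketch of this final evaluation step:

```python
# models are sorted by validation AUC, so the first one is the best
best_model = random_grid.get_grid(sort_by='auc', decreasing=True).models[0]

# score on the held-out test set for an honest performance estimate
test_performance = best_model.model_performance(test)
print(test_performance.auc())
```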
Model Interpretation
After building our predictive model, we often would like to inspect which variables/features contributed the most. This interpretation process allows us to double-check that what the model is learning makes intuitive sense and enables us to explain the results to non-technical audiences. With h2o's tree-based models, we can access the varimp attribute to get the top important features.
We can return the variable importance as a pandas dataframe. Hopefully the table makes intuitive sense: the first column is the feature/variable and the rest of the columns show the feature's importance on different scales. We'll be working with the last column, where the feature importance has been normalized to sum to 1. Note that the results are already sorted in decreasing order, so the more important the feature, the earlier it appears in the table.
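For example:

```python
# variable importances as a pandas DataFrame, sorted in decreasing order;
# columns: variable, relative_importance, scaled_importance, percentage
importance = gbm.varimp(use_pandas=True)
print(importance.head())
```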
We can also visualize this information with a bar chart.
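H2O ships a built-in plot for this (the number of features shown is an arbitrary choice):

```python
# bar chart of the top 10 most important features
gbm.varimp_plot(num_of_features=10)
```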
Another feature-interpretation functionality we can leverage in h2o is the partial dependence plot. Partial dependence plots are a model-agnostic machine learning interpretation tool; please consider walking through another resource if you're not familiar with how they work.
We use the .partial_plot method on an h2o estimator, in this case our gbm model, and pass in the data and the column we wish to calculate the partial dependence for as input arguments. The result we get back is the partial dependence table. To make the process easier, we can leverage a helper function on top of the h2o .partial_plot method so we also get a visualization for ease of interpretation.
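A sketch of the underlying call, assuming the income and term columns are named annual_inc and term (computed on the training frame here; the helper function mentioned above is not reproduced):

```python
# partial dependence tables (and plots, with plot=True) for two features
pdp = gbm.partial_plot(data=train, cols=['annual_inc', 'term'], plot=True)
```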
Based on the visualization above, we can see that the higher the annual income, the lower the likelihood that the loan will be bad; beyond a certain point, however, this feature has no further effect on the model's prediction.
As for the feature term, we can see that people with a 36-month loan term are less likely to have a bad loan than those with a 60-month term.