Lesson 08: Hyperparameter Optimization for Machine Learning (Project)
Some code in this notebook runs pretty slowly in CoCalc. If you have the ability, you might wish to run this notebook locally or on a more powerful remote machine.
Introduction
Fitting a model in machine learning is an optimization problem. In a previous lesson we saw how logistic and linear regression use optimization to find the regression model coefficients to minimize the difference between observed and predicted values of the response variable.
Most machine learning models also come with a bunch of parameters that need to be set which can alter the fit of the model. For example, here is the `LogisticRegression` class from scikit-learn (`sklearn`):
Some of these parameters have to do with exactly what model is fit. For instance, `penalty` changes the form of regularization added to the objective function to prevent overfitting, while `C` changes the strength of the regularization (larger `C` is less regularization). These extra parameters are usually called hyperparameters, and to get the best model they often need to be tuned. This tuning is another kind of optimization and is usually called "hyperparameter optimization" or "hyperparameter tuning". This is a hot area and a little searching with Google will yield a ton of results. Here is one article that gives an overview of hyperparameter tuning methods (but gets a bit technical at the end).
To keep everything straight it helps to remember that model parameters and hyperparameters are different. Hyperparameters are set or determined before the model is fit. Model parameters are determined during the process of fitting the model to the data.
Here are four kinds of hyperparameter optimization we'll explore:
- Grid Search: choose a list of possible values for each hyperparameter and loop over all the combinations of hyperparameters while fitting a model for each combination. Return the combination that gives the best model performance. We'll use `GridSearchCV` from the `sklearn` package.
- Random Search: choose a list of possible values, or a probability distribution, for each hyperparameter. Choose combinations at random and return the combination that gives the best model performance. We'll use `RandomizedSearchCV` from the `sklearn` package.
- Bayesian Optimization: after each iteration an approximate model of the objective function, called a surrogate model, is built and used to suggest a better set of hyperparameters for the next iteration. This is really useful for objective functions that are expensive to evaluate and for which there is uncertainty or noise in the function values, as there always will be when we sample data at random to feed into a machine learning model. We'll use `BayesSearchCV` from the `Scikit-Optimize` (skopt) package. Other popular Python packages include `HyperOpt` and `GPyOpt`. We aren't going to study the details of Bayesian Optimization in this class, but there are a number of tutorials on the topic. I think this article is an especially good place to start if you want to learn more about the details. In particular, the section called "Implementation with NumPy and SciPy" walks through the process of maximizing a simple function.
- Genetic Algorithms: a population of possible hyperparameter sets is evolved using selection, crossover, and mutation with the goal of identifying hyperparameters that yield trained models with the best performance. We'll use the TPOT package. We'll also see that the TPOT package is really an auto machine learning package because it can pick between multiple models using the genetic algorithm.
There are other hyperparameter optimization approaches, but Bayesian Optimization is currently the most popular. Work through this presentation to get an idea of how hyperparameter optimization works!
If you haven't taken DS740 - Data Mining and Machine Learning yet, you can still follow along and do the assignment, but you might wish to revisit this assignment after you've completed DS740. We're going to walk through the four methods of hyperparameter optimization using a problem involving regression. Your assignment will be to apply the same methods to the classification problem of predicting loan defaults using a subset of the project data from the DS705 - Statistical Methods class.
But first... The Regression Problem
Before we try to understand hyperparameter optimization, let's first explore the data we'll be using and the type of machine learning we'll be doing.
We'll be using a pre-processed diamonds dataset. You can read more about the variables in the diamond dataset in this blog post. We have used Pycaret to pre-process the data and output it as a CSV file stored in the data directory. First we'll read in the data.
Our objective is to predict the target variable (Price) from the other 40 features.
Set up the data
In a full machine learning analysis you would want to explore the data, do feature selection, and possibly feature engineering including variable rescaling. Since you are not required to know those concepts before this course, we've done that for you. We're going to focus only on hyperparameter optimization.
We've already loaded the data, but we'll grab our features, `X`, and our target variable, `y`. We could just fit a model to all of the data, but we don't want a model that's just memorized the data and can't generalize to new data (overfitting). So we usually divide the data into test and training sets. The training set is used to fit a model, while the test set is used to validate the result. Typically around 20% of the data is randomly selected for the test set. We set the random number seed in `train_test_split` for reproducibility.
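As a rough sketch of that setup (the CSV file name, the target column name, and the seed here are assumptions, not the lesson's exact code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file name -- substitute the pre-processed CSV stored in the data directory.
diamonds = pd.read_csv('data/diamonds_preprocessed.csv')

X = diamonds.drop(columns=['Price'])   # the pre-processed features
y = diamonds['Price']                  # the target variable

# Hold out 20% of the rows for testing; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```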
Establishing a baseline
We'll train a few different models just to get an idea of how things work and to establish a baseline. By default these models fit the data by minimizing the mean squared error which is the average squared difference between the predicted target values and the actual target values. A closely related quantity is r-squared which is the proportion of the total variation in the target values that is captured by the predicted values.
Our goal, for each regression model, is to maximize r-squared as measured on the test data.
Linear Regression and sklearn basics
This is the same multiple linear regression model you've learned about in statistics. We saw a bit about fitting a simple linear regression model in Python as part of Lesson 4. Fitting a model in `sklearn` is pretty straightforward.
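A minimal sketch, assuming the `X_train` and `y_train` variables from the split above:

```python
from sklearn.linear_model import LinearRegression

# Create the model object, then fit it to the training data.
linreg = LinearRegression()
linreg.fit(X_train, y_train)
```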
In a basic machine learning pipeline we could look at the model score. This is basically the objective function value being optimized by scikit-learn (or a quantity derived from it). For regression models we usually look at either
- mean squared error: the average squared difference between the true and predicted target values, or
- r-squared: the proportion of the total variation in the target values that is accounted for by the model
Mathematical representation of $R^2$ and MSE:

Definitions:

- $y_i$ = actual $i$th observation
- $\hat{y}_i$ = predicted $i$th observation
- $\bar{y}$ = mean of actual observations
- $n$ = degrees of freedom (or number of observations)

Total sum of squares (proportional to the variance of the data):

$$SS_{\text{tot}} = \sum_{i} (y_i - \bar{y})^2$$

Residual sum of squares, also called the sum of squared residuals:

$$SS_{\text{res}} = \sum_{i} (y_i - \hat{y}_i)^2$$

Mean squared error:

$$\text{MSE} = \frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2$$

The most general definition of the coefficient of determination is:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
For regression, if we compute the score of the model applied to the training data:
This value of r-squared is the same as the value you met in statistics. We interpret this to mean that our linear regression model captures 87.3% of the variation in the training data. However, in a machine learning context we want to know how this model works on new data.
If we apply the score() method to the test data it's no longer the value of r-squared that we learned about in a statistics class. This is because we are evaluating r-squared score on data that was not used to build the model. For example, we can get a negative r-squared number which indicates that the model is performing worse, on the test data, than simply predicting that all of the target values are the same as the average target value.
In short, when we compute the r-squared `score()` for the test data, values near one are great. Values near zero just mean the model isn't really helping, and negative values mean that our model is worse than no model at all. This is the metric we will use to select our regression models, so bigger is better. Here is how we compute it:
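A sketch of that computation, using the `linreg` model from the earlier sketch (the exact variable names in the notebook may differ):

```python
# R-squared style score on the held-out test data -- larger is better.
test_score = linreg.score(X_test, y_test)
print(f'Test score: {test_score:.3f}')
```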
This number doesn't have the same clean interpretation as r-squared in a statistics setting, but it is useful for comparing models if we remember larger scores are better. (Note: if we were looking at mean squared error instead, then smaller would be better.)
Finally we can use our model to make predictions:
From the plot we can see that the model is far from perfect, but it is getting the overall trend right. One thing to note is that it's doing a pretty poor job at predicting large target values.
For a real-world problem we'd want to assess the accuracy of the model predictions on the test data. For regression problems this is often done with the mean squared error or root mean squared error (RMSE):
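A sketch of that calculation (variable names follow the earlier sketches):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Predict on the test set and take the square root of the mean squared error.
y_pred = linreg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Test RMSE: {rmse:.0f}')
```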
You can interpret the rmse to roughly mean that the average error in our predictions is about $3682. Below we condense the code into one cell to make it easier to follow.
Now it's very easy to assess other models by first creating a model object in `sklearn` and then using our custom function `my_regression_results`.
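The notebook's helper isn't reproduced here, but a plausible sketch of what a `my_regression_results`-style function could look like (the lesson's version may differ in its signature and may also plot predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def my_regression_results(model):
    """Report the test-set score and RMSE for a fitted regression model (sketch only)."""
    score = model.score(X_test, y_test)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'Test score: {score:.3f}')
    print(f'Test RMSE: {rmse:.2f}')
```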
Random Forest Regression
Random Forest regression uses an ensemble of decision trees and combines their final output to build a predictive model. If you haven't taken DS740 yet, that's OK. All we really need to know is that a Random Forest regressor is a predictive model with a bunch of hyperparameters that can be changed and often are very influential in the model performance. These models are computationally expensive to train because a typical ensemble uses 10 to 100 or more decision trees.
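A sketch of fitting a default random forest (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest with default hyperparameters and check the test results.
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
my_regression_results(rf_model)
```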
The random forest model, with default parameters, is much better than the linear regression model.
XGBoost
XGBoost is a decision-tree-based ensemble algorithm that uses a gradient boosting framework to produce some pretty fantastic results. We won't go into the details, but this article has a pretty nice description and talks about a variety of decision-tree based algorithms. XGBoost is typically quite a bit faster to train than a random forest algorithm, but can still be computationally expensive.
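A sketch of fitting XGBoost with its default hyperparameters (the seed is an assumption; `xgbr_model` is the name used later in the lesson):

```python
from xgboost import XGBRegressor

# Fit an XGBoost regressor with default hyperparameters and check the test results.
xgbr_model = XGBRegressor(random_state=42)
xgbr_model.fit(X_train, y_train)
my_regression_results(xgbr_model)
```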
Depending on what random seeds and such you're using, you may find that xgboost does better or worse than some of the other models.
Our best model thus far, with the seed we set, is the XGBoost model. Perhaps using the default hyperparameter values for training the other models wasn't the best choice. In what follows, we'll try to improve the XGBoost model by tuning its hyperparameters.
Estimating the model score without test data
We'd like to optimize the scores of our models when applied to data the model hasn't seen. However, the model doesn't see the test data during model training. To estimate the test data model score we apply k-fold cross validation to the model training process.
This technique is taught in DS740 and you can learn more in this article. Basically the training data is divided into k subsets and the subsets are used to build models. The scores from these models are averaged to estimate the score when applied to unseen data. Cross validation is used by all of our hyperparameter optimization algorithms to estimate the model score and this estimated score is what we try to optimize.
We won't really have to do cross-validation directly, but this bit of code shows how we could do it using `sklearn`. Here we are estimating the test error of our XGBoost model using 3-fold cross-validation (3 is the default number of folds for `cross_val_score`, but 5 is more commonly used for hyperparameter optimization):
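A sketch of that cross-validation call (for regressors, `cross_val_score` uses the model's own r-squared style scorer by default):

```python
from sklearn.model_selection import cross_val_score

# Estimate the unseen-data score of the default XGBoost model with 3-fold CV.
cv_scores = cross_val_score(xgbr_model, X_train, y_train, cv=3)
print(cv_scores)          # one score per fold
print(cv_scores.mean())   # averaged estimate of performance on unseen data
```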
Here is the important bit about cross-validation for estimating model performance: the k-fold cross-validated model score is the quantity we optimize in hyperparameter optimization.
For regression problems we are usually minimizing the k-fold cross-validated mean squared error. For classification problems we maximize the k-fold cross-validated accuracy, where accuracy is the total number of correct classifications divided by the total number of classifications. The number of folds used is commonly $k = 5$ or $k = 10$, but we'll mostly use just $k = 3$ to speed things up for learning purposes.
Hyperparameter Optimization applied to XGBoost Regression
We'll focus on the XGboost model for regression because it's a pretty amazing model. If you haven't heard about it, then try to Google a bit or check out this article to learn more. The XGboost model has many hyperparameters:
Fortunately the default values shown above work pretty well in many problems. Some of the hyperparameters don't directly change the model, like `nthread` and `verbosity`. Of the rest we'll pick a subset to optimize. Some commonly optimized parameters are `n_estimators`, `max_depth`, `learning_rate`, `subsample`, and `min_child_weight` (these are the same ones that are optimized in the `TPOT` package). Two other hyperparameters linked to regularization terms are `reg_lambda` and `reg_alpha`, which can be useful to prevent overfitting.
The table below lists some typical values:
| Hyperparameter   | Typical Range |
|------------------|---------------|
| n_estimators     | 10 to 150     |
| max_depth        | 1 to 10       |
| min_child_weight | 1 to 20       |
| learning_rate    | 0.001 to 1    |
| subsample        | 0.05 to 1     |
| reg_lambda       | 0 to 5        |
| reg_alpha        | 0 to 5        |
Of course, we could throw more hyperparameters into the mix, but we'll keep the number down to these seven so we can afford to experiment.
Using GridSearchCV
The idea behind grid search is to pick a list of potential values for each hyperparameter and then search all the possible combinations, doing a k-fold cross-validation for each combination. That means we have to do $k \times$ (number of combinations) model fits. If we were to run this code with a full set of candidate values, the number of combinations multiplied by the number of folds would come to a total of 40,000,000 model fits. Even if we could fit 10 models per second, it would still take about 46 days to try all the models. Nevertheless, GridSearchCV is commonly used by trimming the number of possible values for each parameter to get something manageable.
As you can see, the lists or arrays of values for the hyperparameters are stored in a dictionary. The number of cross-validation folds is set by `cv = 3`. (`cv = 5` is better, but we've selected 3 for reduced run time.)
To illustrate how this works we'll pick fewer values for each hyperparameter, as shown in the next cell, but we are still doing several hundred model fits, which takes several minutes to run on our computers.
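A sketch of what the reduced grid search could look like; the specific value lists below are illustrative assumptions, not the lesson's exact grid:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# A trimmed-down grid (illustrative values).
param_grid = {
    'n_estimators':     [50, 100, 150],
    'max_depth':        [3, 5, 7],
    'learning_rate':    [0.01, 0.1, 0.3],
    'subsample':        [0.5, 1.0],
    'min_child_weight': [1, 10],
}

# 3-fold cross-validation for every combination in the grid.
grid_search = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                           cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)   # best hyperparameter combination found
```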
The best hyperparameter values are stored in the grid_search object as a dictionary:
To see if this optimized model is better than the default XGBoost model let's apply it to the test data:
This is a nice improvement over the default XGBoost regression model. Our tuned model now performs better than the linear regression model we saw above.
The main drawback to grid search is that it can get really expensive if we want to exhaustively search, particularly if the model fits are slow as they can be when large datasets with many predictors are used.
Before we get too far, let's create a way to track our results. In the next cell, we have a function that will track the approach we're using, the best hyperparameters, and the RMSE. Note that we call it twice: once to enter the results from our default XGBoost model and once to add the results from our grid search approach. We use `show=False` the first time so that we only output a single dataframe of results. The function returns our current dictionary of results, which we pass back in each successive time we call it.
Using RandomizedSearchCV
If we can only afford to fit the model a limited number of times, then one approach is to search randomly instead of exhaustively. To use `RandomizedSearchCV` we can either specify a probability distribution for each hyperparameter, or we can specify a list of values for the hyperparameter, in which case a value is chosen from the list assuming all values in the list are equally probable.
For optimizing our XGBoost model we'll leave the learning rate as a list, since we want more small values to choose from than large values. The other hyperparameters can be specified with distributions. Note that the uniform distribution specified below is not intuitive: `uniform(loc, scale)` is uniform on the interval `[loc, loc+scale]`. For the search below we're going to check just 25 randomly selected sets of hyperparameters, as we might for a really expensive model. `random_state = 8675309` is a random number seed for reproducibility. Change it and you'll get different results.
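A sketch of that random search; the distributions are illustrative assumptions, but the 25 iterations and the `random_state = 8675309` seed match the text:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Learning rate stays a list; the other hyperparameters get distributions.
param_dist = {
    'learning_rate':    [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0],
    'n_estimators':     randint(10, 150),
    'max_depth':        randint(1, 10),
    'min_child_weight': randint(1, 20),
    'subsample':        uniform(0.05, 0.95),   # uniform on [0.05, 1.0]
    'reg_lambda':       uniform(0, 5),
    'reg_alpha':        uniform(0, 5),
}

random_search = RandomizedSearchCV(XGBRegressor(random_state=42), param_dist,
                                   n_iter=25, cv=3,
                                   random_state=8675309, n_jobs=-1)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
```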
The best hyperparameters found:
The random search in this case did not find a very good fit. Note that it did not choose any of the default parameters. The default parameters outperformed those chosen by the random search.
Bayesian Optimization with Scikit-Optimize
Scikit-Optimize is an optimization package built on top of scikit-learn. It allows you to tune your hyperparameters using Bayesian optimization, as well as do some visualization of your optimization results.
On average Bayesian optimization does better than random search. It especially excels when there are lots of hyperparameters, but it won't beat random search every time. The power of Bayesian optimization is that it can often achieve good results with a relatively small number of training iterations.
The setup is quite similar to random search, with just slight differences in how we set up our parameters, called dimensions in the documentation (a short sketch follows the list below):
- for lists of values, we need to wrap the list in the `Categorical()` function
- when we want a range of integers optimized, we just pass the low and high values
- when we want a range of real numbers, we pass the low and high, making sure to use floats or to wrap them in `Real()`
- when we know the distribution, we can pass the name of the distribution along with the low and high (the default is uniform)
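A sketch of setting up `BayesSearchCV` with those dimension types (the specific ranges and the 25 iterations are illustrative assumptions):

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer, Real
from xgboost import XGBRegressor

# Search dimensions: a Categorical list, Integer ranges, and Real ranges.
search_spaces = {
    'learning_rate':    Categorical([0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0]),
    'n_estimators':     Integer(10, 150),
    'max_depth':        Integer(1, 10),
    'min_child_weight': Integer(1, 20),
    'subsample':        Real(0.05, 1.0),                       # uniform by default
    'reg_lambda':       Real(0.01, 5.0, prior='log-uniform'),  # named distribution
    'reg_alpha':        Real(0.01, 5.0, prior='log-uniform'),
}

bayes_search = BayesSearchCV(XGBRegressor(random_state=42), search_spaces,
                             n_iter=25, cv=3,
                             random_state=8675309, n_jobs=-1)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_)
```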
Depending on the random seeds you've used you might see that Bayesian Optimization does better or worse than other methods. (Try a few different seeds and compare.) For the particular seed that we set, Bayes didn't do as well as grid search, and used far fewer fits (75 vs 384).
Both random search and Bayesian optimization will give better results if they're allowed to run for more iterations. Bayesian optimization doesn't always beat random or grid search, but common wisdom suggests that it usually works better - see this Stack Exchange post for some nice discussion and references. Moreover companies are getting into automatic machine learning in a big way and some of the giants, like Google, are betting on Bayesian Optimization.
Using a Genetic Algorithm from TPOT
We really don't need to do much with genetic algorithms to use TPOT, though we can change the usual parameters like population size, mutation probability, and crossover probability. The software authors recommend leaving the probabilities at their default values - the documentation is here.
TPOT can actually do much more than optimize hyperparameters for a single model, but it can do that too. To focus on a single model we set up a nested dictionary like the one shown in the code below. Then we call TPOT and it returns an optimized model in an object that behaves just like the objects returned by `GridSearchCV` and `RandomizedSearchCV`. Additional models could be added as `'model_name':{'param':values, 'param':values, ...}`.
We've found that we generally need more model fits to get good results with TPOT than we did with Bayesian Optimization, but it still works really well. Note that TPOT is maximizing the k-fold cross-validated negative mean squared error instead of r-squared, but it gets us to the same place. Here we iterate for 10 generations with 10 different individual sets of hyperparameters in each generation. For each individual we do 3 model fits (k = 3), and there is an extra round of cross-validated fits for the initial population, so altogether we perform about $(10 + 1) \times 10 \times 3 = 330$ model fits.
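A sketch of restricting TPOT to XGBoost with a nested config dictionary; the hyperparameter value lists and the export file name are illustrative assumptions, while the generations, population size, and cv match the text:

```python
from tpot import TPOTRegressor

# Only one model in the config dict, so TPOT tunes just XGBRegressor.
tpot_config = {
    'xgboost.XGBRegressor': {
        'n_estimators':     [50, 100, 150],
        'max_depth':        range(1, 11),
        'learning_rate':    [0.001, 0.01, 0.1, 0.5, 1.0],
        'subsample':        [0.5, 0.75, 1.0],
        'min_child_weight': range(1, 21),
    }
}

tpot = TPOTRegressor(generations=10, population_size=10, cv=3,
                     config_dict=tpot_config, random_state=8675309,
                     verbosity=2)
tpot.fit(X_train, y_train)

# TPOT can export the best pipeline it found to a Python file for later use.
tpot.export('tpot_xgb_pipeline.py')
```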
One nice feature of TPOT is that it can write the optimized model to a file which you can use elsewhere.
We can display the results on the test data in the same way as with our other models.
To track our best parameters, you'll need to make a dictionary based on the "Best pipeline" printed out from tpot. Unfortunately, we have not found a way to capture this programmatically. But it's pretty easy to cut and paste.
TPOT is the best so far (but that can change with seeds and new versions of the packages, so if you run this notebook you may see some differences!)
AutoML with TPOT
We've actually used TPOT in a rather narrow way by forcing it to optimize the hyperparameters for one choice of a machine learning model. However, TPOT is really designed as an auto machine learning (AutoML) tool that tries to optimize the whole machine learning pipeline: data preprocessing, feature selection, model selection, and hyperparameter tuning. For real problems this process could take days (see the TPOT discussion of AutoML). For this toy problem it doesn't take too long, so let's see what it does. In practice you would want to run the optimization as long as possible by increasing the number of generations and the population size.
By specifying `None` for the `config_dict` parameter, TPOT defaults to optimizing the whole machine learning pipeline. We'll turn it loose with a population size of 20 for 5 generations and a cv of 3. This ran in about 5 minutes on CoCalc. (This is far fewer generations and a much smaller population size than would be desirable in the real world.)
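A sketch of that AutoML run; checking the winning pipeline's RMSE on the test set, as shown here, is one reasonable way to evaluate the result:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from tpot import TPOTRegressor

# config_dict=None lets TPOT search its full default set of models and preprocessors.
tpot_automl = TPOTRegressor(generations=5, population_size=20, cv=3,
                            config_dict=None, random_state=8675309,
                            verbosity=2)
tpot_automl.fit(X_train, y_train)

# Evaluate the winning pipeline on the held-out test data.
y_pred = tpot_automl.predict(X_test)
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
```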
In this instance, TPOT chose an ExtraTreesRegressor. Since the parameters are different for that model, we can't just include the parameters in our results the way we have before. But, we can still track our RMSE and fits.
Be wary of making statements like "the best model is the one from TPOT" because the exact values of RMSE are likely to change if you run with different seeds, allow more fits, etc. Rather, view these hyperparameter techniques as a way to automate the process of finding hyperparameters to achieve a good model.
Understanding the details of nested models and such isn't important here and we don't recommend blindly using AutoML of any sort, but TPOT can provide good starting points and suggestions for models to investigate further. We're eager to try other AutoML tools to see how they work.
Summary
After exploring several different hyperparameter optimization tools, we found that all of them improved the `XGBRegressor` model by varying amounts. Looking just at the hyperparameter optimization of the `xgbr_model`, we found that `GridSearchCV` was the most expensive with 960 model fits, but it found a very good model. `RandomizedSearchCV` and `BayesSearchCV` used 125 and 175 model fits, respectively, and Bayesian Optimization identified the model with the lowest MSE on the test data. However, be careful before concluding that Bayesian Optimization outperforms Random Search. If you change the random number seeds you'll get different results, and Bayesian Optimization will not always be the winner. That said, the consensus is that it works better than Random Search on average.
The last thing we ran was an AutoML experiment with TPOT, which used a genetic algorithm to search over many different models and hyperparameter choices. The model it identified was a pretty crazy nested model that had a slight performance boost over the optimized `xgboost` model. Due to the complexity of that model it might not be the best choice, but it does provide a starting point for other directions to look.
If you're curious to explore further, there are many AutoML tools being developed. Here are a couple of interesting ones with which you might experiment:
AzureML from Microsoft: Check out this really cool tutorial on using AutoML for choosing a regression model for predicting taxi fares. The tutorial uses Python and sklearn so it wouldn't be a stretch to follow along. Moreover, AzureML provides free credits when you sign up. Just make sure to complete the "Clean Up Resources" section at the end of the tutorial so you don't leave anything running that will use up your free credits!
RapidMiner: We don't have personal experience with this one, but we've only heard good things about it and are eager to check it out. It is free for students. RapidMiner's version of AutoML is called Auto Model. You can find a tutorial for predicting survival on the Titanic here.
Pycaret: If you're familiar with Caret in R, Pycaret should excite you. PyCaret is in development, but new tools are being added all the time. Unfortunately, it doesn't work quite right on CoCalc without downgrading scikit-learn. We encourage you to create your own local environment and play with it on your own, though. There are some good tutorials.
If you try any other AutoML tools, please tell us about it on Piazza.
Assignment
Optimize a random forest regression model and an XGBoost classification model by completing the work in the `Lesson_08_Homework.ipynb` notebook.