Hyperparameter Tuning with Ray Tune and Hyperband
One of the steps in training a machine learning model is hyperparameter tuning, and the two most common hyperparameter tuning strategies that we are likely to come across first are grid search and random search.
In this article, we will take a look at how we can perform hyperparameter tuning using Ray Tune, as well as explore another hyperparameter tuning strategy called HyperBand.
We'll be using the xgboost library as well as a sample dataset provided by scikit-learn in this example. There will be no feature preprocessing, as that is not the focus of this post.
We first train a model using the default parameters to get a baseline performance number.
Hyperparameter Tuning
To use hyperparameter tuning with Ray Tune, we need to:
Have a config dictionary, so Tune can choose from a range of valid options.
Use the config dictionary in our model object.
Once we are done training the model, report all the necessary metrics.
For running hyperparameter tuning:
We pass our training function/callable, ray_train, as the first parameter. Here we leverage with_parameters so we can broadcast large objects to our trainable.
We specify additional necessary arguments such as what metrics to optimize for as well as resources, and the hyperparameter tuning config space.
Ray allows us to specify a time budget along with num_samples of -1, which lets us keep sampling and training new configurations indefinitely until the time budget is met.
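To make the time-budget idea concrete, here is a minimal, library-free sketch. It is not Ray Tune's implementation: the names sample_config, toy_objective and random_search, the search space, and the toy loss are all hypothetical stand-ins. It only illustrates the three steps above (a config space, a trainable that uses the config, a reported metric) combined with sampling until a time budget runs out.

```python
import random
import time


def sample_config(space, rng):
    """Draw one configuration; each entry of `space` is a (low, high) range."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


def toy_objective(config):
    """Hypothetical stand-in for training a model and returning a validation loss."""
    return (config["learning_rate"] - 0.1) ** 2 + (config["subsample"] - 0.8) ** 2


def random_search(objective, space, time_budget_s, seed=0):
    """Keep sampling configurations until the time budget is used up."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_config, best_loss, num_trials = None, float("inf"), 0
    # Always run at least one trial, then continue while time remains.
    while num_trials == 0 or time.monotonic() < deadline:
        config = sample_config(space, rng)
        loss = objective(config)  # the metric "reported" for this trial
        num_trials += 1
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss, num_trials


# Assumed search space, for illustration only.
space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_config, best_loss, num_trials = random_search(
    toy_objective, space, time_budget_s=0.05
)
print(num_trials, best_config, best_loss)
```

The number of trials is not fixed in advance; it depends only on how many configurations can be evaluated before the deadline.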
Hyperparameter Tuning with Hyperband
Apart from grid or random search, Ray Tune offers multiple hyperparameter tuning strategies. Here we will be looking at one of them, called Hyperband.
Hyperband can be seen as a successive halving algorithm on steroids that focuses on speeding up configuration evaluation, where a configuration refers to one specific set of hyperparameters. To elaborate, successive halving works by allocating a certain amount of budget to a set of hyperparameter configurations, i.e. it runs each configuration for a few iterations to get a sense of its performance. After that, it allocates more resources to the more promising configurations while discarding the poorly performing ones. This process repeats until one configuration remains. One potential drawback of successive halving is that, given some finite budget B (e.g. training time) and number of configurations n, it is not clear a priori whether we should consider many configurations (large n), each with a small average training time, or the opposite, i.e. consider a small number of configurations (small n), each having a larger average training time. In other words, as practitioners, how do we decide whether we want more "depth" or more "breadth"? Let's now take a look at how Hyperband aims to address this issue:
Looking at the pseudocode above, Hyperband takes in two inputs:
R: The maximum resources that can be allocated to a single configuration, e.g. number of iterations to run the algorithm.
η: Controls the proportion of configurations to be discarded in each round of successive halving.
It then essentially performs a grid search over different possible values of n; associated with each n is a minimum resource r that is allocated to every configuration. Lines 1-2, the outer loop, iterate over different values of n and r, whereas the inner loop, lines 3-9, runs successive halving for the fixed n and r.
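A single successive halving run, for a fixed number of configurations and a fixed minimum resource, can be sketched as follows. This is a simplified illustration rather than the exact inner loop of the pseudocode: the function names, the toy learning curve, and the stopping rule (stop once one configuration remains) are assumptions made for the example.

```python
import random


def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Keep the best 1/eta of the configurations each round and give the
    survivors eta times more resource, until one configuration remains.

    evaluate(config, resource) must return a loss (lower is better).
    """
    survivors = list(configs)
    resource = min_resource
    history = []  # (number of configurations, resource per configuration)
    while True:
        history.append((len(survivors), resource))
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource))
        if len(ranked) == 1:
            return ranked[0], history
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        resource *= eta


def toy_loss(config, resource):
    # Hypothetical learning curve: decays towards config["floor"] as the
    # allocated resource grows.
    return config["floor"] + config["gap"] / resource


rng = random.Random(0)
configs = [{"floor": rng.random(), "gap": rng.random()} for _ in range(27)]
best, history = successive_halving(configs, toy_loss, min_resource=1, eta=3)
print(history)  # [(27, 1), (9, 3), (3, 9), (1, 27)]
```

Each round evaluates fewer configurations with more resource per configuration, which is the trade-off Hyperband then varies across its outer loop.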
The following code chunk provides a vanilla implementation, and returns the resource allocation table.
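A minimal sketch of such an implementation is shown here. It assumes R = 81 and η = 3, values chosen to match the rows discussed in this section, and an integer η of at least 2. The function name hyperband_schedule and the use of s as the row label are illustrative choices. It only computes how many configurations run at each resource level, without training anything.

```python
def hyperband_schedule(max_resource=81, eta=3):
    """Return {s: [(n_i, r_i), ...]}: for each bracket s, the number of
    configurations n_i and the resource per configuration r_i in every
    round of successive halving. Assumes eta is an integer >= 2."""
    # s_max = floor(log_eta(max_resource)), computed with integers to
    # avoid floating point rounding issues.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1

    schedule = {}
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) * eta^s / (s + 1)) via integer arithmetic.
        numerator = (s_max + 1) * eta ** s
        n = (numerator + s) // (s + 1)
        # Minimum resource allocated to each configuration in this bracket.
        r = max_resource / eta ** s
        schedule[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return schedule


schedule = hyperband_schedule(max_resource=81, eta=3)
for s, rounds in schedule.items():
    print(f"s={s}: " + ", ".join(f"n={n} r={r:g}" for n, r in rounds))
```

With these inputs the first row starts at 81 configurations with 1 unit of resource each, and the last row consists of 5 configurations that each receive the full 81 units.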
Notice that in the last row, where every configuration is allocated the full R resources, the setting is essentially performing our good old random search. At the other extreme, in the first row, we are essentially running 81 configurations each for only 1 iteration, then dropping the bottom-performing 2/3 of the configurations. By performing a mix of more exploratory and more exploitative search strategies, Hyperband automatically accommodates scenarios where an iterative training algorithm converges very slowly and requires more resources to show differentiating performance (in these scenarios, we should consider smaller n), as well as the opposite end of the story, where we perform aggressive early stopping to provide massive speedups and scan many different combinations.
To leverage the Hyperband tuning algorithm, we'll use Ray Tune's ASHAScheduler scheduler (recommended over the standard Hyperband scheduler). We will also need to report our model's loss for every iteration back to Tune. Ray Tune already comes with a callback class, TuneReportCallback, that does this without us having to implement it ourselves.
We can retrieve the best config, and re-train our model to check if we get similar performance numbers.
Ray Tune provides different hyperparameter tuning algorithms other than the classic grid or random search. Here we only looked at one of them, Hyperband.
Caveat: If the learning rate is a hyperparameter, smaller values will likely result in inferior performance at the beginning, but may outperform other configurations if given a sufficient amount of time. Hence, Hyperband-like hyperparameter tuning methods might not be able to find the combinations of a small learning rate and many iterations that can squeeze out performance.