Lesson 06: Global Optimization 1
This week we're going to continue our exploration of global optimization by looking at two global optimization algorithms: Simulated Annealing and Bayesian Optimization.
Global Optimization
The goal of global optimization is to find the globally optimal value, which means we want to identify the best possible solution in the entire search space. However, for many problems the search space is too large and/or the function landscape is too complicated to guarantee that the best solution can be found.
A metaheuristic algorithm attempts to find a good solution without any guarantee of being able to find the best solution. Often metaheuristics are stochastic, that is, they incorporate randomness as an element of the search, but they aren't generally completely random. They incorporate search patterns which are known to work well for the problem at hand.
Metaheuristic algorithms try to find a compromise somewhere between randomly searching the search space and local search. People speak of the exploration and exploitation tradeoff. Exploration ensures the algorithm reaches different promising regions of the search space, whereas exploitation ensures the searching of optimal solutions within a given region. The trick is in finding the right balance. Go too far toward exploitation and the algorithm gets stuck in local extrema; go too far toward exploration and the algorithm will waste time on solutions that are less likely to be good and ignore the information already gathered.
Unfortunately, there is no single algorithm which works best for all classes of problems. This is often referred to as a "no free lunch theorem" in optimization. In this lesson, we focus on Simulated Annealing and Bayesian Optimization.
Simulated Annealing
Think of simulated annealing as an enhanced local search that allows some moves that don't improve the objective function value, in the hope of eventually climbing to a better optimum.
In a hill-climbing local search we only allow moves that increase the objective function value.
Here is our pseudo-code from the previous lesson for Local Search:
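The pseudo-code cell isn't reproduced here, but a minimal sketch of hill-climbing local search in Python, consistent with the description above, looks like this (the `objective` and `random_neighbor` functions are placeholders for whatever problem you're solving):

```python
# Minimal sketch of hill-climbing local search (maximization).
# `objective` and `random_neighbor` are placeholders for problem-specific functions.
def local_search(initial_state, objective, random_neighbor, max_iter=1000):
    state = initial_state
    value = objective(state)
    for _ in range(max_iter):
        new_state = random_neighbor(state)   # propose a candidate move
        new_value = objective(new_state)
        if new_value > value:                # accept only uphill moves
            state, value = new_state, new_value
    return state, value
```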
Simulated annealing is a trajectory-based method for generating a sequence of solutions and is similar to our basic "hill-climbing" local search algorithm. In a strict hill-climbing algorithm we only allow uphill moves, but in simulated annealing we sometimes allow downhill moves and are more likely to allow downhill moves in the early part of the search. The idea is that to find the tallest peak in a mountain range we may have to first descend from a lower peak.
The probability of a downhill move is determined by a temperature parameter that decreases throughout the search. The probability of a downhill move depends on the size of the downhill move compared to the temperature. At high temperatures large and small downhill moves are probable, but as the temperature decreases only small downhill moves are probable, so that the search performs similarly to a local search at low temperatures.
In the simulated annealing algorithm, high temperatures promote exploration (global search) while low temperatures promote exploitation (local search). As the algorithm proceeds, the temperature decreases and the search transitions from exploration to exploitation.
Here is pseudocode for Simulated Annealing:
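The pseudocode cell isn't reproduced here; the sketch below shows the overall structure, including the geometric temperature schedule discussed next. As in the local search sketch above, `objective` and `random_neighbor` are placeholders, and this version maximizes, so downhill moves (negative delta) are accepted with probability exp(delta/T):

```python
import math
import random

# Sketch of simulated annealing (maximization); `objective` and
# `random_neighbor` are placeholders for problem-specific functions.
def simulated_annealing(initial_state, objective, random_neighbor,
                        T0=10.0, alpha=0.999, max_iter=10000):
    state, value = initial_state, objective(initial_state)
    best_state, best_value = state, value
    T = T0
    for _ in range(max_iter):
        new_state = random_neighbor(state)
        new_value = objective(new_state)
        delta = new_value - value
        # always accept uphill moves; accept downhill moves with probability exp(delta/T)
        if delta > 0 or random.random() < math.exp(delta / T):
            state, value = new_state, new_value
            if value > best_value:
                best_state, best_value = state, value
        T *= alpha   # geometric temperature decay
    return best_state, best_value
```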
Choosing the initial temperature and the manner in which the temperature decreases are critical to the performance of simulated annealing. We'll start with a temperature schedule that looks like this: $T_n = \alpha^n T_0$, where $T_0$ is the initial temperature, $\alpha$ is the decay rate, and $n$ is the number of iterations. This is called geometric temperature decay, but many other choices are possible. However, there aren't really rules for how to choose $T_0$ and $\alpha$. Usually we'll choose $T_0$ to be similar to the size of the function to be optimized when it is evaluated at random points, and $\alpha$ should be close to 1, like $\alpha = 0.999$. You may have to experiment with the values of these parameters to achieve good results. The simanneal package introduced below includes some automatic tuning of the temperature schedule that seems to work well for many problems, but it can be much slower than setting a fixed schedule for some problems.
In the next section we'll demonstrate simulated annealing using our own code for the traveling salesman problem, but in general we'll use the simanneal package that will be introduced further below.
Simulated Annealing with TSP (video)
We'll use the seven city example TSP from the textbook. Find the shortest tour (or cheapest cost) to visit all 7 cities and return to the starting city in the following graph:
We'll store all the intercity distances in a two-dimensional list that we call distance_matrix. For cities that aren't connected we'll use the "bigM" method and introduce a distance of 100 between those pairs of cities so that those routes won't be included in the tour. Note that the picture labels the cities 1 through 7, but in Python we'll use 0 through 6. The data is stored in the included json file.
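The data-loading cell isn't reproduced here; a sketch of what it does is below. The file name is hypothetical since the lesson's actual file name isn't shown in this text:

```python
import json
import numpy as np

# Hypothetical sketch: load the 7x7 distance matrix from the included json file.
# Unconnected city pairs use the "bigM" distance of 100.
with open("distance_matrix.json") as f:    # assumed file name
    distance_matrix = json.load(f)

print(np.array(distance_matrix))           # cities are indexed 0 through 6
```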
If you want to really understand how simulated annealing works, the following video includes a walkthrough of the code below.
IMPORTANT NOTE (9/23/2023): I say something backward in the video below when discussing (around 2:20) what to do if delta is positive or negative. If delta is positive that means the new tour is shorter so we should always accept it, if delta is negative the new tour is longer so compute a probability to see if you accept the longer tour. (Thanks for discovering this, Anton.)
Self Assessment: Simulated Annealing for Value Balancing
Revisit the Value Balancing Problem introduced in Lesson 5 (it's the next-to-last section in the lesson). Start with the local search code that is included in the self-assessment solutions and adapt it to do simulated annealing for the Value Balancing problem. First try it on the four item problem and then see how it performs on the 1000 item problem. Do you get good solutions?
Using the simanneal package (video)
The simanneal package is pretty straightforward to use. Using the simanneal package has a couple of advantages over our version of simulated annealing above. First, we don't have to worry about the algorithm framework. Second, we don't have to worry about figuring out a temperature schedule. While it's possible to specify a temperature schedule, it is far easier to use the auto scheduler and specify the approximate amount of time we'd like to wait for a solution.
The package works by making an object of the Annealer class and then calling the anneal method on that object. To set up a problem we have to set three things in our instance of the Annealer class:
the state initializer
the move function that tells the annealing algorithm how to generate new moves
the fitness function (fitness is called energy in this package, and it was called objective in the locsearch package in Lesson 4).
The anneal method appears to always find minima, so you may have to negate your function if you want to find a maximum. The Github page has some short documentation about the simanneal package.
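For orientation, here is a minimal sketch of what an Annealer subclass can look like for the TSP; the class and helper names are ours, and the lesson's actual code appears in the walkthrough below:

```python
import random
from simanneal import Annealer

class TSPAnnealer(Annealer):
    """Sketch of a simanneal setup for the 7-city TSP."""

    def __init__(self, state, distance_matrix):
        self.distance_matrix = distance_matrix
        super().__init__(state)     # the state initializer (sets self.state)

    def move(self):
        # generate a new move by swapping two randomly chosen cities
        i, j = random.sample(range(len(self.state)), 2)
        self.state[i], self.state[j] = self.state[j], self.state[i]

    def energy(self):
        # total tour length, including the leg back to the starting city
        return sum(self.distance_matrix[self.state[k - 1]][self.state[k]]
                   for k in range(len(self.state)))

# usage sketch:
# tsp = TSPAnnealer(list(range(7)), distance_matrix)
# tsp.set_schedule(tsp.auto(minutes=0.2))   # automatic temperature schedule
# best_tour, best_length = tsp.anneal()
```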
The next cell is a walkthrough of the code below.
Temperature Energy Accept Improve Elapsed Remaining
0.49000 63.00 13.57% 0.00% 0:00:13 0:00:00
Self Assessment: Simulated Annealing for Value Balancing with simanneal
Revisit the Value Balancing example from Lesson 5 and solve it using simanneal.
See if your code can find the correct answer to our 4-item/2 group problem first. Even if you don't use the debug flag, use plenty of print statements to help you follow the action at first.
When you have that working, use the following code to generate a larger list of item values. Break it into 4 groups.
Which works better - our implementation or the simanneal package?
Simulated Annealing for Continuous Optimization
Simulated annealing was designed for combinatorial (discrete) optimization problems, but has been adapted to continuous optimization problems. The main issue is how to generate a new move at each iteration. There are many variations, but often the move is selected at random from a suitable probability distribution such as a normal or uniform distribution.
The objective functions we consider here aren't from real applications, instead they're chosen to give you an idea how the algorithm works for difficult optimization problems with many local optima. It's good to have this sort of thing in mind when, for instance, you're trying to train a complicated neural network and have to optimize the weights in the network to find the best fit to your data.
A non-convex 2D example
We found this two-dimensional example in this tutorial on simulated annealing.
Find the minimum value of the "bumpy" function $f(x, y)$ for $-1 \le x, y \le 1$. This function is similar to the Rastrigin function and the global minimum value is $f(0, 0) = 0$. A contour plot, shown below, illustrates that there are many local minima (in the center of many of the small loops; some correspond to local maxima). The tutorial itself is worthy of a look and has a nice flow chart outlining how simulated annealing works.
Here is a 3D plot that makes it easier to see the local minima in the search space. A local search will easily get stuck in the wrong minimum if the initial search point isn't very close to the origin.
To use simulated annealing to optimize a function with continuous variables isn't all that different from how we used it to find a good, or even optimal, tour in the traveling salesman problem.
We're going to rely on the simanneal package in the rest of this lesson because it's much more robust than our "home-brewed" code in the simanneal_tsp() function above. To use the package we'll have to generate an initial state, define the energy() method for returning the objective function value, and define the move() method for making a move from a current state to a new state.
To generate an initial state you could select uniformly distributed random numbers between -1 and 1:
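The cell isn't shown here; a short sketch is:

```python
import numpy as np

# initial state: a random point in the square [-1, 1] x [-1, 1]
init_state = np.random.uniform(low=-1, high=1, size=2)
print(init_state)
```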
We like to write functions for computing the objective function value and for making moves and then call those functions from the move() and energy() methods in the class definition for our problem, as we did in the simanneal package TSP example above.
We already have the objective function from where we made the contour plot:
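The plotting cell isn't reproduced here, and neither is the exact formula, so the sketch below uses an assumed Rastrigin-like "bumpy" function that matches the description above: many local minima and a global minimum of 0 at the origin on $[-1, 1] \times [-1, 1]$:

```python
import numpy as np

# Assumed Rastrigin-like "bumpy" function (the lesson's exact formula isn't
# shown in this text): many local minima, global minimum of 0 at the origin.
def bumpy(xy):
    x, y = xy
    return 0.2 + x**2 + y**2 - 0.1*np.cos(6*np.pi*x) - 0.1*np.cos(6*np.pi*y)

print(bumpy([0.0, 0.0]))   # 0.0 at the global minimum
```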
Making a move will consist of applying two functions successively. The first function adds normally distributed random numbers to each variable while the second function clips values that are out of the bounds. The scale of the move will need to be passed to the first function, while the values of the lower and upper bounds need to be passed to the clipping function. These values will need to be initialized in the __init__ constructor similar to how we worked with the distance matrix in the TSP example. Here are the functions:
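A sketch of the two functions (the names are ours):

```python
import numpy as np

def random_step(xy, sigma):
    # add normally distributed random numbers to each variable
    return xy + np.random.normal(loc=0, scale=sigma, size=len(xy))

def clip_to_bounds(xy, lower, upper):
    # clip values that fall outside the bounds back onto the boundary
    return np.clip(xy, lower, upper)

# example: one move of scale sigma inside [-1, 1] x [-1, 1]
new_xy = clip_to_bounds(random_step(np.array([0.5, -0.5]), sigma=1/3), -1, 1)
```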
Self-Assessment for Simulated Annealing with Continuous Variables
Use the objective and move functions from above to create a class using the simanneal package. Use it to approximate the location of the global minimum for the "Bumpy" function above. You'll notice that you will usually get close to the location of the global minimum at the origin, but it won't be exact because it is very difficult to randomly move exactly to the minimum location. Typically, for continuous functions, simulated annealing is combined with local search in an iterative procedure (we'll see an example of this in the homework with the dual_annealing optimizer from the scipy.optimize package). For this example you could take the best state found by simulated annealing and use it to start a local search using minimize from the scipy.optimize package.
The code from the TSP example above is a good starting point. Instead of the distance matrix you'll need to pass the scale parameter sigma to determine the size of the moves. A good starting point for sigma is range/6 where range = upper bound - lower bound. This value is suggested because a normal distribution is about 6 * sigma wide. You may want to experiment with the value of sigma to see how it affects the result of simulated annealing. You could also play with the time allowed when you set the schedule for the annealing.
Dual_Annealing from scipy.optimize
Simulated annealing works with both continuous and discrete variables. For the case of continuous variables the dual_annealing algorithm in the scipy.optimize package tends to work rather well. It combines simulated annealing for exploration of the search space with local search to enhance the convergence once a good minimum is found. It's pretty easy to use and documentation can be found here. You can also execute the cell below to see the documentation:
Signature:
dual_annealing(
func,
bounds,
args=(),
maxiter=1000,
minimizer_kwargs=None,
initial_temp=5230.0,
restart_temp_ratio=2e-05,
visit=2.62,
accept=-5.0,
maxfun=10000000.0,
seed=None,
no_local_search=False,
callback=None,
x0=None,
)
Docstring:
Find the global minimum of a function using Dual Annealing.
Parameters
----------
func : callable
The objective function to be minimized. Must be in the form
``f(x, *args)``, where ``x`` is the argument in the form of a 1-D array
and ``args`` is a tuple of any additional fixed parameters needed to
completely specify the function.
bounds : sequence or `Bounds`
Bounds for variables. There are two ways to specify the bounds:
1. Instance of `Bounds` class.
2. Sequence of ``(min, max)`` pairs for each element in `x`.
args : tuple, optional
Any additional fixed parameters needed to completely specify the
objective function.
maxiter : int, optional
The maximum number of global search iterations. Default value is 1000.
minimizer_kwargs : dict, optional
Extra keyword arguments to be passed to the local minimizer
(`minimize`). Some important options could be:
``method`` for the minimizer method to use and ``args`` for
objective function additional arguments.
initial_temp : float, optional
The initial temperature, use higher values to facilitates a wider
search of the energy landscape, allowing dual_annealing to escape
local minima that it is trapped in. Default value is 5230. Range is
(0.01, 5.e4].
restart_temp_ratio : float, optional
During the annealing process, temperature is decreasing, when it
reaches ``initial_temp * restart_temp_ratio``, the reannealing process
is triggered. Default value of the ratio is 2e-5. Range is (0, 1).
visit : float, optional
Parameter for visiting distribution. Default value is 2.62. Higher
values give the visiting distribution a heavier tail, this makes
the algorithm jump to a more distant region. The value range is (1, 3].
accept : float, optional
Parameter for acceptance distribution. It is used to control the
probability of acceptance. The lower the acceptance parameter, the
smaller the probability of acceptance. Default value is -5.0 with
a range (-1e4, -5].
maxfun : int, optional
Soft limit for the number of objective function calls. If the
algorithm is in the middle of a local search, this number will be
exceeded, the algorithm will stop just after the local search is
done. Default value is 1e7.
seed : {None, int, `numpy.random.Generator`, `numpy.random.RandomState`}, optional
If `seed` is None (or `np.random`), the `numpy.random.RandomState`
singleton is used.
If `seed` is an int, a new ``RandomState`` instance is used,
seeded with `seed`.
If `seed` is already a ``Generator`` or ``RandomState`` instance then
that instance is used.
Specify `seed` for repeatable minimizations. The random numbers
generated with this seed only affect the visiting distribution function
and new coordinates generation.
no_local_search : bool, optional
If `no_local_search` is set to True, a traditional Generalized
Simulated Annealing will be performed with no local search
strategy applied.
callback : callable, optional
A callback function with signature ``callback(x, f, context)``,
which will be called for all minima found.
``x`` and ``f`` are the coordinates and function value of the
latest minimum found, and ``context`` has value in [0, 1, 2], with the
following meaning:
- 0: minimum detected in the annealing process.
- 1: detection occurred in the local search process.
- 2: detection done in the dual annealing process.
If the callback implementation returns True, the algorithm will stop.
x0 : ndarray, shape(n,), optional
Coordinates of a single N-D starting point.
Returns
-------
res : OptimizeResult
The optimization result represented as a `OptimizeResult` object.
Important attributes are: ``x`` the solution array, ``fun`` the value
of the function at the solution, and ``message`` which describes the
cause of the termination.
See `OptimizeResult` for a description of other attributes.
Notes
-----
This function implements the Dual Annealing optimization. This stochastic
approach derived from [3]_ combines the generalization of CSA (Classical
Simulated Annealing) and FSA (Fast Simulated Annealing) [1]_ [2]_ coupled
to a strategy for applying a local search on accepted locations [4]_.
An alternative implementation of this same algorithm is described in [5]_
and benchmarks are presented in [6]_. This approach introduces an advanced
method to refine the solution found by the generalized annealing
process. This algorithm uses a distorted Cauchy-Lorentz visiting
distribution, with its shape controlled by the parameter :math:`q_{v}`
.. math::
g_{q_{v}}(\Delta x(t)) \propto \frac{ \
\left[T_{q_{v}}(t) \right]^{-\frac{D}{3-q_{v}}}}{ \
\left[{1+(q_{v}-1)\frac{(\Delta x(t))^{2}} { \
\left[T_{q_{v}}(t)\right]^{\frac{2}{3-q_{v}}}}}\right]^{ \
\frac{1}{q_{v}-1}+\frac{D-1}{2}}}
Where :math:`t` is the artificial time. This visiting distribution is used
to generate a trial jump distance :math:`\Delta x(t)` of variable
:math:`x(t)` under artificial temperature :math:`T_{q_{v}}(t)`.
From the starting point, after calling the visiting distribution
function, the acceptance probability is computed as follows:
.. math::
p_{q_{a}} = \min{\{1,\left[1-(1-q_{a}) \beta \Delta E \right]^{ \
\frac{1}{1-q_{a}}}\}}
Where :math:`q_{a}` is a acceptance parameter. For :math:`q_{a}<1`, zero
acceptance probability is assigned to the cases where
.. math::
[1-(1-q_{a}) \beta \Delta E] < 0
The artificial temperature :math:`T_{q_{v}}(t)` is decreased according to
.. math::
T_{q_{v}}(t) = T_{q_{v}}(1) \frac{2^{q_{v}-1}-1}{\left( \
1 + t\right)^{q_{v}-1}-1}
Where :math:`q_{v}` is the visiting parameter.
.. versionadded:: 1.2.0
References
----------
.. [1] Tsallis C. Possible generalization of Boltzmann-Gibbs
statistics. Journal of Statistical Physics, 52, 479-487 (1998).
.. [2] Tsallis C, Stariolo DA. Generalized Simulated Annealing.
Physica A, 233, 395-406 (1996).
.. [3] Xiang Y, Sun DY, Fan W, Gong XG. Generalized Simulated
Annealing Algorithm and Its Application to the Thomson Model.
Physics Letters A, 233, 216-220 (1997).
.. [4] Xiang Y, Gong XG. Efficiency of Generalized Simulated
Annealing. Physical Review E, 62, 4473 (2000).
.. [5] Xiang Y, Gubian S, Suomela B, Hoeng J. Generalized
Simulated Annealing for Efficient Global Optimization: the GenSA
Package for R. The R Journal, Volume 5/1 (2013).
.. [6] Mullen, K. Continuous Global Optimization in R. Journal of
Statistical Software, 60(6), 1 - 45, (2014).
:doi:`10.18637/jss.v060.i06`
Examples
--------
The following example is a 10-D problem, with many local minima.
The function involved is called Rastrigin
(https://en.wikipedia.org/wiki/Rastrigin_function)
>>> import numpy as np
>>> from scipy.optimize import dual_annealing
>>> func = lambda x: np.sum(x*x - 10*np.cos(2*np.pi*x)) + 10*np.size(x)
>>> lw = [-5.12] * 10
>>> up = [5.12] * 10
>>> ret = dual_annealing(func, bounds=list(zip(lw, up)))
>>> ret.x
array([-4.26437714e-09, -3.91699361e-09, -1.86149218e-09, -3.97165720e-09,
-6.29151648e-09, -6.53145322e-09, -3.93616815e-09, -6.55623025e-09,
-6.05775280e-09, -5.00668935e-09]) # random
>>> ret.fun
0.000000
File: /usr/local/lib/python3.10/dist-packages/scipy/optimize/_dual_annealing.py
Type: function
Dual_Annealing Applied to "Bumpy"
Try executing the cell below. Note that you won't get exactly the same answer if you execute the cell repeatedly because of the random nature of simulated annealing. In any case, the global minimum function value is 0 and occurs at x = 0, y = 0. Dual_annealing finds this minimum approximately to within its default convergence settings:
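The cell itself isn't reproduced here; a sketch of the call, using the assumed "bumpy" function from earlier, looks like this:

```python
from scipy.optimize import dual_annealing
import numpy as np

# Assumed Rastrigin-like "bumpy" function (exact formula not shown in this text)
def bumpy(xy):
    x, y = xy
    return 0.2 + x**2 + y**2 - 0.1*np.cos(6*np.pi*x) - 0.1*np.cos(6*np.pi*y)

result = dual_annealing(bumpy, bounds=[(-1, 1), (-1, 1)])
print(result.x, result.fun, result.nfev)   # location, value, number of function evaluations
```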
Notice that the Dual_Annealing algorithm requires many fewer function evaluations than either of the simulated annealing implementations we studied above. This is because the combination of global and local search is especially efficient.
Bayesian Optimization
Bayesian Optimization is becoming quite popular in the machine learning world. The big advantage of Bayesian Optimization is that it learns as it goes, and does so more efficiently than just repeatedly solving whatever your objective function is. It does this by using a surrogate function that's generally much less expensive to evaluate than the objective function is, and using the results of that surrogate function to update the sample data and zero in on the optimal solution. Bayesian Optimization is useful when:
the objective function is very expensive to evaluate
the dimensionality of the problem (the number of input variables) is not too large
the derivative of the objective function is unknown or unavailable (the derivative is a function which reveals the slope of the function)
you are seeking a global minimum instead of a local minimum.
the objective function may include noise so that the function does not return the same value every time it is evaluated
We will see that Bayesian Optimization gets used for Hyperparameter Optimization of machine learning models. Machine learning models can be expensive to evaluate and because they often work with random subsets of the training data they may not produce the same results on every evaluation.
Bayesian Walkthrough Video 1
In this video we talk through the main ideas in Bayesian Optimization and we go through the notebook sections below "Steps in Bayesian Optimization" through "Putting it Together by Iterating". There will be two more videos on Bayesian Optimization further below.
Steps in Bayesian Optimization
We are going to walk through the steps in Bayesian Optimization below. We'll implement each step in Python. Don't worry too much about the exact coding details since we'll ultimately use packages to automate Bayesian Optimization. Instead, just focus on developing a high-level picture of how Bayesian optimization works. If you want more details, there are many tutorials available with a bit of googling. This article gives a high-level view and talks a bit about what makes this optimization approach "Bayesian". The image below illustrates the steps. It's called "Bayesian" because a Bayesian probability framework is used to show how to update the surrogate model to reflect the knowledge gained from the additional sample point.
Define your Objective Function and Evaluate on Initial Sampling Set
The first step in any optimization problem is determining your objective function. Bayesian optimization requires an objective function in which your variables have bounds (a finite limit on each real or integer number or a defined set of values for categorical variables).
For illustration we'll work with an objective function that has one input variable. Suppose we want to minimize a noisy objective $f(x) + \varepsilon$, where $\varepsilon$ is a normally distributed random variable with mean 0 and standard deviation $\sigma_{\text{noise}}$ that represents the theoretical noise level. Setting $\sigma_{\text{noise}} = 0$ gives an ordinary function without added noise.
Here is a graph of the objective function (without noise) to help you visualize:
In practice, we don't know the actual function and since it's very expensive to evaluate we only know it at a limited sample of points. Let's say that we've evaluated the objective function at x = 0,2,4,7,9 to obtain our initial sample of points:
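The sampling cell isn't reproduced here; a sketch, using a placeholder noisy objective since the lesson's exact function isn't shown in this text, is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder noisy objective for illustration only; any smooth 1-D
# function with added normal noise will do here.
def f(x, noise_sd=0.1):
    return np.sin(x) + 0.1 * x + rng.normal(0, noise_sd)

# evaluate the expensive objective at the five initial sample points
X_sample = np.array([0, 2, 4, 7, 9], dtype=float).reshape(-1, 1)
y_sample = np.array([f(x) for x in X_sample.ravel()])
```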
The five magenta points are all the information we currently know about the objective function. Note that they don't lie exactly on the graph of the objective function because we've included a bit of noise. We want to use the information from the sampled points to smartly choose another point that will, hopefully, be closer to the global minimum. We build a model of the function called a surrogate function to help choose that next point.
Building the Surrogate Function
The idea behind the surrogate is that it will be cheap to evaluate compared to the objective function so that we can use it to select an optimal next point to add to our sample points. For Bayesian Optimization the surrogate function is usually a Gaussian Process Regressor. We won't discuss the details of the model here, but it's important to know that in addition to estimating the function values, a Gaussian Process Regressor model also estimates the uncertainty of the function.
We'll use the default parameters for the GaussianProcessRegressor for our surrogate function. This function can issue warnings if we don't have a lot of data at a particular part of the distribution we're sampling, so we'll write a wrapper function to suppress the warnings, and we'll return the standard deviation, which we'll use in the next step.
Like all predictive models, this model needs to be trained. We'll get to that shortly. For now, let's just define our surrogate function.
To complete the surrogate function we still need to declare a Gaussian Process Regressor object and fit it to the sample points:
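The code cells aren't reproduced here; a sketch of the surrogate wrapper and the fit, building on the sample arrays above, is:

```python
import warnings
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def surrogate(model, X):
    # predict the mean and standard deviation, ignoring fit/convergence warnings
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return model.predict(X, return_std=True)

# declare the Gaussian Process Regressor and fit it to the sample points
model = GaussianProcessRegressor()
model.fit(X_sample, y_sample)

mu, std = surrogate(model, np.linspace(0, 9, 100).reshape(-1, 1))
```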
To visualize we plot the actual function, the surrogate function, the sample points, and an envelope around the surrogate function that extends vertically by one (estimated) standard deviation in each direction. Notice that the uncertainty is largest when we are far from the sampled points and smallest near the sampled points.
Building the Acquisition Function
From the surrogate model we'd like to identify another point to add to our sample (another magenta point) where we'll evaluate the actual objective function. To do this we will first construct an acquisition function from the surrogate and then use the acquisition to "acquire" the next point.
Next we need what's called an acquisition function. Remember we said that Bayesian Optimization learns as it goes. It does that by "acquiring" new data points. Each iteration of the algorithm will run the surrogate model and then "acquire" the sample point or points that are most likely to improve the overall results. Again, there are many approaches to determining what the best new points are. Three common approaches are:
Probability of Improvement (PI).
Expected Improvement (EI).
Lower Confidence Bound (LCB).
We'll demonstrate by building the LCB acquisition function. The surrogate model returns both the approximate function value as $\mu(x)$ and the expected variability, the standard deviation, as $\sigma(x)$. The formula for the LCB is $\mathrm{LCB}(x) = \mu(x) - \kappa \, \sigma(x)$. Essentially, the LCB is found by looking $\kappa$ standard deviations below the surrogate function values. We won't discuss the other acquisition functions here, but there are many articles that can help like this one.
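A sketch of the LCB acquisition, using the surrogate wrapper from above (the default value of kappa here is just illustrative):

```python
def lcb_acquisition(model, X, kappa=2.0):
    # lower confidence bound: mu(x) - kappa * sigma(x)
    mu, std = surrogate(model, X)
    return mu - kappa * std
```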
Here are two examples. (We've also defined a helper function for plotting here.)
The LCB acquisition function is the red curve at the bottom of the uncertainty envelope. To acquire the next search point we seek the global minimum of the acquisition function. Note that for the smaller value of $\kappa$ the acquisition (red) and surrogate function graphs are pretty similar, so minimizing the acquisition function gives a point close to the current minimum of the sample. For the larger value of $\kappa$ the new minimum occurs in a different place because a larger $\kappa$ places more emphasis on the uncertainty in the surrogate model. The vertical dashed black lines show the location of the next point identified by minimizing the acquisition function.
Small values of $\kappa$ emphasize local search (exploitation) while large values of $\kappa$ emphasize global search (exploration).
The next step is to acquire the new point by finding or approximating the minimum of the acquisition function. Normally we'd let a Bayesian Optimization package handle the details, but we'll show one way to do it for a one-dimensional function here.
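One simple way to do it is to evaluate the acquisition function on a dense grid over the domain and take the smallest value; this sketch uses the helper functions defined above:

```python
import numpy as np

def acquire_next_point(model, lower=0.0, upper=9.0, kappa=2.0, n_grid=1001):
    # evaluate the LCB on a dense grid and return the location of its minimum
    X_grid = np.linspace(lower, upper, n_grid).reshape(-1, 1)
    lcb = lcb_acquisition(model, X_grid, kappa)
    return X_grid[np.argmin(lcb), 0]

x_next = acquire_next_point(model)
```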
The next step would be to evaluate the actual objective function at the new sample point and then append the point, $x_{\text{new}}$ and $f(x_{\text{new}})$, to the sample. Now go back to the model-building step and repeat the process until happy. We'll show what that looks like for a couple of iterations with a fixed value of $\kappa$. The first cell below initializes the sample data and loads the helper functions. The second cell can be executed repeatedly to perform multiple iterations of Bayesian Optimization.
At this point we've initialized the Gaussian Process regression model using the first five sample points.
Putting it Together by Iterating
Each time you execute the next cell you'll complete one iteration of Bayesian optimization that results in a new value of $x$ being added to the sample (the vertical black dashed line is the location of the next $x$ based on the current data).
Our from-scratch solution is not the best implementation of this algorithm. But we wanted to include it so that you could really step through each of the pieces. Instead of continuing forward and writing our own Bayesian Optimization function we'll rely on packages.
There are several packages that implement Bayesian Optimization in a much more efficient way. We'll focus on Scikit-Optimize since it's designed to work in conjunction with the Scikit-Learn package for machine learning that we'll meet in Lesson 8.
Bayesian Optimization with Scikit-Optimize
The video below describes the material in this section and the next.
Scikit-Optimize expects that you are minimizing an objective function.
The function we need to use to run our Bayesian Optimization is gp_minimize. There are many parameters you can set on this function, so I encourage you to read the documentation to understand all of them. We are only going to set a few:
the function to minimize (which should return a scalar (single number) value)
the dimensions that will be searched
the acquisition function
We can also choose from LCB for lower confidence bound, EI for expected improvement, PI for probability of improvement or gp_hedge, which probabilistically chooses from one of the first three options at every iteration. This is the default option. But, we'll switch to LCB, since that's what we were using above.
the number of evaluations of the objective function (n_calls)
The function returns an "Optimization Result" object that contains the location of the minimum, the function value at the minimum, and several other values. (Don't worry if you see warnings, they seem to be harmless. We've attempted to suppress the warnings in this section.)
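The call itself isn't reproduced here; a sketch for our one-dimensional noisy objective might look like the following, where the bounds 0 to 9 match the sample region above and n_calls is illustrative:

```python
from skopt import gp_minimize

# `f` is the noisy 1-D objective from earlier; gp_minimize passes parameters as a list
def objective(params):
    return float(f(params[0]))

res = gp_minimize(objective,               # the function to minimize
                  dimensions=[(0.0, 9.0)], # bounds for the single variable
                  acq_func="LCB",          # lower confidence bound acquisition
                  n_calls=30,              # number of objective evaluations
                  random_state=1)

print(res.x, res.fun)   # location of the minimum and the function value there
```

The convergence plot discussed next can be produced with plot_convergence(res) from skopt.plots.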
This package also contains some handy plots. Let's plot convergence. This will tell us how quickly the best value was achieved. In this case, we achieved our best result by the 15th iteration.
Bayesian Optimization for Multi-Dimensional Functions
The gp_minimize algorithm can handle multi-dimensional functions, as long as what's returned from your objective function is a scalar value. Let's consider the "bumpy" function from our simulated annealing section above.
Find the minimum value of the "bumpy" function for $-1 \le x, y \le 1$.
The bounds for this function are [-1,1] for both x and y, and the global minimum of 0 is at f(0,0). For plots of the "bumpy" function scroll up to the Simulated Annealing section.
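The cell isn't reproduced here; a sketch using the assumed "bumpy" function from earlier:

```python
from skopt import gp_minimize
import numpy as np

# Assumed Rastrigin-like "bumpy" function (exact formula not shown in this text)
def bumpy(xy):
    x, y = xy
    return 0.2 + x**2 + y**2 - 0.1*np.cos(6*np.pi*x) - 0.1*np.cos(6*np.pi*y)

res = gp_minimize(bumpy,
                  dimensions=[(-1.0, 1.0), (-1.0, 1.0)],  # bounds for x and y
                  acq_func="LCB",
                  n_calls=30,
                  random_state=1)

print(res.x, res.fun)
```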
We find it pretty amazing that Bayesian Optimization finds the global optimum so quickly for a function with so many local optima.
Self-Assessment - Bayesian Optimization of the Booth function
Use Bayesian Optimization (gp_minimize) to estimate the global minimum value of the Booth function, a common optimization test problem: $f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2$ for $x$ and $y$ both in $[-10, 10]$.
Tuning Hyperparameters using Bayesian Optimization
The material in this section is optional but will provide you some background for the Lesson 8 Project on tuning hyperparameters for machine learning models. There won't be any homework problems for the material in this section. You may still wish to read the final section below for a summary of Bayesian Optimization.
This video summarizes the material in this section:
In machine learning the predictive models must be trained to give predictions that match the data as well as possible. Each model has both parameters and hyperparameters. Parameters are coefficients in the model that are determined directly from the data. Hyperparameters are values that affect the model but aren't determined directly from the data. Normally they are supplied by the scientist and sometimes tuned to produce a better fit.
Regression models often include regularization terms that help prevent the model from overfitting (memorizing) the data instead of capturing the underlying trend. We won't talk about the details of regularization here, but it's discussed in the DS740 class (here's an article if you really want to explore more). The strength of the regularization is inversely proportional to a constant $C$ (the regularization parameter). So when $C$ is large there is very little regularization, but when $C$ is small the regularization is large, which forces the logistic regression model to depend mostly on only the important predictor variables. We're going to consider $\log_{10}(C)$, so positive $\log_{10}(C)$ corresponds to large $C$ and negative $\log_{10}(C)$ corresponds to small values of $C$. The value of $C$ (or $\log_{10}(C)$) is not determined by the data but must be specified. Below, we'll show an example of using Bayesian Optimization to determine an optimal value of $C$ to maximize model accuracy. You'll be doing much more of this in Lesson 8 so don't worry too much about the details here.
Predicting Iris Species
In the classic Iris dataset we have four predictor variables corresponding to physical measurements of the iris and a single categorical variable that is the species of the iris. Three species are possible. If you've never seen this data before or want to review some logistic regression then here is a short article that may help.
To get an idea how logistic regression works here we'll load the data, train a model using the default hyperparameter values, and display the confusion matrix. We've increased the maximum iterations for the Logistic Regression model fit to help with convergence.
If the classification were perfect there would be 50's along the diagonal and zeros elsewhere. This confusion matrix indicates that four of the 150 irises were misclassified. The accuracy is $146/150 \approx 97.3\%$. Now we'll see if we can maximize the accuracy by using Bayesian Optimization to optimize the value of the regularization parameter $C$. Instead of optimizing $C$ directly, we'll actually optimize $\log_{10}(C)$ and set $C = 10^{\log_{10}(C)}$. This is because for parameters like this it makes sense to try values spread over several orders of magnitude, like $0.01, 0.1, 1, 10, 100$, instead of evenly spaced values (this Stackexchange post explains it nicely). The accuracy is estimated by finding the average accuracy after 10-fold cross validation. Note, we've defined our function to return the negative average accuracy so that minimizing it corresponds to maximizing accuracy a bit later.
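The cell isn't reproduced here; a sketch of such a function is below. The helper name estimated_accuracy matches the text, and gp_minimize will pass the hyperparameter as a one-element list:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def estimated_accuracy(params):
    # params = [log10(C)]; convert back to C for the model
    C = 10 ** params[0]
    model = LogisticRegression(C=C, max_iter=5000)
    # 10-fold cross-validated accuracy, returned negated for minimization
    return -cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
```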
The estimated accuracy when $C = 1$ (this is $\log_{10}(C) = 0$ and is the default value) is:
Let's see how the estimated accuracy varies as $C$ changes. We'll consider a range of $\log_{10}(C)$ values corresponding to $C$ spanning several orders of magnitude.
It appears from the plot that the maximum estimated accuracy occurs at an intermediate value of $\log_{10}(C)$. We'll minimize the negative accuracy using gp_minimize to find the maximum estimated accuracy and the value of $C$ for which it occurs. If you haven't studied machine learning yet, all you really need to understand is that to tune the hyperparameters we need to optimize a function (estimated_accuracy) of the hyperparameters.
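The optimization cell isn't reproduced here; a sketch of the call, with an illustrative search range for $\log_{10}(C)$, is:

```python
from skopt import gp_minimize

# search log10(C) over several orders of magnitude (the range here is illustrative)
res = gp_minimize(estimated_accuracy,
                  dimensions=[(-5.0, 5.0)],
                  acq_func="LCB",
                  n_calls=30,
                  random_state=1)

best_log10C = res.x[0]
print("best log10(C):", best_log10C, "estimated accuracy:", -res.fun)
```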
In the plot above we're showing negative accuracy, so more negative is good. Let's look at the confusion matrix with the optimized parameter.
While there is not a dramatic difference we can see from the confusion matrix that one more observation was classified correctly than before, so optimizing the hyperparameter produced a small benefit here. Sometimes hyperparameter optimization can make the difference between a good model and a bad one. There is one homework problem where we'll have you optimize a parameter for a Ridge Regression (Linear Regression with overfitting penalty), but we'll explore hyperparameter optimization further in Lesson 8.
Some Final Notes on Bayesian Optimization
Bayesian Optimization is expensive and should only be used if
the objective function is very expensive to evaluate
the dimensionality of the problem (the number of input variables) is not too large
the derivative of the objective function is unknown or unavailable (the derivative is a function which reveals the slope of the function)
you are seeking a global minimum instead of a local minimum.
the objective function may include noise so that the function does not return the same value every time it is evaluated
If your objective function doesn't meet those conditions, then there are probably better optimization methods available like simulated annealing or genetic algorithms (next lesson).
Bayesian Optimization can also be extended to accommodate discrete (integer) and categorical decision variables. We won't study those extensions here, but we'll see them in Lesson 8 where Bayesian Optimization is used to optimize the hyperparameters of machine learning models. We'll see a lot more about this in the Lesson 8 Project.
The animated gif below is kind of fun and illustrates Bayesian Optimization for a maximization problem with two decision variables. Notice how the surrogate function (the Gaussian Process mean) becomes more and more like the actual function as sampling points are added.
This animated gif is from this Bayesian Optimization repository on Github. We haven't tried the Bayesian Optimizer in that repository, but it looks like it could be pretty good.
Interestingly, the best model fit occurs at a value of $C$ different from the default value of $C = 1.0$.