NOTE
If you have trouble with this notebook, make sure you're using the Ubuntu 22.04 Software Environment (Settings -> Right Column) and the Python 3 (system-wide) kernel (upper right corner of notebook).
Lesson 08 Homework - Hyperparameter Optimization (Project)
When asking questions about homework in Piazza please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be HW1.3 question. Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!
For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.
Note: This introduction is not included in the Canvas quiz.
For this project you're going to apply hyperparameter optimization to both a regression and a classification problem. It looks like a lot to do below, but it's mostly a matter of modifying code from the presentation.
Objective
For each of the models in parts 1 and 2 below, apply the following 4 tuning methods from the presentation: GridSearchCV, RandomizedSearchCV, BayesSearchCV, and TPOT.
For TPOT: In Part 1 you will only do hyperparameter optimization for ExtraTreesRegressor. In Part 2 you will do both hyperparameter optimization and also run TPOT and let it choose the model. See the presentation for examples of both.
Specific Quiz Questions
Follow along and use the required parameters and random seeds so that you can correctly answer the quiz questions.
Regarding data
To answer the multiple choice quiz questions, you'll need to use the data we have chosen.
We encourage you to try these out on your own data, too, to deepen your learning.
Part 1 - Optimize Extra Trees Regressor
Hints for Part 1
This section is very similar to the lesson. You should be able to mimic the lesson to finish this section!
Find optimized hyperparameters for an extra trees regression model.
In the lesson, our TPOT AutoML code suggested that a viable algorithm to explore would be the ExtraTreesRegressor. For part 1 of your homework, you'll use sklearn's ExtraTreesRegressor and attempt to optimize the hyperparameters.
You must use the diamonds data used in the presentation. You do not need to include the TPOT general search for this problem (use TPOT to optimize ExtraTreesRegressor, but don't run TPOT to choose a model). Here are ranges for a subset of the hyperparameters:
Hyperparameter | Type | Typical Range |
---|---|---|
n_estimators | discrete / integer | 10 to 150 |
min_samples_split | discrete / integer | 2 to 20 |
min_samples_leaf | discrete / integer | 1 to 10 |
max_features | discrete / integer | 1 to 30 |
bootstrap | discrete / boolean | True, False (use this order where possible) |
Note: there are other hyperparameters that could be added, but we will focus on these for the project. Consult the documentation for sklearn ExtraTreesRegressor to see all of the available hyperparameters.
Question 1: Setup (1 point)
Load the diamonds dataset (diamonds_transformed.csv in the data directory).
Set up your X and y variables.
Split into 80% training data and 20% testing data.
Use random_state = 123 for reproducibility.
Use default values for the hyperparameters by not specifying values for them.
How many rows are in your training data?
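A minimal sketch of the split, assuming scikit-learn's `train_test_split` is the tool used in the presentation. The 1,000-row frame and its column names are synthetic stand-ins, not the diamonds data, so the row counts printed here are not the quiz answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical columns, 1000 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "target": rng.normal(size=1000),
})

X = df.drop(columns="target")   # predictors
y = df["target"]                # response

# 80% train / 20% test, seeded for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(X_train.shape[0], X_test.shape[0])
```

With the real file, `df` would instead come from reading diamonds_transformed.csv, and the target column name would be whatever the presentation uses.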
Question 2 (1 point)
In the following cell, we provide you with the same my_regression_results function we used in the lesson. Create an ExtraTreesRegressor model using random_state=123. Fit your model. Use the my_regression_results function to get the Root Mean Squared Error on the test data.
What is the RMSE (Root Mean Squared Error) using the default hyperparameters?
1875.57
2056.87
9688.00
1833.88
2053.20
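The fit-and-score step in Question 2 can be sketched as follows. This is not the course's my_regression_results function, only an illustration of what computing a test RMSE looks like; the data are synthetic (an assumption made so the block runs on its own), so the printed value has nothing to do with the answer choices above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the diamonds data.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Default hyperparameters: only the seed is specified.
model = ExtraTreesRegressor(random_state=123)
model.fit(X_train, y_train)

# RMSE is the square root of the mean squared error on the test set.
y_pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(f"Test RMSE: {rmse:.2f}")
```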
Hints for 3
Make sure you've got the hyperparameter names spelled correctly or you'll have problems later.
Question 3 (Manually Graded) (2 points)
Modify the track_results function to work with the Extra Trees Regressor hyperparameters. Enter your results based on the default hyperparameters and display the dataframe of results.
In the Canvas quiz, copy your code and provide a screenshot of the output.
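One possible shape for such a tracker is sketched below. The function name `track_results_sketch`, its arguments, and the column names are hypothetical; the track_results function from the lesson may be organized differently. The idea is simply one row per optimization approach, with the tuned hyperparameters pulled from a parameter dictionary.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Hyperparameters tracked in Part 1 (names must match sklearn's exactly).
TRACKED = ["n_estimators", "min_samples_split", "min_samples_leaf",
           "max_features", "bootstrap"]

def track_results_sketch(rows, approach, params, fits, rmse):
    """Append one result row to `rows` and return all rows as a DataFrame."""
    row = {"approach": approach}
    for name in TRACKED:
        row[name] = params.get(name)   # None if the key is missing
    row["fits"] = fits
    row["rmse"] = rmse
    rows.append(row)
    return pd.DataFrame(rows)

rows = []
model = ExtraTreesRegressor(random_state=123)
# The RMSE here is a placeholder; substitute the value you measured.
results = track_results_sketch(rows, "default", model.get_params(),
                               fits=1, rmse=float("nan"))
print(results)
```

A misspelled hyperparameter name shows up as an empty cell in this layout, which makes spelling mistakes easy to spot.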
ExtraTreesRegressor Grid Search
Perform a cross-validated grid search using the following values for your hyperparameter search space.
n_estimators: [50, 100, 150]
max_features: [1, 15, 30]
min_samples_split: [2, 8]
min_samples_leaf: [1, 15]
bootstrap: [True, False]
Use the following setting in your grid search:
cv=5
Note: this may take a while on CoCalc
Be sure to track your results using your track_results function.
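To see why the grid search can take a while, the number of cross-validation fits can be counted directly from the grid above. This is a small arithmetic sketch using only the standard library; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_features": [1, 15, 30],
    "min_samples_split": [2, 8],
    "min_samples_leaf": [1, 15],
    "bootstrap": [True, False],
}
cv = 5

# Every combination of one value per hyperparameter is a candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)      # 3 * 3 * 2 * 2 * 2 = 72
n_fits = n_candidates * cv          # 72 * 5 = 360 cross-validation fits

print(n_candidates, n_fits)
```

With its default `refit=True`, GridSearchCV trains one additional model on the full training set after the search, which is not part of this count.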
Question 4: (2 points)
What is the RMSE of your optimized grid search, rounded to 2 digits?
Question 5: (2 points)
What is the optimal value of max_features chosen by the grid search?
1
15
30
Randomized Search
Use the following values to set up your randomized search space:
n_estimators: random integers between 10 and 150
min_samples_split: random integers between 2 and 20
min_samples_leaf: random integers between 1 and 20
max_features: random integers between 1 and 30
bootstrap: True or False (in that order)
Use the following settings for your randomized search:
n_iter of 25
cv of 5
random_state of 123
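A scaled-down sketch of such a search on synthetic data is shown below. It assumes the upper bounds above are meant to be inclusive, which is why each `randint` stop is one past the stated maximum (scipy's `randint(low, high)` excludes `high`); follow the presentation's code if it does this differently, since the quiz answers depend on the exact ranges. The smaller `n_iter` and `cv` are only there to keep the example quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with 30 features so that max_features up to 30 is valid.
X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

# Assumption: inclusive upper bounds, hence the "+ 1" in each stop value.
param_distributions = {
    "n_estimators": randint(10, 151),
    "min_samples_split": randint(2, 21),
    "min_samples_leaf": randint(1, 21),
    "max_features": randint(1, 31),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=123),
    param_distributions,
    n_iter=5,       # scaled down from 25 to keep the sketch fast
    cv=3,           # scaled down from 5
    scoring="neg_root_mean_squared_error",
    random_state=123,
)
search.fit(X, y)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.2f}")
```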
Question 6 (2 points)
What is the RMSE of your randomized search, rounded to 2 digits?
Question 7 (2 points)
What is the max_features chosen by your random search?
5
7
9
11
13
Bayesian Optimization
For your Bayesian Optimization, we'll use the same ranges we used in random search. You won't need to wrap any of your integer ranges in Integer(), but you will need to use Categorical([True, False]) for your bootstrap parameter.
Use the following values to set up your search space:
n_estimators: integers between 10 and 150
min_samples_split: integers between 2 and 20
min_samples_leaf: integers between 1 and 20
max_features: integers between 1 and 30
bootstrap: Categorical of True, False (in that order)
Use the following settings for your search:
n_iter of 25
cv of 5
random_state of 123
Question 8 (2 points)
What is the RMSE of your Bayesian search, rounded to 2 digits?
Question 9 (2 points)
What is the value of min_samples_leaf chosen by your Bayesian search?
1
3
5
7
9
TPOT
For TPOT, you'll use the following search config values:
n_estimators: each of the following integers - 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150
min_samples_split: all integers between 2 and 20 (inclusive of 20)
min_samples_leaf: all integers between 1 and 20 (inclusive of 20)
max_features: all integers between 1 and 30 (inclusive of 30)
bootstrap: either 0 or 1 ([0, 1] in that order)
Use 5 generations, a population size of 10, and a cv of 3. Use a random state of 123. (Note, this is not nearly enough generations or a big enough population to truly find the best hyperparameters. But, we also don't want you to have to sit and watch it chug through for an hour.) We'll include stacked models here, so do not use template='Regressor' in your configuration dictionary.
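The search values above translate into a configuration dictionary along these lines. This is a sketch of the dictionary only (the name `et_config` is illustrative, and no TPOT call is made); it would typically be handed to TPOT through its `config_dict` argument, as in the presentation.

```python
# One estimator, keyed by its full import path.
# range(start, stop) excludes stop, so each stop is one past the largest value.
et_config = {
    "sklearn.ensemble.ExtraTreesRegressor": {
        "n_estimators": list(range(10, 151, 10)),   # 10, 20, ..., 150
        "min_samples_split": list(range(2, 21)),    # 2 through 20
        "min_samples_leaf": list(range(1, 21)),     # 1 through 20
        "max_features": list(range(1, 31)),         # 1 through 30
        "bootstrap": [0, 1],
    }
}

space = et_config["sklearn.ensemble.ExtraTreesRegressor"]
print({name: len(values) for name, values in space.items()})
```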
Hints for 10-11
In your TPOT configuration you have to get the "long" and "complete" module name: sklearn.ensemble.ExtraTreesRegressor
Question 10 (2 points)
What is the RMSE of your TPOT search, rounded to 2 digits?
Question 11 (2 points)
What is the value of n_estimators chosen by your TPOT search? More than one value may be possible since models can be nested or stacked. Check all possible values for n_estimators that occur in your TPOT pipeline. Report the results of the "inner" model in your tracking results data frame.
60
70
80
90
110
Question 12 (Manually Graded) (2 points)
Take a screenshot of your final results dataframe from the track_results function and upload it. Briefly comment on the results.
Don't forget to comment on your results. Comments should include some information about which method you'd choose and why. Keep in mind that the best method to use depends on how long it takes to fit the model. For a very expensive model (e.g. a large neural network) we might choose Bayesian Optimization, but for a cheap model we can probably afford to do an exhaustive grid search.
Part Two - Loan Classification
In part two, we'll explore optimizing hyperparameters for loan classification.
Notes:
About the data
The first cell below loads a subset of the loans default data from DS705 and your job is to predict whether a loan defaults or not. The status_bad column is the target column and a 1 indicates a loan that defaulted. We have selected a subset of the original data that includes 2000 each of good and bad loans. The data has already been cleaned and encoded.
This is classification, not regression
The score for each model will be accuracy and not RMSE. Your summary table should include accuracy, sensitivity, and precision for each optimized model applied to the test data. (Here is a nice overview of metrics for binary classification data that includes definitions of accuracy and the other metrics.)
Load the Data
In the following cell, we load the data for you and split it into train and test dataframes. Do not change anything in this cell. (Cell not included in Canvas Quiz.)
Question 13 - Display Results Function (Manually Graded) (2 points)
In the next cell, we've demonstrated using the LogisticRegression model to perform classification and generate a confusion matrix.
Hints about Confusion Matrix
You can read more about the confusion_matrix function and a classification report function in the Scikit-learn documentation.
Both are also demonstrated in the extras folder of this lesson.
IMPORTANT: A bad loan is a "positive" in this case, since we are trying to detect bad loans.
Based on the example above, write a function called my_classifier_results modeled after my_regression_results that applies a model to the test data, prints the accuracy, sensitivity, precision, and confusion matrix, and returns the accuracy, sensitivity, and precision. There is no need to make a plot.
Call your function using the logistic regression model we just demonstrated. (Note that your confusion matrix and accuracy should match what is shown above.) Upload the code and a screenshot of the output.
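As a reminder of how the three metrics relate to the confusion-matrix counts when a bad loan (label 1) is the positive class, here is a small standard-library sketch. The helper name `binary_metrics` and the toy labels are hypothetical; this is not the required my_classifier_results function, which also has to apply a model and print the confusion matrix.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall), and precision for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # bad, flagged bad
    tn = sum(t != positive and p != positive for t, p in pairs)  # good, flagged good
    fp = sum(t != positive and p == positive for t, p in pairs)  # good, flagged bad
    fn = sum(t == positive and p != positive for t, p in pairs)  # bad, flagged good

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, sensitivity, precision

# Toy labels: 4 bad loans (1) and 6 good loans (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
acc, sens, prec = binary_metrics(y_true, y_pred)
print(acc, sens, prec)
```

For labels 0 and 1, scikit-learn's confusion_matrix returns the counts laid out as [[TN, FP], [FN, TP]], so the same formulas apply to its entries.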
XGBoost Classifier
The algorithm that we will use to tune hyperparameters is the XGBClassifier algorithm from XGBoost. We've included the hyperparameters we'll tune and their defaults below:
Hyperparameter | Type | Default Value | Typical Range |
---|---|---|---|
n_estimators | discrete / integer | 100 | 50 to 150 |
max_depth | discrete / integer | 3 | 1 to 10 |
min_child_weight | discrete / integer | 1 | 1 to 20 |
learning_rate | continuous / float | 0.1 | 0.001 to 1 |
subsample | continuous / float | 1 | 0.05 to 1 |
reg_lambda | continuous / float | 1 | 0 to 5 |
reg_alpha | continuous / float | 0 | 0 to 5 |
Question 14 - (2 points)
Generate the default XGBClassifier model. Note, you'll need to pass in objective = 'binary:logistic' when you instantiate the XGBClassifier.
What is the accuracy of the default model, rounded to 3 digits?
Tracking Results Function
Create a track_results_classifier function based on the track_results function. You'll be tracking each of the XGBClassifier hyperparameters as well as the name of the optimization approach, accuracy, precision, recall (sensitivity), and number of fits.
Add the results from your default XGBClassifier model to the tracker.
(Note: this is not graded here, but the output will be graded as part of the summary.)
Grid Search for XGBClassifier
Perform a grid search using the following parameters:
learning_rate: [0.01, 0.1],
max_depth: [3, 6],
n_estimators: [10, 100],
subsample: [0.5, 1],
min_child_weight: [1, 20],
reg_lambda: [1, 3],
reg_alpha: [0, 1]
Use the following setting in your GridSearch:
cv = 3
Set the np.random.seed to 123 (done for you in the cell below).
Note: This is a smaller-than-optimal grid, but we don't want you to have to wait forever to process. On CoCalc, this took about 10 minutes to run.
Question 15 (2 points)
What is the accuracy for your Grid Search, rounded to 3 digits?
Question 16 (2 points)
How many fits did your Grid Search do?
384
398
279
400
375
Random Search
Use the following parameters to generate a random search:
learning_rate: any value from the list [0.001, 0.01, 0.1, 0.5, 1.]
max_depth: any random integer between 1 and 10
n_estimators: any random integer between 50 and 150
subsample: uniform(0.05, 0.95)
min_child_weight: any random integer between 1 and 20
reg_alpha: uniform(0, 5)
reg_lambda: uniform(0, 5)
Use the following settings in your random search:
random_state = 123
n_iter = 25
cv = 3
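The `uniform` and random-integer entries above are easy to misread, so here is a short check of what the scipy.stats distributions actually cover, assuming (as is common with RandomizedSearchCV) that scipy.stats is what the presentation uses. `uniform(loc, scale)` spans from `loc` to `loc + scale`, and `randint(low, high)` never returns `high`.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
subsample_dist = uniform(0.05, 0.95)
low, high = subsample_dist.ppf(0.0), subsample_dist.ppf(1.0)
print(low, high)   # lower and upper ends of the sampled interval

# randint(low, high) draws integers from low up to high - 1.
depth_dist = randint(1, 11)
draws = depth_dist.rvs(size=1000, random_state=123)
print(draws.min(), draws.max())
```

So `uniform(0.05, 0.95)` samples subsample values between 0.05 and 1, matching the typical range in the table, and `uniform(0, 5)` samples between 0 and 5.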
Question 17 (2 points)
What is the accuracy for your Random Search, rounded to 3 digits?
Question 18 (2 points)
What is the learning_rate chosen by your Random Search?
.001
.01
.1
.5
1.
Bayesian Optimization
For your Bayesian Optimization, use the following parameters:
learning_rate: Any of the following values - [0.001, 0.01, 0.1, 0.5, 1.] (Hint: you'll need to use Categorical for this one)
max_depth: Any integer between 1 and 10
n_estimators: Any integer between 10 and 150
subsample: Any float between 0.05 and .95
min_child_weight: Any integer between 1 and 20
reg_alpha: Any integer between 0 and 5
reg_lambda: Any integer between 0 and 5
For your call to BayesSearchCV use the following:
random_state = 123
n_iter = 15
cv = 3
Question 19 (2 points)
What is the accuracy for your Bayes Search, rounded to 3 decimal places?
Question 20 (2 points)
What is the precision for your Bayes Search, rounded to 3 decimal places?
Genetic Algorithm from TPOT
First, you'll tune the parameters specifically for the XGBClassifier using TPOTClassifier. This will be very similar to what we did in the lesson, except there we used TPOTRegressor. Use the following parameters in your configuration:
n_estimators: allow the values in the following list [50, 75, 100]
max_depth: allow all values between 1 and 10, inclusive (remember, range does not include the highest number)
learning_rate: use the values in the following list - [1e-3, 1e-2, 1e-1, 0.5, 1.],
subsample: evenly spaced values between .05 and 1, using a step of .05 (remember that you'll need to account for the stop number not being included)
min_child_weight: allow all values between 1 and 20, inclusive
reg_alpha: allow all values between 1 and 5, inclusive
reg_lambda: allow all values between 1 and 5, inclusive
objective: set it to ['binary:logistic']
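The list above maps onto a configuration dictionary roughly like the following. It is a sketch of the dictionary only (the name `xgb_config` is illustrative, and neither TPOT nor XGBoost is imported); the two reminders about excluded stop values are handled by `range(1, 11)` and by stopping `np.arange` slightly past 1.

```python
import numpy as np

xgb_config = {
    "xgboost.XGBClassifier": {
        "n_estimators": [50, 75, 100],
        "max_depth": list(range(1, 11)),            # 1 through 10
        "learning_rate": [1e-3, 1e-2, 1e-1, 0.5, 1.0],
        # Stop slightly past 1 so that 1.0 itself is included.
        "subsample": np.arange(0.05, 1.01, 0.05),
        "min_child_weight": list(range(1, 21)),     # 1 through 20
        "reg_alpha": list(range(1, 6)),             # 1 through 5
        "reg_lambda": list(range(1, 6)),            # 1 through 5
        "objective": ["binary:logistic"],
    }
}

subsample = xgb_config["xgboost.XGBClassifier"]["subsample"]
print(len(subsample), subsample[0], subsample[-1])
```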
For the TPOTClassifier function use the following settings:
generations=5
population_size=20
cv=3
random_state=123
Hints for 21-22
In your TPOT configuration you have to get the "long" and "complete" module name: xgboost.XGBClassifier
Question 21 (2 points)
What is the accuracy for your TPOT Search, rounded to 3 digits?
Question 22 (2 points)
What is the n_estimators chosen by your TPOT search?
50
75
100
AutoML with TPOT
Now that you've used TPOT to tune hyperparameters just for a single defined model (XGBoost), we're going to have you use TPOT to search for any algorithm. We refer to this as AutoML, for automated machine learning. We'll allow TPOT to find stacked models, so the hyperparameters being tuned won't be the same as the ones we've been tuning. When you store your results with your results-tracking function, you can just add 'n/a' for each of the hyperparameters.
For AutoML with TPOT, we'd like you to use the following configuration:
generations=5
population_size=30
cv=3
scoring='accuracy'
random_state=123
config_dict='TPOT light'
Note, we are using the 'TPOT light' configuration for speed here. Only models that are quick to run are included. If time weren't an issue, then you would want to use the regular configuration, but we're trying to keep things simple for the homework, so stick with TPOT light.
Remember, you're using the TPOTClassifier, not the TPOTRegressor as you used in the lesson.
Question 23 (2 points)
What is the accuracy for your TPOT AutoML Search, rounded to 3 digits?
Question 24 (2 points)
Which of these pipelines did TPOT choose?
BernoulliNB, MultinomialNB, DecisionTreeClassifier, and MaxAbsScaler
BernoulliNB, CombineDFs, and VarianceThreshold
LogisticRegression, SelectPercentile, CombineDFs
LogisticRegression, MaxAbsScaler, SelectFwe
LogisticRegression, SelectFwe, and MinMaxScaler
Question 25 - Summary (Manually Graded) (4 points)
Take a screenshot of your results tracking table and upload it. It should have the columns for the approach used, the hyperparameters chosen, the fits, the accuracy, the sensitivity, and the precision. Answer the following questions:
If the bank just wants to have the most accurate predictions, which hyperparameter optimization approach would they choose?
If the bank isn't as concerned about misclassifying some truly good loans as it is about correctly predicting truly bad loans, which model should they use? Why?
Why did TPOT (not AutoML TPOT) fail to find the best hyperparameters?
Be sure to answer each question and don't forget your results dataframe screenshot.