
NOTE

If you have trouble with this notebook, make sure you're using the Ubuntu 22.04 software environment (Settings -> Right Column) and the Python 3 (system-wide) kernel (upper right corner of the notebook).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from tpot import TPOTClassifier
from tpot import TPOTRegressor

Lesson 08 Homework - Hyperparameter Optimization (Project)

When asking questions about homework in Piazza, please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be "HW1.3 question". Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

Note: This introduction is not included in the Canvas quiz.

For this project you're going to apply hyperparameter optimization to both a regression and a classification problem. It looks like a lot to do below, but it's mostly a matter of modifying code from the presentation.

Objective

For each of the models in parts 1 and 2 below, apply the following four tuning methods from the presentation: GridSearchCV, RandomizedSearchCV, BayesSearchCV, and TPOT.

  • For TPOT: In Part 1 you will only do hyperparameter optimization for ExtraTreesRegressor. In Part 2 you will do hyperparameter optimization and also run TPOT to let it choose the model. See the presentation for examples of both.

Specific Quiz Questions

Follow along and use the required parameters and random seeds so that you can correctly answer the quiz questions.

Regarding data

  • To answer the multiple choice quiz questions, you'll need to use the data we have chosen.

  • We encourage you to try these out on your own data, too, to deepen your learning.

Part 1 - Optimize Extra Trees Regressor

Hints for Part 1

This section is very similar to the lesson. You should be able to mimic the lesson to finish this section!

Find optimized hyperparameters for an extra trees regression model.

In the lesson, our TPOT AutoML code suggested that a viable algorithm to explore would be the ExtraTreesRegressor. For part 1 of your homework, you'll use sklearn's ExtraTreesRegressor and attempt to optimize the hyperparameters.

You must use the diamonds data used in the presentation. You do not need to include the TPOT general search for this problem (use TPOT to optimize ExtraTreesRegressor, but don't run TPOT to choose a model). Here are ranges for a subset of the hyperparameters:

| Hyperparameter    | Type               | Typical Range                               |
|-------------------|--------------------|---------------------------------------------|
| n_estimators      | discrete / integer | 10 to 150                                   |
| min_samples_split | discrete / integer | 2 to 20                                     |
| min_samples_leaf  | discrete / integer | 1 to 10                                     |
| max_features      | discrete / integer | 1 to 30                                     |
| bootstrap         | discrete / boolean | True, False (use this order where possible) |

Note: there are other hyperparameters that could be added, but we will focus on these for the project. Consult the documentation for sklearn's ExtraTreesRegressor to see all of the available hyperparameters.

Question 1: Setup (1 point)

  • Load the diamonds dataset (diamonds_transformed.csv in the data directory).

  • Set up your X and y variables.

  • Split into 80% training data and 20% testing data.

  • Use random_state = 123 for reproducibility.

  • Use default values for the hyperparameters by not specifying values for them.

How many rows are in your training data?
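A minimal setup sketch is below. The target column name ('price') is an assumption here; check the columns of diamonds_transformed.csv and adjust if the lesson used a different name.

df = pd.read_csv('./data/diamonds_transformed.csv')

# 'price' as the target column is an assumption -- verify against the CSV
X = df.drop(columns=['price'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(X_train.shape[0])  # number of rows in the training data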

Question 2 (1 point)

In the following cell, we provide you with the same my_regression_results function we used in the lesson. Create an ExtraTreesRegressor model using random_state=123. Fit your model. Use the my_regression_results function to get the Root Mean Squared Error on the test data.

What is the RMSE (Root Mean Squared Error) using the default hyperparameters?

  • 1875.57

  • 2056.87

  • 9688.00

  • 1833.88

  • 2053.20

# function to easily assess different models (not included in Canvas Quiz)
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def my_regression_results(model):
    score_test = model.score(X_test, y_test)
    print('Model r-squared score from test data: {:0.4f}'.format(score_test))
    y_pred = model.predict(X_test)
    plt.plot(y_test, y_pred, 'k.')
    plt.xlabel('Test Values')
    plt.ylabel('Predicted Values');
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print('Mean squared error on test data: {:0.2f}'.format(mse))
    print('Root mean squared error on test data: {:0.2f}'.format(rmse))
    return round(rmse, 2)
reg = ExtraTreesRegressor(random_state=123).fit(X_train,y_train)
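Then, to report the RMSE, pass the fitted model to the helper defined above:

my_regression_results(reg)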

Hints for 3

Make sure you've got the hyperparameter names spelled correctly or you'll have problems later.

Question 3 (Manually Graded) (2 points)

Modify the track_results function to work with the Extra Trees Regressor hyperparameters. Enter your results based on the default hyperparameters and display the dataframe of results.

In the Canvas quiz, copy your code and provide a screenshot of the output.
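If you don't have the lesson's track_results handy, a hypothetical tracker along these lines would work; the column names and signature here are illustrative, not the lesson's exact API.

# hypothetical tracker -- adapt to match the lesson's track_results if it differs
results = pd.DataFrame(columns=['approach', 'n_estimators', 'min_samples_split',
                                'min_samples_leaf', 'max_features', 'bootstrap',
                                'fits', 'RMSE'])

def track_results(approach, params, fits, rmse):
    # append one row of results; params is a dict of hyperparameter values
    global results
    row = {'approach': approach, 'fits': fits, 'RMSE': rmse}
    for name in ['n_estimators', 'min_samples_split', 'min_samples_leaf',
                 'max_features', 'bootstrap']:
        row[name] = params.get(name, 'default')  # 'default' marks unspecified values
    results = pd.concat([results, pd.DataFrame([row])], ignore_index=True)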

ExtraTreesRegressor Grid Search

Perform a cross-validated grid search using the following values for your hyperparameter search space.

  • n_estimators: [50, 100, 150]

  • max_features: [1, 15, 30]

  • min_samples_split: [2, 8]

  • min_samples_leaf: [1, 15]

  • bootstrap: [True, False]

Use the following setting in your grid search:

  • cv=5

Note: this may take a while on CoCalc

Be sure to track your results using your track_results function.
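Combining the grid above with cv=5, a minimal sketch might look like this; the random_state=123 on the estimator mirrors Question 2 and is an assumption, not part of the stated grid.

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_features': [1, 15, 30],
    'min_samples_split': [2, 8],
    'min_samples_leaf': [1, 15],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(ExtraTreesRegressor(random_state=123), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
my_regression_results(grid_search)  # GridSearchCV refits on the best parameters by default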

Question 4 (2 points)

What is the RMSE of your optimized grid search, rounded to 2 digits?

Question 5 (2 points)

What is the optimal value of max_features chosen by the grid search?

  • 1

  • 15

  • 30

# track your results

ExtraTreesRegressor Randomized Search

Use the following values to set up your randomized search space:

  • n_estimators: random integers between 10 and 150

  • min_samples_split: random integers between 2 and 20

  • min_samples_leaf: random integers between 1 and 20

  • max_features: random integers between 1 and 30

  • bootstrap: True or False (in that order)

Use the following settings for your randomized search:

  • n_iter of 25

  • cv of 5

  • random_state of 123
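A sketch under the spec above; note that scipy's randint excludes its upper endpoint, so whether the stated bounds should be inclusive is a convention you should check against the lesson.

param_dist = {
    'n_estimators': randint(10, 150),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': randint(1, 30),
    'bootstrap': [True, False]
}

random_search = RandomizedSearchCV(ExtraTreesRegressor(random_state=123), param_dist,
                                   n_iter=25, cv=5, random_state=123)
random_search.fit(X_train, y_train)
print(random_search.best_params_)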

Question 6 (2 points)

What is the RMSE of your randomized search, rounded to 2 digits?

Question 7 (2 points)

What is the max_features chosen by your random search?

  • 5

  • 7

  • 9

  • 11

  • 13

#run your code
#remember to track your results

Bayesian Optimization

For your Bayesian Optimization, we'll use the same ranges we used in random search. You won't need to wrap any of your integer ranges in Integer(), but you will need to use Categorical([True, False]) for your bootstrap parameter.

Use the following values to set up your search space:

  • n_estimators: integers between 10 and 150

  • min_samples_split: integers between 2 and 20

  • min_samples_leaf: integers between 1 and 20

  • max_features: integers between 1 and 30

  • bootstrap: Categorical of True, False (in that order)

Use the following settings for your search:

  • n_iter of 25

  • cv of 5

  • random_state of 123
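A sketch of the search space and call; plain (low, high) tuples of integers are interpreted by BayesSearchCV as Integer dimensions, which is why no Integer() wrapper is needed. The random_state=123 on the estimator itself is an assumption carried over from Question 2.

search_space = {
    'n_estimators': (10, 150),
    'min_samples_split': (2, 20),
    'min_samples_leaf': (1, 20),
    'max_features': (1, 30),
    'bootstrap': Categorical([True, False])
}

bayes_search = BayesSearchCV(ExtraTreesRegressor(random_state=123), search_space,
                             n_iter=25, cv=5, random_state=123)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_)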

Question 8 (2 points)

What is the RMSE of your Bayesian search, rounded to 2 digits?

Question 9 (2 points)

What is the value of min_samples_leaf chosen by your Bayesian search?

  • 1

  • 3

  • 5

  • 7

  • 9

#Run your Bayesian Search
#remember to track your results

TPOT

For TPOT, you'll use the following search config values:

  • n_estimators: each of the following integers: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150

  • min_samples_split: all integers between 2 and 20 (inclusive of 20)

  • min_samples_leaf: all integers between 1 and 20 (inclusive of 20)

  • max_features: all integers between 1 and 30 (inclusive of 30)

  • bootstrap: either 0 or 1 ([0, 1] in that order)

Use 5 generations, a population size of 10, and cv of 3, with a random state of 123. (Note: this is not nearly enough generations or a big enough population to truly find the best hyperparameters, but we also don't want you to have to sit and watch it chug through for an hour.) We'll include stacked models here, so do not use template='Regressor' when you set up your TPOTRegressor.

Hints for 10-11

In your TPOT configuration you have to get the "long" and "complete" module name: sklearn.ensemble.ExtraTreesRegressor
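A sketch of the TPOT configuration under the values above; verbosity=2 is optional and just shows progress.

tpot_config = {
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': range(10, 151, 10),   # 10, 20, ..., 150
        'min_samples_split': range(2, 21),    # 2 to 20 inclusive
        'min_samples_leaf': range(1, 21),     # 1 to 20 inclusive
        'max_features': range(1, 31),         # 1 to 30 inclusive
        'bootstrap': [0, 1]
    }
}

tpot = TPOTRegressor(generations=5, population_size=10, cv=3, random_state=123,
                     config_dict=tpot_config, verbosity=2)
tpot.fit(X_train, y_train)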

Question 10 (2 points)

What is the RMSE of your TPOT search, rounded to 2 digits?

Question 11 (2 points)

What is the value of n_estimators chosen by your TPOT search? More than one value may be possible since models can be nested or stacked. Check all possible values of n_estimators that occur in your TPOT pipeline. Report the results of the "inner" model in your tracking results dataframe.

  • 60

  • 70

  • 80

  • 90

  • 110

#run your TPOT code

Question 12 (Manually Graded) (2 points)

Take a screenshot of your final results dataframe from the track_results function and upload it. Briefly comment on the results.

#track results, output and screenshot

Don't forget to comment on your results. Comments should include some information about which method you'd choose and why. Keep in mind that the best method to use depends on how long it takes to fit the model. For a very expensive model (e.g., a large neural network) we might choose Bayesian optimization, but for a cheap model we can probably afford to do an exhaustive grid search.

Part Two - Loan Classification

In part two, we'll explore optimizing hyperparameters for loan classification.

Notes:

About the data

The first cell below loads a subset of the loan default data from DS705, and your job is to predict whether a loan defaults or not. The status_Bad column is the target column, and a 1 indicates a loan that defaulted. We have selected a subset of the original data that includes 2000 each of good and bad loans. The data has already been cleaned and encoded.

This is classification, not regression

The score for each model will be accuracy rather than RMSE. Your summary table should include accuracy, sensitivity, and precision for each optimized model applied to the test data. (Here is a nice overview of metrics for binary classification that includes definitions of accuracy and such.)

Load the Data

In the following cell, we load the data for you and split it into train and test dataframes. Do not change anything in this cell. (Cell not included in Canvas Quiz.)

# Do not change this cell for loading and preparing the data
import pandas as pd
import numpy as np

X = pd.read_csv('./data/loans_subset.csv')

# split into predictors and target
# convert to numpy arrays for xgboost, OK for other models too
y = np.array(X['status_Bad'])  # 1 for bad loan, 0 for good loan
X = np.array(X.drop(columns=['status_Bad']))

# split into test and training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Question 13 - Display Results Function (Manually Graded) (2 points)

In the next cell, we've demonstrated using the LogisticRegression model to perform classification and generate a confusion matrix.

Hints about Confusion Matrix

You can read more about the confusion_matrix function and the classification_report function in the scikit-learn documentation. Both are also demonstrated in the extras folder of this lesson.

IMPORTANT: A bad loan is a "positive" in this case, since we are trying to detect bad loans.

# may get a warning here. the data should probably be scaled, but that's an issue for another lesson :)
logreg_model = LogisticRegression(solver='lbfgs', max_iter=2000)

# fit the model
logreg_model.fit(X_train, y_train)

# use score method to get accuracy of model
score = logreg_model.score(X_test, y_test)  # this is accuracy
print(f'The accuracy is {score}')

# obtaining the confusion matrix and making it look nice
y_pred = logreg_model.predict(X_test)
# must put true before predictions in confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]),
    index=['true:bad', 'true:good'],
    columns=['pred:bad', 'pred:good']
)
print(cmtx)  # display the matrix so you can compare your function's output to it

Based on the example above, write a function called my_classifier_results, modeled after my_regression_results, that applies a model to the test data, prints out the accuracy, sensitivity, precision, and the confusion matrix, and returns the accuracy, sensitivity, and precision. There is no need to make a plot.

Call your function using the logistic regression model we just demonstrated. (Note that your confusion matrix and accuracy should match what is shown above.) Upload the code and a screenshot of the output.

# Solution for 13
def my_classifier_results(model):
    # add your code here
    pass
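If you get stuck, one possible sketch follows; it assumes the global X_test and y_test from the loading cell, and there are many equally valid ways to compute the metrics (e.g., via classification_report).

def my_classifier_results(model):
    y_pred = model.predict(X_test)
    accuracy = model.score(X_test, y_test)

    # with labels=[0, 1], ravel() unpacks as tn, fp, fn, tp (bad loan = positive)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)

    print(f'Accuracy: {accuracy:0.4f}')
    print(f'Sensitivity: {sensitivity:0.4f}')
    print(f'Precision: {precision:0.4f}')
    cmtx = pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1, 0]),
        index=['true:bad', 'true:good'],
        columns=['pred:bad', 'pred:good']
    )
    print(cmtx)
    return accuracy, sensitivity, precision

my_classifier_results(logreg_model)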

XGBoost Classifier

The algorithm that we will use to tune hyperparameters is the XGBClassifier algorithm from XGBoost. We've included the hyperparameters we'll tune and their defaults below:

| Hyperparameter   | Type               | Default Value | Typical Range |
|------------------|--------------------|---------------|---------------|
| n_estimators     | discrete / integer | 100           | 50 to 150     |
| max_depth        | discrete / integer | 3             | 1 to 10       |
| min_child_weight | discrete / integer | 1             | 1 to 20       |
| learning_rate    | continuous / float | 0.1           | 0.001 to 1    |
| subsample        | continuous / float | 1             | 0.05 to 1     |
| reg_lambda       | continuous / float | 1             | 0 to 5        |
| reg_alpha        | continuous / float | 0             | 0 to 5        |

Question 14 - (2 points)

Generate the default XGBClassifier model. Note: you'll need to pass in objective = 'binary:logistic' when you instantiate the XGBClassifier.

What is the accuracy of the default model, rounded to 3 digits?

# use these as the defaults. they may not agree with the latest documentation,
# but they're fine for our purposes.
# pass them to the model as **param_defaults as in the lesson
param_defaults = {
    'objective': 'binary:logistic',
    'n_estimators': 100,
    'max_depth': 3,
    'min_child_weight': 1,
    'learning_rate': 0.1,
    'subsample': 1,
    'reg_lambda': 1,
    'reg_alpha': 0
}
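Fitting the default model is then a matter of unpacking the dictionary, as the comment above suggests:

xgb_default = xgb.XGBClassifier(**param_defaults)
xgb_default.fit(X_train, y_train)
my_classifier_results(xgb_default)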

Tracking Results Function

Create a track_results_classifier function based on the track_results function. You'll be tracking each of the XGBClassifier hyperparameters as well as the name of the optimization approach, accuracy, precision, sensitivity (recall), and the number of fits.

Add the results from your default XGBClassifier model to the tracker.

(Note: this is not graded here, but the output will be graded as part of the summary.)

Grid Search for XGBClassifier

Perform a grid search using the following parameters:

  • learning_rate: [0.01, 0.1],

  • max_depth: [3, 6],

  • n_estimators: [10, 100],

  • subsample: [0.5, 1],

  • min_child_weight: [1, 20],

  • reg_lambda: [1, 3],

  • reg_alpha: [0, 1]

Use the following setting in your GridSearch:

  • cv = 3

Set the np.random.seed to 123 (done for you in the cell below).

Note: this is a smaller-than-optimal grid, but we don't want you to have to wait forever for it to process. On CoCalc, this took about 10 minutes to run.
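A sketch of the grid search; the seed call matches the instruction above, and objective='binary:logistic' follows Question 14.

np.random.seed(123)

param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'n_estimators': [10, 100],
    'subsample': [0.5, 1],
    'min_child_weight': [1, 20],
    'reg_lambda': [1, 3],
    'reg_alpha': [0, 1]
}

grid_xgb = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic'), param_grid, cv=3)
grid_xgb.fit(X_train, y_train)
print(grid_xgb.best_params_)
my_classifier_results(grid_xgb)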

Question 15 (2 points)

What is the accuracy for your Grid Search, rounded to 3 digits?

Question 16 (2 points)

How many fits did your Grid Search do?

  • 384

  • 398

  • 279

  • 400

  • 375

np.random.seed(123)
#don't forget to track your results

Randomized Search for XGBClassifier

Use the following parameters to generate a random search:

  • learning_rate: any value from the list [0.001, 0.01, 0.1, 0.5, 1.]

  • max_depth: any random integer between 1 and 10

  • n_estimators: any random integer between 50 and 150

  • subsample: uniform(0.05, 0.95)

  • min_child_weight: any random integer between 1 and 20

  • reg_alpha: uniform(0, 5)

  • reg_lambda: uniform(0, 5)

Use the following settings in your random search:

  • random_state = 123

  • n_iter = 25

  • cv = 3
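A sketch under the spec above; scipy's uniform(loc, scale) samples from [loc, loc + scale], and randint excludes its upper endpoint, so check the lesson's convention on inclusive bounds.

param_dist = {
    'learning_rate': [0.001, 0.01, 0.1, 0.5, 1.],
    'max_depth': randint(1, 10),
    'n_estimators': randint(50, 150),
    'subsample': uniform(0.05, 0.95),
    'min_child_weight': randint(1, 20),
    'reg_alpha': uniform(0, 5),
    'reg_lambda': uniform(0, 5)
}

rand_xgb = RandomizedSearchCV(xgb.XGBClassifier(objective='binary:logistic'), param_dist,
                              n_iter=25, cv=3, random_state=123)
rand_xgb.fit(X_train, y_train)
print(rand_xgb.best_params_)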

Question 17 (2 points)

What is the accuracy for your Random Search, rounded to 3 digits?

Question 18 (2 points)

What is the learning_rate chosen by your Random Search?

  • .001

  • .01

  • .1

  • .5

  • 1.

#run your random search
#don't forget to track your results

Bayesian Optimization

For your Bayesian Optimization, use the following parameters:

  • learning_rate: any of the following values - [0.001, 0.01, 0.1, 0.5, 1.] (Hint: you'll need to use Categorical for this one)

  • max_depth: Any integer between 1 and 10

  • n_estimators: Any integer between 10 and 150

  • subsample: Any float between 0.05 and .95

  • min_child_weight: Any integer between 1 and 20

  • reg_alpha: Any integer between 0 and 5

  • reg_lambda: Any integer between 0 and 5

For your call to BayesSearchCV use the following:

  • random_state = 123

  • n_iter = 15

  • cv = 3
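A sketch of the Bayesian search; integer tuples become Integer dimensions, Real covers the subsample range, and Categorical handles the learning_rate list.

search_space = {
    'learning_rate': Categorical([0.001, 0.01, 0.1, 0.5, 1.]),
    'max_depth': (1, 10),
    'n_estimators': (10, 150),
    'subsample': Real(0.05, 0.95),
    'min_child_weight': (1, 20),
    'reg_alpha': (0, 5),
    'reg_lambda': (0, 5)
}

bayes_xgb = BayesSearchCV(xgb.XGBClassifier(objective='binary:logistic'), search_space,
                          n_iter=15, cv=3, random_state=123)
bayes_xgb.fit(X_train, y_train)
print(bayes_xgb.best_params_)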

Question 19 (2 points)

What is the accuracy for your Bayes Search, rounded to 3 decimal places?

Question 20 (2 points)

What is the precision for your Bayes Search, rounded to 3 decimal places?

#run your Bayesian Search
#don't forget to track your results

Genetic Algorithm from TPOT

First, you'll tune the parameters specifically for the XGBClassifier using TPOTClassifier. This will be very similar to what we did in the lesson, except there we used TPOTRegressor. Use the following parameters in your configuration:

  • n_estimators: allow the values in the following list [50, 75, 100]

  • max_depth: allow all values between 1 and 10, inclusive (remember, range does not include the highest number)

  • learning_rate: use the values in the following list - [1e-3, 1e-2, 1e-1, 0.5, 1.],

  • subsample: evenly spaced values between .05 and 1, using a step of .05 (remember that you'll need to account for the stop number not being included)

  • min_child_weight: allow all values between 1 and 20, inclusive

  • reg_alpha: allow all values between 1 and 5, inclusive

  • reg_lambda: allow all values between 1 and 5, inclusive

  • objective: set it to ['binary:logistic']

For the TPOTClassifier function use the following settings:

  • generations=5

  • population_size=20

  • cv=3

  • random_state=123

Hints for 21-22

In your TPOT configuration you have to get the "long" and "complete" module name: xgboost.XGBClassifier
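A sketch of the configuration; np.arange(0.05, 1.05, 0.05) accounts for the excluded stop value so that 1.0 is included in the subsample grid.

tpot_config = {
    'xgboost.XGBClassifier': {
        'n_estimators': [50, 75, 100],
        'max_depth': range(1, 11),                  # 1 to 10 inclusive
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.05, 0.05),   # 0.05, 0.10, ..., 1.0
        'min_child_weight': range(1, 21),           # 1 to 20 inclusive
        'reg_alpha': range(1, 6),                   # 1 to 5 inclusive
        'reg_lambda': range(1, 6),                  # 1 to 5 inclusive
        'objective': ['binary:logistic']
    }
}

tpot_xgb = TPOTClassifier(generations=5, population_size=20, cv=3, random_state=123,
                          config_dict=tpot_config, verbosity=2)
tpot_xgb.fit(X_train, y_train)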

Question 21 (2 points)

What is the accuracy for your TPOT Search, rounded to 3 digits?

Question 22 (2 points)

What is the n_estimators chosen by your TPOT search?

  • 50

  • 75

  • 100

#run TPOT for XGBClassifier
#don't forget to track your results

AutoML with TPOT

Now that you've used TPOT to tune hyperparameters just for a single defined model (XGBoost), we're going to have you use TPOT to search for any algorithm. We refer to this as AutoML, for automated machine learning. We'll allow TPOT to find stacked models, so the hyperparameters being tuned won't be the same as the ones we've been tuning. When you record your results with your tracking function, you can just enter 'n/a' for each of the hyperparameters.

For AutoML with TPOT, we'd like you to use the following configuration:

  • generations=5

  • population_size=30

  • cv=3

  • scoring='accuracy'

  • random_state=123

  • config_dict='TPOT light'

Note: we are using the 'TPOT light' configuration for speed here; only models that are quick to run are included. If time weren't an issue you would want to use the regular configuration, but we're trying to keep things simple for the homework, so stick with TPOT light.

Remember, you're using the TPOTClassifier, not the TPOTRegressor as you used in the lesson.
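The AutoML call itself is short; printing fitted_pipeline_ shows the chosen pipeline for Question 24.

tpot_automl = TPOTClassifier(generations=5, population_size=30, cv=3,
                             scoring='accuracy', random_state=123,
                             config_dict='TPOT light', verbosity=2)
tpot_automl.fit(X_train, y_train)
print(tpot_automl.fitted_pipeline_)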

Question 23 (2 points)

What is the accuracy for your TPOT AutoML Search, rounded to 3 digits?

Question 24 (2 points)

Which of these pipelines did TPOT choose?

  • BernoulliNB, MultinomialNB, DecisionTreeClassifier and MaxAbsScaler

  • BernoulliNB, CombineDFs, and VarianceThreshold

  • LogisticRegression, SelectPercentile, CombineDFs

  • LogisticRegression, MaxAbsScaler, SelectFwe

  • LogisticRegression, SelectFwe, and MinMaxScaler

#Run TPOT automl
#don't forget to track your results. You may use 'n/a' for all the hyperparameters, since we're not going to end up with the same model

Question 25 - Summary (Manually Graded) (4 points)

Take a screenshot of your results tracking table and upload it. It should have columns for the approach used, the hyperparameters chosen, the number of fits, the accuracy, the sensitivity, and the precision. Answer the following questions:

If the bank just wants to have the most accurate predictions, which hyperparameter optimization approach would they choose?

If the bank isn't as concerned about misclassifying some truly good loans as it is about correctly predicting truly bad loans, which model should it use? Why?

Why did TPOT (not AutoML TPOT) fail to find the best hyperparameters?

Be sure to answer each question and don't forget your results dataframe screenshot.