
NOTE

If you have trouble with this notebook, make sure you're using the Ubuntu 22.04 software environment (Settings -> Right Column) and the Python 3 (system-wide) kernel (upper right corner of the notebook).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from tpot import TPOTClassifier
from tpot import TPOTRegressor

Lesson 08 Homework - Hyperparameter Optimization (Project)

When asking questions about homework in Piazza, please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be "HW1.3 question". Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

Note: This introduction is not included in the Canvas quiz.

For this project you're going to apply hyperparameter optimization to both a regression and a classification problem. It looks like a lot to do below, but it's mostly a matter of modifying code from the presentation.

Objective

For each of the models in parts 1 and 2 below, apply the following four tuning methods from the presentation: GridSearchCV, RandomizedSearchCV, BayesSearchCV, and TPOT.

  • For TPOT: In Part 1 you will only do hyperparameter optimization for ExtraTreesRegressor. In Part 2 you will do hyperparameter optimization and also run TPOT to let it choose the model. See the presentation for examples of both.

Specific Quiz Questions

Follow along and use the required parameters and random seeds so that you can correctly answer the quiz questions.

Regarding data

  • To answer the multiple choice quiz questions, you'll need to use the data we have chosen.

  • We encourage you to try these out on your own data, too, to deepen your learning.

Part 1 - Optimize Extra Trees Regressor

Hints for Part 1

This section is very similar to the lesson. You should be able to mimic the lesson to finish this section!

Find optimized hyperparameters for an extra trees regression model.

In the lesson, our TPOT AutoML code suggested that a viable algorithm to explore would be the ExtraTreesRegressor. For part 1 of your homework, you'll use sklearn's ExtraTreesRegressor and attempt to optimize the hyperparameters.

You must use the diamonds data used in the presentation. You do not need to include the TPOT general search for this problem (use TPOT to optimize ExtraTreesRegressor, but don't run TPOT to choose a model). Here are ranges for a subset of the hyperparameters:

| Hyperparameter    | Type               | Typical Range                               |
|-------------------|--------------------|---------------------------------------------|
| n_estimators      | discrete / integer | 10 to 150                                   |
| min_samples_split | discrete / integer | 2 to 20                                     |
| min_samples_leaf  | discrete / integer | 1 to 10                                     |
| max_features      | discrete / integer | 1 to 30                                     |
| bootstrap         | discrete / boolean | True, False (use this order where possible) |

Note: there are other hyperparameters that could be added, but we will focus on these for the project. Consult the documentation for sklearn's ExtraTreesRegressor to see all of the available hyperparameters.

Question 1: Setup (1 point)

  • Load the diamonds dataset (diamonds_transformed.csv in the data directory).

  • Set up your X and y variables.

  • Split into 80% training data and 20% testing data.

  • Use random_state = 123 for reproducibility.

  • Use default values for the hyperparameters by not specifying values for them.

How many rows are in your training data?
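A minimal setup sketch is below. The target column name ('price') is an assumption here; check the columns of diamonds_transformed.csv and adjust if the lesson used a different name.

df = pd.read_csv('./data/diamonds_transformed.csv')

# 'price' as the target column is an assumption -- verify against the CSV
X = df.drop(columns=['price'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
print(X_train.shape[0])  # number of rows in the training data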

Question 2 (1 point)

In the following cell, we provide you with the same my_regression_results function we used in the lesson. Create an ExtraTreesRegressor model using random_state=123. Fit your model. Use the my_regression_results function to get the Root Mean Squared Error on the test data.

What is the RMSE (Root Mean Squared Error) using the default hyperparameters?

  • 1875.57

  • 2056.87

  • 9688.00

  • 1833.88

  • 2053.20

# function to easily assess different models (not included in Canvas Quiz)
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def my_regression_results(model):
    score_test = model.score(X_test, y_test)
    print('Model r-squared score from test data: {:0.4f}'.format(score_test))
    y_pred = model.predict(X_test)
    plt.plot(y_test, y_pred, 'k.')
    plt.xlabel('Test Values')
    plt.ylabel('Predicted Values');
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print('Mean squared error on test data: {:0.2f}'.format(mse))
    print('Root mean squared error on test data: {:0.2f}'.format(rmse))
    return round(rmse, 2)
reg = ExtraTreesRegressor(random_state=123).fit(X_train,y_train)
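Then, to report the RMSE, pass the fitted model to the helper defined above:

my_regression_results(reg)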

Hints for 3

Make sure you've got the hyperparameter names spelled correctly or you'll have problems later.

Question 3 (Manually Graded) (2 points)

Modify the track_results function to work with the Extra Trees Regressor hyperparameters. Enter your results based on the default hyperparameters and display the dataframe of results.

In the Canvas quiz, copy your code and provide a screenshot of the output.
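If you don't have the lesson's track_results handy, a hypothetical tracker along these lines would work; the column names and signature here are illustrative, not the lesson's exact API.

# hypothetical tracker -- adapt to match the lesson's track_results if it differs
results = pd.DataFrame(columns=['approach', 'n_estimators', 'min_samples_split',
                                'min_samples_leaf', 'max_features', 'bootstrap',
                                'fits', 'RMSE'])

def track_results(approach, params, fits, rmse):
    # append one row of results; params is a dict of hyperparameter values
    global results
    row = {'approach': approach, 'fits': fits, 'RMSE': rmse}
    for name in ['n_estimators', 'min_samples_split', 'min_samples_leaf',
                 'max_features', 'bootstrap']:
        row[name] = params.get(name, 'default')  # 'default' marks unspecified values
    results = pd.concat([results, pd.DataFrame([row])], ignore_index=True)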

ExtraTreesRegressor Grid Search

Perform a cross-validated grid search using the following values for your hyperparameter search space.

  • n_estimators: [50, 100, 150]

  • max_features: [1, 15, 30]

  • min_samples_split: [2, 8]

  • min_samples_leaf: [1, 15]

  • bootstrap: [True, False]

Use the following setting in your grid search:

  • cv=5

Note: this may take a while on CoCalc

Be sure to track your results using your track_results function.
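Combining the grid above with cv=5, a minimal sketch might look like this; the random_state=123 on the estimator mirrors Question 2 and is an assumption, not part of the stated grid.

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_features': [1, 15, 30],
    'min_samples_split': [2, 8],
    'min_samples_leaf': [1, 15],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(ExtraTreesRegressor(random_state=123), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
my_regression_results(grid_search)  # GridSearchCV refits on the best parameters by default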

Question 4 (2 points)

What is the RMSE of your optimized grid search, rounded to 2 digits?

Question 5 (2 points)

What is the optimal value of max_features chosen by the grid search?

  • 1

  • 15

  • 30

# track your results

ExtraTreesRegressor Randomized Search

Use the following values to set up your randomized search space:

  • n_estimators: random integers between 10 and 150

  • min_samples_split: random integers between 2 and 20

  • min_samples_leaf: random integers between 1 and 20

  • max_features: random integers between 1 and 30

  • bootstrap: True or False (in that order)

Use the following settings for your randomized search:

  • n_iter of 25

  • cv of 5

  • random_state of 123
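A sketch under the spec above; note that scipy's randint excludes its upper endpoint, so whether the stated bounds should be inclusive is a convention you should check against the lesson.

param_dist = {
    'n_estimators': randint(10, 150),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': randint(1, 30),
    'bootstrap': [True, False]
}

random_search = RandomizedSearchCV(ExtraTreesRegressor(random_state=123), param_dist,
                                   n_iter=25, cv=5, random_state=123)
random_search.fit(X_train, y_train)
print(random_search.best_params_)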

Question 6 (2 points)

What is the RMSE of your randomized search, rounded to 2 digits?

Question 7 (2 points)

What is the max_features chosen by your random search?

  • 5

  • 7

  • 9

  • 11

  • 13

#run your code
#remember to track your results

Bayesian Optimization

For your Bayesian Optimization, we'll use the same ranges we used in random search. You won't need to wrap any of your integer ranges in Integer(), but you will need to use Categorical([True, False]) for your bootstrap parameter.

Use the following values to set up your search space:

  • n_estimators: integers between 10 and 150

  • min_samples_split: integers between 2 and 20

  • min_samples_leaf: integers between 1 and 20

  • max_features: integers between 1 and 30

  • bootstrap: Categorical of True, False (in that order)

Use the following settings for your search:

  • n_iter of 25

  • cv of 5

  • random_state of 123
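A sketch of the search space and call; plain (low, high) tuples of integers are interpreted by BayesSearchCV as Integer dimensions, which is why no Integer() wrapper is needed. The random_state=123 on the estimator itself is an assumption carried over from Question 2.

search_space = {
    'n_estimators': (10, 150),
    'min_samples_split': (2, 20),
    'min_samples_leaf': (1, 20),
    'max_features': (1, 30),
    'bootstrap': Categorical([True, False])
}

bayes_search = BayesSearchCV(ExtraTreesRegressor(random_state=123), search_space,
                             n_iter=25, cv=5, random_state=123)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_)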

Question 8 (2 points)

What is the RMSE of your Bayesian search, rounded to 2 digits?

Question 9 (2 points)

What is the value of min_samples_leaf chosen by your Bayesian search?

  • 1

  • 3

  • 5

  • 7

  • 9

#Run your Bayesian Search
#remember to track your results

TPOT

For TPOT, you'll use the following search config values:

  • n_estimators: each of the following integers: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150

  • min_samples_split: all integers between 2 and 20 (inclusive of 20)

  • min_samples_leaf: all integers between 1 and 20 (inclusive of 20)

  • max_features: all integers between 1 and 30 (inclusive of 30)

  • bootstrap: either 0 or 1 ([0, 1] in that order)

Use 5 generations, a population size of 10, and cv of 3, with a random state of 123. (Note: this is not nearly enough generations or a big enough population to truly find the best hyperparameters, but we also don't want you to have to sit and watch it chug through for an hour.) We'll include stacked models here, so do not use template='Regressor' when you set up your TPOTRegressor.

Hints for 10-11

In your TPOT configuration you have to get the "long" and "complete" module name: sklearn.ensemble.ExtraTreesRegressor
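A sketch of the TPOT configuration under the values above; verbosity=2 is optional and just shows progress.

tpot_config = {
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': range(10, 151, 10),   # 10, 20, ..., 150
        'min_samples_split': range(2, 21),    # 2 to 20 inclusive
        'min_samples_leaf': range(1, 21),     # 1 to 20 inclusive
        'max_features': range(1, 31),         # 1 to 30 inclusive
        'bootstrap': [0, 1]
    }
}

tpot = TPOTRegressor(generations=5, population_size=10, cv=3, random_state=123,
                     config_dict=tpot_config, verbosity=2)
tpot.fit(X_train, y_train)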

Question 10 (2 points)

What is the RMSE of your TPOT search, rounded to 2 digits?

Question 11 (2 points)

What is the value of n_estimators chosen by your TPOT search? More than one value may be possible since models can be nested or stacked. Check all possible values of n_estimators that occur in your TPOT pipeline. Report the results of the "inner" model in your tracking results dataframe.

  • 60

  • 70

  • 80

  • 90

  • 110

#run your TPOT code

Question 12 (Manually Graded) (2 points)

Take a screenshot of your final results dataframe from the track_results function and upload it. Briefly comment on the results.

#track results, output and screenshot

Don't forget to comment on your results. Comments should include some information about which method you'd choose and why. Keep in mind that the best method to use depends on how long it takes to fit the model. For a very expensive model (e.g., a large neural network) we might choose Bayesian optimization, but for a cheap model we can probably afford to do an exhaustive grid search.

Part Two - Loan Classification

In part two, we'll explore optimizing hyperparameters for loan classification.

Notes:

About the data

The first cell below loads a subset of the loan default data from DS705, and your job is to predict whether a loan defaults or not. The status_Bad column is the target column, and a 1 indicates a loan that defaulted. We have selected a subset of the original data that includes 2000 each of good and bad loans. The data has already been cleaned and encoded.

This is classification, not regression

The score for each model will be accuracy rather than RMSE. Your summary table should include accuracy, sensitivity, and precision for each optimized model applied to the test data. (Here is a nice overview of metrics for binary classification that includes definitions of accuracy and such.)

Load the Data

In the following cell, we load the data for you and split it into train and test dataframes. Do not change anything in this cell. (Cell not included in Canvas Quiz.)

# Do not change this cell for loading and preparing the data
import pandas as pd
import numpy as np

X = pd.read_csv('./data/loans_subset.csv')

# split into predictors and target
# convert to numpy arrays for xgboost, OK for other models too
y = np.array(X['status_Bad'])  # 1 for bad loan, 0 for good loan
X = np.array(X.drop(columns=['status_Bad']))

# split into test and training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Question 13 - Display Results Function (Manually Graded) (2 points)

In the next cell, we've demonstrated using the LogisticRegression model to perform classification and generate a confusion matrix.

Hints about Confusion Matrix

You can read more about the confusion_matrix function and the classification_report function in the scikit-learn documentation. Both are also demonstrated in the extras folder of this lesson.

IMPORTANT: A bad loan is a "positive" in this case, since we are trying to detect bad loans.

# may get a warning here. the data should probably be scaled, but that's an issue for another lesson :)
logreg_model = LogisticRegression(solver='lbfgs', max_iter=2000)

# fit the model
logreg_model.fit(X_train, y_train)

# use score method to get accuracy of model
score = logreg_model.score(X_test, y_test)  # this is accuracy
print(f'The accuracy is {score}')

# obtaining the confusion matrix and making it look nice
y_pred = logreg_model.predict(X_test)
# must put true before predictions in confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1, 0]),
    index=['true:bad', 'true:good'],
    columns=['pred:bad', 'pred:good']
)
print(cmtx)  # display the matrix so you can compare your function's output to it

Based on the example above, write a function called my_classifier_results, modeled after my_regression_results, that applies a model to the test data, prints out the accuracy, sensitivity, precision, and the confusion matrix, and returns the accuracy, sensitivity, and precision. There is no need to make a plot.

Call your function using the logistic regression model we just demonstrated. (Note that your confusion matrix and accuracy should match what is shown above.) Upload the code and a screenshot of the output.

# Solution for 13
def my_classifier_results(model):
    # add your code here
    pass
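If you get stuck, one possible sketch follows; it assumes the global X_test and y_test from the loading cell, and there are many equally valid ways to compute the metrics (e.g., via classification_report).

def my_classifier_results(model):
    y_pred = model.predict(X_test)
    accuracy = model.score(X_test, y_test)

    # with labels=[0, 1], ravel() unpacks as tn, fp, fn, tp (bad loan = positive)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)

    print(f'Accuracy: {accuracy:0.4f}')
    print(f'Sensitivity: {sensitivity:0.4f}')
    print(f'Precision: {precision:0.4f}')
    cmtx = pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1, 0]),
        index=['true:bad', 'true:good'],
        columns=['pred:bad', 'pred:good']
    )
    print(cmtx)
    return accuracy, sensitivity, precision

my_classifier_results(logreg_model)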

XGBoost Classifier

The algorithm that we will use to tune hyperparameters is the XGBClassifier algorithm from XGBoost. We've included the hyperparameters we'll tune and their defaults below:

| Hyperparameter   | Type               | Default Value | Typical Range |
|------------------|--------------------|---------------|---------------|
| n_estimators     | discrete / integer | 100           | 50 to 150     |
| max_depth        | discrete / integer | 3             | 1 to 10       |
| min_child_weight | discrete / integer | 1             | 1 to 20       |
| learning_rate    | continuous / float | 0.1           | 0.001 to 1    |
| subsample        | continuous / float | 1             | 0.05 to 1     |
| reg_lambda       | continuous / float | 1             | 0 to 5        |
| reg_alpha        | continuous / float | 0             | 0 to 5        |

Question 14 - (2 points)

Generate the default XGBClassifier model. Note: you'll need to pass in objective = 'binary:logistic' when you instantiate the XGBClassifier.

What is the accuracy of the default model, rounded to 3 digits?

# use these as the defaults. they may not agree with the latest documentation,
# but they're fine for our purposes.
# pass them to the model as **param_defaults as in the lesson
param_defaults = {
    'objective': 'binary:logistic',
    'n_estimators': 100,
    'max_depth': 3,
    'min_child_weight': 1,
    'learning_rate': 0.1,
    'subsample': 1,
    'reg_lambda': 1,
    'reg_alpha': 0
}
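Fitting the default model is then a matter of unpacking the dictionary, as the comment above suggests:

xgb_default = xgb.XGBClassifier(**param_defaults)
xgb_default.fit(X_train, y_train)
my_classifier_results(xgb_default)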

Tracking Results Function

Create a track_results_classifier function based on the track_results function. You'll be tracking each of the XGBClassifier hyperparameters as well as the name of the optimization approach, accuracy, precision, sensitivity (recall), and the number of fits.

Add the results from your default XGBClassifier model to the tracker.

(Note: this is not graded here, but the output will be graded as part of the summary.)

Grid Search for XGBClassifier

Perform a grid search using the following parameters:

  • learning_rate: [0.01, 0.1],

  • max_depth: [3, 6],

  • n_estimators: [10, 100],

  • subsample: [0.5, 1],

  • min_child_weight: [1, 20],

  • reg_lambda: [1, 3],

  • reg_alpha: [0, 1]

Use the following setting in your GridSearch:

  • cv = 3

Set the np.random.seed to 123 (done for you in the cell below).

Note: this is a smaller-than-optimal grid, but we don't want you to have to wait forever for it to process. On CoCalc, this took about 10 minutes to run.
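A sketch of the grid search; the seed call matches the instruction above, and objective='binary:logistic' follows Question 14.

np.random.seed(123)

param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 6],
    'n_estimators': [10, 100],
    'subsample': [0.5, 1],
    'min_child_weight': [1, 20],
    'reg_lambda': [1, 3],
    'reg_alpha': [0, 1]
}

grid_xgb = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic'), param_grid, cv=3)
grid_xgb.fit(X_train, y_train)
print(grid_xgb.best_params_)
my_classifier_results(grid_xgb)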

Question 15 (2 points)

What is the accuracy for your Grid Search, rounded to 3 digits?

Question 16 (2 points)

How many fits did your Grid Search do?

  • 384

  • 398

  • 279

  • 400

  • 375

np.random.seed(123)
#don't forget to track your results

Randomized Search for XGBClassifier

Use the following parameters to generate a random search:

  • learning_rate: any value from the list [0.001, 0.01, 0.1, 0.5, 1.]

  • max_depth: any random integer between 1 and 10

  • n_estimators: any random integer between 50 and 150

  • subsample: uniform(0.05, 0.95)

  • min_child_weight: any random integer between 1 and 20

  • reg_alpha: uniform(0, 5)

  • reg_lambda: uniform(0, 5)

Use the following settings in your random search:

  • random_state = 123

  • n_iter = 25

  • cv = 3
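A sketch under the spec above; scipy's uniform(loc, scale) samples from [loc, loc + scale], and randint excludes its upper endpoint, so check the lesson's convention on inclusive bounds.

param_dist = {
    'learning_rate': [0.001, 0.01, 0.1, 0.5, 1.],
    'max_depth': randint(1, 10),
    'n_estimators': randint(50, 150),
    'subsample': uniform(0.05, 0.95),
    'min_child_weight': randint(1, 20),
    'reg_alpha': uniform(0, 5),
    'reg_lambda': uniform(0, 5)
}

rand_xgb = RandomizedSearchCV(xgb.XGBClassifier(objective='binary:logistic'), param_dist,
                              n_iter=25, cv=3, random_state=123)
rand_xgb.fit(X_train, y_train)
print(rand_xgb.best_params_)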

Question 17 (2 points)

What is the accuracy for your Random Search, rounded to 3 digits?

Question 18 (2 points)

What is the learning_rate chosen by your Random Search?

  • .001

  • .01

  • .1

  • .5

  • 1.

#run your random search
#don't forget to track your results

Bayesian Optimization

For your Bayesian Optimization, use the following parameters:

  • learning_rate: any of the following values - [0.001, 0.01, 0.1, 0.5, 1.] (Hint: you'll need to use Categorical for this one)

  • max_depth: Any integer between 1 and 10

  • n_estimators: Any integer between 10 and 150

  • subsample: Any float between 0.05 and .95

  • min_child_weight: Any integer between 1 and 20

  • reg_alpha: Any integer between 0 and 5

  • reg_lambda: Any integer between 0 and 5

For your call to BayesSearchCV use the following:

  • random_state = 123

  • n_iter = 15

  • cv = 3
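A sketch of the Bayesian search; integer tuples become Integer dimensions, Real covers the subsample range, and Categorical handles the learning_rate list.

search_space = {
    'learning_rate': Categorical([0.001, 0.01, 0.1, 0.5, 1.]),
    'max_depth': (1, 10),
    'n_estimators': (10, 150),
    'subsample': Real(0.05, 0.95),
    'min_child_weight': (1, 20),
    'reg_alpha': (0, 5),
    'reg_lambda': (0, 5)
}

bayes_xgb = BayesSearchCV(xgb.XGBClassifier(objective='binary:logistic'), search_space,
                          n_iter=15, cv=3, random_state=123)
bayes_xgb.fit(X_train, y_train)
print(bayes_xgb.best_params_)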

Question 19 (2 points)

What is the accuracy for your Bayes Search, rounded to 3 decimal places?

Question 20 (2 points)

What is the precision for your Bayes Search, rounded to 3 decimal places?

#run your Bayesian Search
#don't forget to track your results

Genetic Algorithm from TPOT

First, you'll tune the parameters specifically for the XGBClassifier using TPOTClassifier. This will be very similar to what we did in the lesson, except there we used TPOTRegressor. Use the following parameters in your configuration:

  • n_estimators: allow the values in the following list [50, 75, 100]

  • max_depth: allow all values between 1 and 10, inclusive (remember, range does not include the highest number)

  • learning_rate: use the values in the following list - [1e-3, 1e-2, 1e-1, 0.5, 1.],

  • subsample: evenly spaced values between .05 and 1, using a step of .05 (remember that you'll need to account for the stop number not being included)

  • min_child_weight: allow all values between 1 and 20, inclusive

  • reg_alpha: allow all values between 1 and 5, inclusive

  • reg_lambda: allow all values between 1 and 5, inclusive

  • objective: set it to ['binary:logistic']

For the TPOTClassifier function use the following settings:

  • generations=5

  • population_size=20

  • cv=3

  • random_state=123

Hints for 21-22

In your TPOT configuration you have to get the "long" and "complete" module name: xgboost.XGBClassifier
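A sketch of the configuration; np.arange(0.05, 1.05, 0.05) accounts for the excluded stop value so that 1.0 is included in the subsample grid.

tpot_config = {
    'xgboost.XGBClassifier': {
        'n_estimators': [50, 75, 100],
        'max_depth': range(1, 11),                  # 1 to 10 inclusive
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.05, 0.05),   # 0.05, 0.10, ..., 1.0
        'min_child_weight': range(1, 21),           # 1 to 20 inclusive
        'reg_alpha': range(1, 6),                   # 1 to 5 inclusive
        'reg_lambda': range(1, 6),                  # 1 to 5 inclusive
        'objective': ['binary:logistic']
    }
}

tpot_xgb = TPOTClassifier(generations=5, population_size=20, cv=3, random_state=123,
                          config_dict=tpot_config, verbosity=2)
tpot_xgb.fit(X_train, y_train)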

Question 21 (2 points)

What is the accuracy for your TPOT Search, rounded to 3 digits?

Question 22 (2 points)

What is the n_estimators chosen by your TPOT search?

  • 50

  • 75

  • 100

#run TPOT for XGBClassifier
#don't forget to track your results

AutoML with TPOT

Now that you've used TPOT to tune hyperparameters just for a single defined model (XGBoost), we're going to have you use TPOT to search for any algorithm. We refer to this as AutoML, for automated machine learning. We'll allow TPOT to find stacked models, so the hyperparameters being tuned won't be the same as the ones we've been tuning. When you record your results with your tracking function, you can just enter 'n/a' for each of the hyperparameters.

For AutoML with TPOT, we'd like you to use the following configuration:

  • generations=5

  • population_size=30

  • cv=3

  • scoring='accuracy'

  • random_state=123

  • config_dict='TPOT light'

Note: we are using the 'TPOT light' configuration for speed here; only models that are quick to run are included. If time weren't an issue you would want to use the regular configuration, but we're trying to keep things simple for the homework, so stick with TPOT light.

Remember, you're using the TPOTClassifier, not the TPOTRegressor as you used in the lesson.
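The AutoML call itself is short; printing fitted_pipeline_ shows the chosen pipeline for Question 24.

tpot_automl = TPOTClassifier(generations=5, population_size=30, cv=3,
                             scoring='accuracy', random_state=123,
                             config_dict='TPOT light', verbosity=2)
tpot_automl.fit(X_train, y_train)
print(tpot_automl.fitted_pipeline_)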

Question 23 (2 points)

What is the accuracy for your TPOT AutoML Search, rounded to 3 digits?

Question 24 (2 points)

Which of these pipelines did TPOT choose?

  • BernoulliNB, MultinomialNB, DecisionTreeClassifier and MaxAbsScaler

  • BernoulliNB, CombineDFs, and VarianceThreshold

  • LogisticRegression, SelectPercentile, CombineDFs

  • LogisticRegression, MaxAbsScaler, SelectFwe

  • LogisticRegression, SelectFwe, and MinMaxScaler

#Run TPOT automl
#don't forget to track your results. You may use 'n/a' for all the hyperparameters, since we're not going to end up with the same model

Question 25 - Summary (Manually Graded) (4 points)

Take a screenshot of your results tracking table and upload it. It should have columns for the approach used, the hyperparameters chosen, the number of fits, the accuracy, the sensitivity, and the precision. Answer the following questions:

If the bank just wants to have the most accurate predictions, which hyperparameter optimization approach would they choose?

If the bank isn't as concerned about misclassifying some truly good loans as it is about correctly predicting truly bad loans, which model should it use? Why?

Why did TPOT (not AutoML TPOT) fail to find the best hyperparameters?

Be sure to answer each question and don't forget your results dataframe screenshot.