Path: blob/master/lessons/lesson_08/code/solution-code/LogisticRegression-BankMarketing-Lab-solutions.ipynb
Logistic Regression Lab
Exercise with bank marketing data
Author: Sam Stack (DC)
Introduction
Data from the UCI Machine Learning Repository: data, data dictionary
Goal: Predict whether a customer will purchase a bank product marketed over the phone
`bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website.
Step 1: Read the data into Pandas
**Target `y` is represented as:** No: 0, Yes: 1
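A minimal sketch of this step, using a tiny stand-in DataFrame (the real lab would load `bank-additional.csv`, which the UCI release ships semicolon-delimited, e.g. `pd.read_csv('bank-additional.csv', sep=';')`):

```python
import pandas as pd

# Toy stand-in for the bank marketing data; column names follow the dataset
bank = pd.DataFrame({
    'age': [30, 45, 52, 28],
    'job': ['admin.', 'technician', 'unknown', 'services'],
    'y': ['no', 'yes', 'no', 'no'],
})

# Encode the target as described above: No -> 0, Yes -> 1
bank['y'] = bank['y'].map({'no': 0, 'yes': 1})
print(bank['y'].tolist())  # [0, 1, 0, 0]
```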
Step 2: Prepare at least three features
Include both numeric and categorical features
Choose features that you think might be related to the response (based on intuition or exploration)
Think about how to handle missing values (encoded as "unknown")
Qualitative data analysis
So I have some unknown values in `education`, `marital`, and `employment`. We could assume that the 39 unknowns in `employment` are most likely in `admin` professions, or that the 11 unknowns in `marital` are most likely `married` (unfortunate that they are uncertain about it).
Personally, I'm going to drop the unknowns, as I do not want to incorporate any additional bias into the data itself.
Going forward, a more sound method of replacing unknowns is to build models that predict them, e.g. with K-Nearest Neighbors; that way you fill in an unknown using the most similar observations you have.
My data is ready to get dummied, but I'll wait until I'm about to model.
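A hedged sketch of the drop-the-unknowns approach plus the deferred dummying, again on a toy frame standing in for the bank data:

```python
import pandas as pd

# Toy frame standing in for the bank data (column names match the dataset)
bank = pd.DataFrame({
    'age': [30, 45, 52, 28, 61],
    'marital': ['married', 'unknown', 'single', 'married', 'divorced'],
    'education': ['basic.4y', 'high.school', 'unknown', 'university.degree', 'high.school'],
    'y': [0, 1, 0, 0, 1],
})

# Drop any row with 'unknown' in a column of interest
cols = ['marital', 'education']
bank = bank[~bank[cols].isin(['unknown']).any(axis=1)]

# One-hot encode the categoricals (deferred until just before modeling)
bank = pd.get_dummies(bank, columns=cols, drop_first=True)
print(len(bank))  # 3 rows survive
```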
Step 3: Model building
Use cross-validation to evaluate the logistic regression model with your chosen features. You can use any combination of the following metrics for evaluation.
Try to increase the AUC by selecting different sets of features.
Bonus: Experiment with hyperparameters such as regularization strength.
Build a Model
Model 1, using `age`, `job`, `education`, and `day_of_week`
Get the Coefficient for each feature.
Be sure to make note of interesting findings.
Seems like `job_entrepreneur` carries the largest coefficient.
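One way to pair each feature with its fitted coefficient (synthetic stand-in data; the column names are illustrative dummied features, not the lab's actual design matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=['age', 'job_entrepreneur',
                             'education_high.school', 'day_of_week_mon'])

logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature with its coefficient, largest magnitude first
coefs = pd.Series(logreg.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```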
Use the model to predict on `x_test` and evaluate it using the metric(s) of your choice.
**Accuracy Score**
Wow, that's a pretty good score, wouldn't you say? Almost 90%! Remember the distribution of classes, though. In our entire dataset there are 3,668 "No" and 451 "Yes" out of 4,119 total observations. If we guessed that nobody was going to convert and therefore predicted 'No' every time, we would be correct 89% of the time (according to our data). That being said, this accuracy is barely better than the baseline, and such an insignificant difference could simply come from how our train/test split grouped the data.
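The baseline (majority-class) accuracy quoted above follows directly from the class counts:

```python
# Majority-class baseline from the counts in the full dataset
no, yes = 3668, 451
total = no + yes           # 4119 observations
baseline = no / total
print(round(baseline, 2))  # 0.89
```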
Confusion Matrix
Looks like we have 880 true negatives and 99 false negatives. In other words, all our model is doing is predicting 'No' every time.
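Those numbers can be reproduced with `sklearn.metrics.confusion_matrix` on an always-'No' classifier (the labels here are constructed to match the counts above, not read from the lab's split):

```python
from sklearn.metrics import confusion_matrix

# 880 actual negatives, 99 actual positives; the model predicts 0 for everything
y_true = [0] * 880 + [1] * 99
y_pred = [0] * 979

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 880 0 99 0
```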
**ROC AUC**
The Area Under the ROC Curve is 0.5, which is completely worthless: our model gains no more insight than random guessing. If we go back to the accuracy score, we can now conclude that its minuscule improvement above the baseline comes from our train/test split.
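Why exactly 0.5? An always-'No' model effectively assigns the same score to every observation, so the ROC curve degenerates to the diagonal. A small illustration (counts chosen to match the confusion matrix above):

```python
from sklearn.metrics import roc_auc_score

y_true = [0] * 880 + [1] * 99
y_prob = [0.1] * 979   # constant predicted probability of class 1

# With constant scores, positives and negatives are indistinguishable -> AUC 0.5
print(roc_auc_score(y_true, y_prob))  # 0.5
```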
Log Loss
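For reference, log loss penalizes confident wrong probabilities rather than hard class calls; lower is better. A minimal sketch on made-up labels and probabilities:

```python
from sklearn.metrics import log_loss

# Illustrative true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.8, 0.7]

# Mean negative log-likelihood of the true labels under the predictions
print(round(log_loss(y_true, y_prob), 3))
```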
Model 2: Using `age`, `job`, `marital`, `education`, `contact`, and `day_of_week` to predict whether they bought or not.
None of the metrics really changed. Looks like the features we have aren't very helpful...
Is your model not performing very well?
Let's try one more thing before we revert to grabbing more features: adjusting the probability threshold.
Use the `LogisticRegression.predict_proba()` method to get the probabilities.
Recall from the lesson that the first probability is for class 0 and the second is for class 1.
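A sketch of `predict_proba` on synthetic data (in the lab you would call it on your fitted model and `x_test`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=1)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
probs = logreg.predict_proba(X)
print(probs.shape)  # (200, 2)
```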
Visualize the distribution
**Calculate a new threshold and use it to convert predicted probabilities to output classes**
Let's try decreasing the threshold to a predicted probability of 20% or higher.
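Converting class-1 probabilities to labels at a custom threshold is a one-liner; the probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 (column 1 of predict_proba)
probs_class1 = np.array([0.05, 0.15, 0.25, 0.60, 0.18, 0.22])

# Default 0.5 threshold vs. the lowered 0.2 threshold
preds_default = (probs_class1 >= 0.5).astype(int)
preds_lowered = (probs_class1 >= 0.2).astype(int)

print(preds_default.tolist())  # [0, 0, 0, 1, 0, 0]
print(preds_lowered.tolist())  # [0, 0, 1, 1, 0, 1]
```

Lowering the threshold trades false negatives for false positives, which is often worthwhile with a rare positive class like this one.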
Evaluate the model metrics now