Kernel: Python 2

Logistic Regression Lab

Exercise with bank marketing data

Authors: Sam Stack (DC)

Introduction

  • Data from the UCI Machine Learning Repository: data, data dictionary

  • Goal: Predict whether a customer will purchase a bank product marketed over the phone

  • bank-additional.csv is already in our repo, so there is no need to download the data from the UCI website

Step 1: Read the data into Pandas

import pandas as pd

bank = pd.read_csv('../../data/bank.csv')
bank.head()

**Target 'y' is represented as follows:**

  • No: 0

  • Yes: 1

# check the distribution of y
bank['y'].value_counts()
0    3668
1     451
Name: y, dtype: int64
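Side note (an addition, not part of the original lab): the classes are heavily imbalanced, and the majority-class share is the accuracy baseline any model here has to beat. A quick way to see it as proportions:

# normalize=True returns proportions instead of raw counts
bank['y'].value_counts(normalize=True)   # roughly 0.89 'No' vs 0.11 'Yes'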

Step 2: Prepare at least three features

  • Include both numeric and categorical features

  • Choose features that you think might be related to the response (based on intuition or exploration)

  • Think about how to handle missing values (encoded as "unknown")

# I'm going to take about 6 features and build two separate models:
# age, job, marital, education, contact, day_of_week.
# A correlation matrix or heat map is probably beneficial for finding useful features.
# This can be difficult with the number of categorical features in the data.
# Once they are converted to dummy variables, comparing all features will still be
# a computationally expensive process.
# There was no formal EDA behind my selection; I just wanted to use random features.
features = ['age', 'job', 'marital', 'education', 'contact', 'day_of_week', 'y']
for feat in features:
    if feat != 'age':
        print(bank[feat].value_counts())
admin.           1012
blue-collar       884
technician        691
services          393
management        324
retired           166
self-employed     159
entrepreneur      148
unemployed        111
housemaid         110
student            82
unknown            39
Name: job, dtype: int64
married     2509
single      1153
divorced     446
unknown       11
Name: marital, dtype: int64
university.degree      1264
high.school             921
basic.9y                574
professional.course     535
basic.4y                429
basic.6y                228
unknown                 167
illiterate                1
Name: education, dtype: int64
cellular     2652
telephone    1467
Name: contact, dtype: int64
thu    860
mon    855
tue    841
wed    795
fri    768
Name: day_of_week, dtype: int64
0    3668
1     451
Name: y, dtype: int64

Qualitative data analysis

So I have some unknown values in education, marital, and job. We could assume that the 39 unknowns in job are most likely in admin professions, or that the 11 unknowns in marital are most likely married (unfortunate that they are uncertain about it).

Personally, I'm going to drop the unknowns, as I do not want to incorporate any additional bias into the data itself.

  • Going forward, a sounder method of replacing unknowns is to build models that predict them using K-nearest neighbors; that way you fill in an unknown using the most similar observations you have (see the sketch below).
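As a hedged illustration of that idea (not part of the original lab, and the choice of predictors is an arbitrary assumption): train a KNeighborsClassifier on the rows where a feature is known, then predict it for the rows where it is 'unknown'.

from sklearn.neighbors import KNeighborsClassifier

# split rows by whether 'marital' is known; the predictors (age + dummied job)
# are purely illustrative, not a recommendation
known = bank[bank['marital'] != 'unknown']
unknown = bank[bank['marital'] == 'unknown']

X_known = pd.get_dummies(known[['age', 'job']], drop_first=True)
X_unknown = pd.get_dummies(unknown[['age', 'job']], drop_first=True)
# align dummy columns in case a job level is absent from the unknown subset
X_unknown = X_unknown.reindex(columns=X_known.columns, fill_value=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_known, known['marital'])
imputed_marital = knn.predict(X_unknown)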

# creating the sub-dataframe with only the features I'm using
bank_a = bank[features]

# getting rid of unknowns
bank_a = bank_a[bank_a['education'] != 'unknown']
bank_a = bank_a[bank_a['job'] != 'unknown']
bank_a = bank_a[bank_a['marital'] != 'unknown']

My data is ready to get dummied, but I'll wait until I'm about to model.

Step 3: Model building

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

Build a Model

Model 1 uses age, job, education, and day_of_week.

# md = ModelData. get_dummies ignores numeric columns such as age and y
bank_md1 = pd.get_dummies(bank_a[['age', 'job', 'education', 'day_of_week', 'y']], drop_first=True)

# no hyperparameters for the first model
LogReg1 = LogisticRegression()

# X and y features
X1 = bank_md1.drop('y', axis=1)
y1 = bank_md1['y']

# using train_test_split for a holdout evaluation
x_train1, x_test1, y_train1, y_test1 = train_test_split(X1, y1, random_state=42)

# fit model
LogReg1.fit(x_train1, y_train1)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Get the coefficient for each feature.

  • Be sure to make note of interesting findings.

It seems like job_entrepreneur carries the largest coefficient.

name = bank_md1.columns.drop('y')
coef = LogReg1.coef_[0]
pd.DataFrame([name, coef], index=['Name', 'Coef']).transpose()
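A small optional addition (not in the original notebook): sorting the coefficients by absolute magnitude makes the strongest effects easier to spot than scanning the raw table.

# rank features by the absolute size of their coefficient
pd.Series(coef, index=name).abs().sort_values(ascending=False).head(10)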

Use the model to predict on x_test1 and evaluate it using the metric(s) of your choice.

# predict with model
y_pred = LogReg1.predict(x_test1)

**Accuracy Score**

  • Wow, that's a pretty good score, wouldn't you say? Almost 90! Remember the distribution of classes, though. In our entire dataset there are 3668 "No" and 451 "Yes" out of 4119 total observations. If we guessed that nobody was going to convert, and therefore predicted 'No' every time, we would be correct about 89% of the time (according to our data). That being said, this accuracy is barely better than the baseline, and such an insignificant difference could just come from how our train/test split grouped the data.

metrics.accuracy_score(y_test1,y_pred)
0.898876404494382
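To make that baseline comparison concrete (an addition, not part of the original lab), the majority-class accuracy on this particular test split is:

# fraction of 'No' in the test set = accuracy of always predicting 0
1 - y_test1.mean()   # ~0.899 on this split

That matches the model's accuracy exactly, which is consistent with the model predicting 0 every time (see the confusion matrix below).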

Confusion Matrix

Looks like we have 880 true negatives and 99 false negatives. In other words, all our model is doing is predicting 'no' every time.

metrics.confusion_matrix(y_test1,y_pred)
array([[880,   0],
       [ 99,   0]])
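As an optional readability aid (an addition, not in the original lab), the same matrix can be wrapped in a labeled DataFrame:

# rows are actual classes, columns are predicted classes
pd.DataFrame(metrics.confusion_matrix(y_test1, y_pred),
             index=['actual 0', 'actual 1'],
             columns=['predicted 0', 'predicted 1'])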

**ROC AUC**

The area under the ROC curve is 0.5, which is completely worthless: the model gains no more insight than random guessing. If we go back to the accuracy score, we can now conclude that its minuscule improvement over the baseline is just an artifact of our train/test split.

metrics.roc_auc_score(y_test1,y_pred)
0.5
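One caveat worth adding (not part of the original lab): passing hard 0/1 predictions to roc_auc_score collapses the ROC curve to a single point. Scoring the predicted probabilities instead gives the usual ranking-based AUC, and may reveal signal that the thresholded predictions hide:

# use the probability of class 1 rather than the thresholded labels
y_score1 = LogReg1.predict_proba(x_test1)[:, 1]
metrics.roc_auc_score(y_test1, y_score1)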

Log Loss

metrics.log_loss(y_test1,y_pred)
3.4926852534179349

Model 2: Using age, job, marital, education, contact, and day_of_week to predict whether the customer bought or not.

# md = ModelData. get_dummies ignores numeric columns such as age and y
bank_md2 = pd.get_dummies(bank_a, drop_first=True)

# no hyperparameters for the second model either
LogReg2 = LogisticRegression()

# X and y features
X2 = bank_md2.drop('y', axis=1)
y2 = bank_md2['y']

# using train_test_split for a holdout evaluation
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=42)

# fit model
LogReg2.fit(x_train2, y_train2)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
y_pred2 = LogReg2.predict(x_test2)
metrics.accuracy_score(y_test2,y_pred2)
0.898876404494382
metrics.confusion_matrix(y_test2,y_pred2)
array([[880,   0],
       [ 99,   0]])
metrics.roc_auc_score(y_test2,y_pred2)
0.5
metrics.log_loss(y_test2,y_pred2)
3.4926852534179349

None of the metrics really changed. Looks like the features we have aren't very helpful...

Is your model not performing very well?

Let's try one more thing before we revert to grabbing more features: adjusting the probability threshold.

Use the LogisticRegression.predict_proba() method to get the probabilities.

Recall from the lesson that the first probability is for class 0 and the second is for class 1.
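If you'd rather verify that ordering than rely on memory (a small addition, not in the original lab), the column order of predict_proba matches the estimator's classes_ attribute:

# columns of predict_proba line up with classes_
LogReg2.classes_   # array([0, 1]) -> column 0 is P(y=0), column 1 is P(y=1)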

y_pred_prob = LogReg2.predict_proba(x_test2)
y_pred_prob
array([[ 0.81582351,  0.18417649],
       [ 0.93532482,  0.06467518],
       [ 0.81732417,  0.18267583],
       ...,
       [ 0.82688588,  0.17311412],
       [ 0.95322693,  0.04677307],
       [ 0.89817216,  0.10182784]])

Visualize the distribution

y_pred_prob_t = y_pred_prob.transpose()

import matplotlib.pyplot as plt
%matplotlib inline

plt.hist(y_pred_prob_t[0])
plt.show()
plt.hist(y_pred_prob_t[1])
[Histogram: predicted probabilities for class 0]

(array([ 193.,  187.,  165.,  173.,  152.,   66.,   22.,    8.,    8.,    5.]),
 array([ 0.00951752,  0.04323572,  0.07695391,  0.11067211,  0.1443903 ,
         0.1781085 ,  0.21182669,  0.24554489,  0.27926308,  0.31298128,
         0.34669947]),
 <a list of 10 Patch objects>)

[Histogram: predicted probabilities for class 1 — all values fall below ~0.35, concentrated near 0]

**Calculate a new threshold and use it to convert predicted probabilities to output classes**

Let's try decreasing the threshold to a 20% predicted probability or higher.

y_pred3 = []
for prob in y_pred_prob_t[1]:
    if prob > .20:
        y_pred3.append(1)
    else:
        y_pred3.append(0)

print(len(y_pred3))
print(len(y_test2))
979
979
y_pred3.count(1) #Actually made some predictions
55
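For what it's worth (an optional addition, not in the original lab), the same thresholding can be done in one vectorized line; the variable name y_pred3_vec is hypothetical, chosen to avoid clobbering the list above:

# boolean mask of class-1 probabilities above the threshold, cast to 0/1
y_pred3_vec = (y_pred_prob_t[1] > 0.20).astype(int)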

Evaluate the model metrics now

metrics.accuracy_score(y_test2,y_pred3)
0.86108273748723185
metrics.confusion_matrix(y_test2,y_pred3)
array([[834, 46], [ 90, 9]])
metrics.roc_auc_score(y_test2,y_pred3)
0.51931818181818179
metrics.log_loss(y_test2,y_pred3)
4.7980698377830864
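One more caveat (an addition, not part of the original lab): log_loss is meant for probabilities, so feeding it hard 0/1 labels penalizes every misclassification at maximum severity. Scoring the probabilities directly is more informative:

# log loss on the predicted probabilities rather than thresholded labels
metrics.log_loss(y_test2, y_pred_prob)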

Step 4: Build a model using all of the features.

bank_all = pd.get_dummies(bank, drop_first = True)
# L1 penalty with strong regularization for the all-features model
LogReg3 = LogisticRegression(penalty='l1', C=0.01)

# X and y features
X3 = bank_all.drop('y', axis=1)
y3 = bank_all['y']

# using train_test_split for a holdout evaluation
x_train3, x_test3, y_train3, y_test3 = train_test_split(X3, y3, random_state=42)

# fit model
LogReg3.fit(x_train3, y_train3)
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
y_pred3 = LogReg3.predict(x_test3)
metrics.confusion_matrix(y_test3, y_pred3)
array([[895,  26],
       [ 68,  41]])
metrics.roc_auc_score(y_test3, y_pred3)
0.67395830220442476
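Since the all-features model finally makes some positive predictions, precision and recall become meaningful. A quick way to see them per class (an addition, not in the original lab):

# per-class precision, recall, and F1 for the all-features model
print(metrics.classification_report(y_test3, y_pred3))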

Bonus: Use Regularization to optimize your model.

# X and y features
X = bank_all.drop('y', axis=1)
y = bank_all['y']

# using train_test_split for a holdout evaluation
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

cees = [0.01, 0.1, 1.0, 10, 100]
for c in cees:
    logreg = LogisticRegression(penalty='l1', C=c)
    logreg.fit(x_train, y_train)
    y_pred = logreg.predict(x_test)
    roc = metrics.roc_auc_score(y_test, y_pred)
    print(roc, " : ", c)
0.673958302204  :  0.01
0.681503949636  :  0.1
0.693636753031  :  1.0
0.694179641196  :  10
0.694179641196  :  100
cees = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
for c in cees:
    logreg = LogisticRegression(penalty='l1', C=c)
    logreg.fit(x_train, y_train)
    y_pred = logreg.predict(x_test)
    roc = metrics.roc_auc_score(y_test, y_pred)
    print(roc, " : ", c)
0.701182400462  :  1.1
0.701182400462  :  1.2
0.701182400462  :  1.3
0.696595244499  :  1.4
0.697681020829  :  1.5
0.697138132664  :  1.6
0.697138132664  :  1.7
0.697681020829  :  1.8
0.697681020829  :  1.9
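A more systematic alternative (a hedged sketch, not part of the original solution): cross-validate the choice of C with GridSearchCV and ROC-AUC scoring, so the selected value does not hinge on a single train/test split.

from sklearn.model_selection import GridSearchCV

# search a range of C values with 5-fold cross-validated ROC AUC;
# solver='liblinear' is stated explicitly since it supports the L1 penalty
grid = GridSearchCV(LogisticRegression(penalty='l1', solver='liblinear'),
                    param_grid={'C': [0.01, 0.1, 1.0, 1.3, 10, 100]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)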