GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Supervised Methods/Logistic_Regression on bank phone calls.ipynb
Kernel: Python 3

Implementation

  • This notebook shows how to use logistic regression to predict binary outcomes.

  • The data comes from the UCI ML repository and relates to the phone-call marketing campaigns of a banking institution.

  • The goal is to predict whether a client will subscribe to a term deposit.

About the Data

Input variables

  • 1.age (numeric)

  • 2.job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)

  • 3.marital : marital status (categorical: “divorced”, “married”, “single”, “unknown”)

  • 4.education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)

  • 5.default: has credit in default? (categorical: “no”, “yes”, “unknown”)

  • 6.housing: has housing loan? (categorical: “no”, “yes”, “unknown”)

  • 7.loan: has personal loan? (categorical: “no”, “yes”, “unknown”)

  • 8.contact: contact communication type (categorical: “cellular”, “telephone”)

  • 9.month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

  • 10.day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)

  • 11.duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). The duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (see the optional step after the data is loaded below).

  • 12.campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

  • 13.pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

  • 14.previous: number of contacts performed before this campaign and for this client (numeric)

  • 15.poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)

  • 16.emp.var.rate: employment variation rate — (numeric)

  • 17.cons.price.idx: consumer price index — (numeric)

  • 18.cons.conf.idx: consumer confidence index — (numeric)

  • 19.euribor3m: euribor 3 month rate — (numeric)

  • 20.nr.employed: number of employees — (numeric)

Output variable (desired target):

  • y: has the client subscribed to a term deposit? (binary: “1” means “yes”, “0” means “no”)

Importing required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
plt.rc("font", size=16)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
bank_calls=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\BankCalls.xlsx").parse("Calls")
bank_calls.shape
(41188, 21)
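The local Excel path above won't exist on other machines. If working from the UCI distribution of this dataset instead, the file is a semicolon-separated CSV; the filename below is the usual UCI name and is an assumption here:

# Alternative load from the UCI CSV release (assumed filename and separator)
bank_calls = pd.read_csv("bank-additional-full.csv", sep=";")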
bank_calls = bank_calls.dropna()  # dropna returns a new DataFrame; assign it back to keep the result
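As flagged for duration in the variable list, that column leaks the outcome and should be discarded for a realistic model. A minimal optional step, shown here as a sketch; the original notebook keeps the column, and all results below reflect that:

# Optional: a copy without the leaky 'duration' column (not used below)
bank_calls_realistic = bank_calls.drop(columns=['duration'])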

The prediction variable is "y": 1 = yes, 0 = no

  • Our data contains a few variables with many categories, for example education. For better modelling we need to reduce the number of categories.

bank_calls['education'].unique()
array(['basic.4y', 'unknown', 'university.degree', 'high.school', 'basic.9y', 'professional.course', 'basic.6y', 'illiterate'], dtype=object)
# grouping "basic.4y", "basic.6y" and "basic.9y" together as "Basic"
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.9y', "Basic", bank_calls['education'])
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.4y', "Basic", bank_calls['education'])
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.6y', "Basic", bank_calls['education'])
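The same grouping can be done in a single step with pandas replace, which reads more clearly than chained np.where calls; an equivalent alternative, not the original code:

# Equivalent one-step grouping of the three "basic.*" levels
bank_calls['education'] = bank_calls['education'].replace(
    ['basic.4y', 'basic.6y', 'basic.9y'], 'Basic')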

Understanding data relationships for feature selection using data exploration

bank_calls["y"].value_counts()
0    36548
1     4640
Name: y, dtype: int64
sns.countplot(x="y", data=bank_calls)
plt.show()
[Figure: count plot of the target variable y]

The outcome is imbalanced: the number of non-subscribers is much higher than the number of subscribers.

print("per_nosubscription=",36548/41188*100)
per_nosubscription= 88.73458288821988
print("per_subscription=",4640/41188*100)
per_subscription= 11.265417111780131
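The same percentages can be computed directly from the data instead of hard-coding the counts:

# Class proportions in percent, straight from the target column
print(bank_calls["y"].value_counts(normalize=True) * 100)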
# We have to balance our data, but before that let's do more exploration
bank_calls.groupby("y").mean()

Insights from above:

  • The average age of customers who bought the term deposit is higher than that of customers who didn't.

  • The number of campaign contacts (calls) is lower for customers who bought the term deposit.

bank_calls.groupby("job").mean()
bank_calls.groupby("marital").mean()
bank_calls.groupby("education").mean()
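On pandas 2.0 and later, mean() over groups that include non-numeric columns raises a TypeError instead of silently dropping them; if the cells above fail in a newer environment, pass numeric_only explicitly:

# Compatibility note for newer pandas: average only the numeric columns
bank_calls.groupby("y").mean(numeric_only=True)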
%matplotlib inline
pd.crosstab(bank_calls.job, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Job Title")
plt.xlabel("job")
plt.ylabel("Frequency of purchase")
[Figure: purchase frequency by job title]

Insights:

  • The frequency of deposit purchase depends a great deal on the job title. Thus, job can be a good predictor of the outcome variable.

t = pd.crosstab(bank_calls.marital, bank_calls.y)
t.div(t.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.title("Marital VS Purchase")
plt.xlabel("Marital status")
plt.ylabel("Proportion of Customers")
[Figure: proportion of customers purchasing, by marital status (stacked bars)]

Insights:

Marital status does not seem to be a strong predictor of the outcome variable.

t = pd.crosstab(bank_calls.education, bank_calls.y)
t.div(t.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.title("Education VS Purchase")
plt.xlabel("Education")
plt.ylabel("Proportion of Customers")
plt.savefig("education_purchase")
[Figure: proportion of customers purchasing, by education (stacked bars)]

Insights

  • Education seems a good predictor of the outcome variable

pd.crosstab(bank_calls.day_of_week, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Day of the Week")
plt.xlabel("Day of week")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by day of week]
  • Day of week is not a good predictor

pd.crosstab(bank_calls.month, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Month")
plt.xlabel("Month")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by month]
bank_calls.age.hist(stacked=True)
plt.title("Age")
plt.xlabel("age")
plt.ylabel("frequency")
[Figure: histogram of age]
bank_calls["age"].plot.hist(color='r', alpha=0.7, bins=50)
[Figure: histogram of age, 50 bins]
sns.barplot(x="poutcome", y="y", data=bank_calls, ci=False)
[Figure: mean subscription rate by poutcome]
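On seaborn 0.12 and later, the ci= parameter is deprecated in favour of errorbar=; an equivalent call for newer versions (assuming such an environment):

# Same plot on seaborn >= 0.12, with error bars disabled
sns.barplot(x="poutcome", y="y", data=bank_calls, errorbar=None)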
pd.crosstab(bank_calls.poutcome, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Poutcome")
plt.xlabel("Poutcome")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by poutcome]
  • Poutcome seems to be a good predictor of the outcome variable

# One-hot encoding: a categorical column F1 with values A and B becomes
# indicator columns F1.A and F1.B:
#
#   F1    F1.A  F1.B
#   A      1     0
#   B      0     1
#
# (Label encoding would instead map the categories to integers, e.g. A -> 0, B -> 1, C -> 2.)
new_d = pd.get_dummies(bank_calls)
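To see the effect of the encoding, one can list the dummy columns produced for education; the exact names depend on the data and on what pd.get_dummies generates:

# Inspect the indicator columns created from the 'education' variable
print([c for c in new_d.columns if c.startswith('education')])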
# Define feature matrix and target vector
X = new_d.drop('y', axis=1)
y = new_d.y

# Make train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=2)

# Initialize the logistic regression model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=2500)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)  # Predictions
y_true = y_test  # True values

# Measure accuracy
from sklearn.metrics import accuracy_score
import numpy as np
print("Train accuracy:", np.round(accuracy_score(y_train, clf.predict(X_train)), 2))
print("Test accuracy:", np.round(accuracy_score(y_true, y_pred), 2))

# Make the confusion matrix
from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_true, y_pred)
print("\nTest confusion_matrix")
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
Train accuracy: 0.91
Test accuracy: 0.9

Test confusion_matrix
[Figure: confusion-matrix heatmap of test predictions]
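Given the class imbalance noted earlier, overall accuracy can be misleading: a model that predicts "no" for everyone would already score about 0.89. A follow-up sketch reusing y_true and y_pred from the cell above; class_weight='balanced' is an illustrative option, not part of the original run:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 show how the minority class (y=1) is handled
print(classification_report(y_true, y_pred))

# Optional: re-weight classes inversely to their frequency instead of resampling
clf_w = LogisticRegression(max_iter=2500, class_weight='balanced')
clf_w.fit(X_train, y_train)
print("Balanced-weight test accuracy:", accuracy_score(y_test, clf_w.predict(X_test)))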

Dimensionality reduction

Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional one while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.

  • Multicollinearity occurs when features (input variables) are highly correlated with one or more of the other features in the dataset. It degrades the performance of regression and classification models. PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables.

  • PCA is a linear dimensionality-reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while retaining as much of the variation in the original dataset as possible.
    - If the variables are not measured on a similar scale, we need to apply feature scaling before PCA, because the PCA directions are highly sensitive to the scale of the data.
    - The most important part of PCA is selecting the best number of components for the given dataset. A minimal sketch follows this list.
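A minimal sketch of scaling followed by PCA on the encoded bank data, assuming new_d from the modelling section above; StandardScaler and the 95%-variance threshold are illustrative choices, not from the original notebook:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize first: PCA directions are highly sensitive to feature scale
X_scaled = StandardScaler().fit_transform(new_d.drop('y', axis=1))

# Keep the smallest number of components that explains 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)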