GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Supervised Methods/Logistic_Regression on bank phone calls.ipynb
Kernel: Python 3

Implementation

  • This notebook shows how to use logistic regression to predict binary outcomes.

  • The data comes from the UCI ML repository and relates to the phone-call marketing campaigns of a banking institution.

  • The goal is to predict whether a client will subscribe to a term deposit.

About the Data

Input variables

  • 1.age (numeric)

  • 2.job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)

  • 3.marital : marital status (categorical: “divorced”, “married”, “single”, “unknown”)

  • 4.education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)

  • 5.default: has credit in default? (categorical: “no”, “yes”, “unknown”)

  • 6.housing: has housing loan? (categorical: “no”, “yes”, “unknown”)

  • 7.loan: has personal loan? (categorical: “no”, “yes”, “unknown”)

  • 8.contact: contact communication type (categorical: “cellular”, “telephone”)

  • 9.month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

  • 10.day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)

  • 11.duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). The duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (see the optional step after the data is loaded below).

  • 12.campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

  • 13.pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

  • 14.previous: number of contacts performed before this campaign and for this client (numeric)

  • 15.poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)

  • 16.emp.var.rate: employment variation rate — (numeric)

  • 17.cons.price.idx: consumer price index — (numeric)

  • 18.cons.conf.idx: consumer confidence index — (numeric)

  • 19.euribor3m: euribor 3 month rate — (numeric)

  • 20.nr.employed: number of employees — (numeric)

Output variable (desired target):

  • y: has the client subscribed to a term deposit? (binary: “1” means “yes”, “0” means “no”)

Importing required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
plt.rc("font", size=16)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
bank_calls=pd.ExcelFile(r"C:\Users\suyashi144893\Documents\data Sets\BankCalls.xlsx").parse("Calls")
bank_calls.shape
(41188, 21)
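The local Excel path above won't exist on other machines. If working from the UCI distribution of this dataset instead, the file is a semicolon-separated CSV; the filename below is the usual UCI name and is an assumption here:

# Alternative load from the UCI CSV release (assumed filename and separator)
bank_calls = pd.read_csv("bank-additional-full.csv", sep=";")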
bank_calls = bank_calls.dropna()  # dropna returns a new DataFrame; assign it back to keep the result
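As flagged for duration in the variable list, that column leaks the outcome and should be discarded for a realistic model. A minimal optional step, shown here as a sketch; the original notebook keeps the column, and all results below reflect that:

# Optional: a copy without the leaky 'duration' column (not used below)
bank_calls_realistic = bank_calls.drop(columns=['duration'])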

The prediction variable is "y": 1 = yes, 0 = no

  • Our data contains a few variables with many categories, for example education. For better modelling we need to reduce the number of categories.

bank_calls['education'].unique()
array(['basic.4y', 'unknown', 'university.degree', 'high.school', 'basic.9y', 'professional.course', 'basic.6y', 'illiterate'], dtype=object)
# grouping "basic.4y", "basic.6y" and "basic.9y" together as "Basic"
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.9y', "Basic", bank_calls['education'])
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.4y', "Basic", bank_calls['education'])
bank_calls['education'] = np.where(bank_calls['education'] == 'basic.6y', "Basic", bank_calls['education'])
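The same grouping can be done in a single step with pandas replace, which reads more clearly than chained np.where calls; an equivalent alternative, not the original code:

# Equivalent one-step grouping of the three "basic.*" levels
bank_calls['education'] = bank_calls['education'].replace(
    ['basic.4y', 'basic.6y', 'basic.9y'], 'Basic')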

Understanding data relationships for feature selection using data exploration

bank_calls["y"].value_counts()
0    36548
1     4640
Name: y, dtype: int64
sns.countplot(x="y", data=bank_calls)
plt.show()
[Figure: count plot of the target variable y]

The outcome is imbalanced: the number of non-subscribers is much higher than the number of subscribers.

print("per_nosubscription=",36548/41188*100)
per_nosubscription= 88.73458288821988
print("per_subscription=",4640/41188*100)
per_subscription= 11.265417111780131
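The same percentages can be computed directly from the data instead of hard-coding the counts:

# Class proportions in percent, straight from the target column
print(bank_calls["y"].value_counts(normalize=True) * 100)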
# We have to balance our data, but before that let's do more exploration
bank_calls.groupby("y").mean()

Insights from above:

  • The average age of customers who bought the term deposit is higher than that of customers who didn't.

  • The number of campaign contacts (calls) is lower for customers who bought the term deposit.

bank_calls.groupby("job").mean()
bank_calls.groupby("marital").mean()
bank_calls.groupby("education").mean()
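On pandas 2.0 and later, mean() over groups that include non-numeric columns raises a TypeError instead of silently dropping them; if the cells above fail in a newer environment, pass numeric_only explicitly:

# Compatibility note for newer pandas: average only the numeric columns
bank_calls.groupby("y").mean(numeric_only=True)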
%matplotlib inline
pd.crosstab(bank_calls.job, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Job Title")
plt.xlabel("job")
plt.ylabel("Frequency of purchase")
[Figure: purchase frequency by job title]

Insights:

  • The frequency of deposit purchase depends a great deal on the job title. Thus, job can be a good predictor of the outcome variable.

t = pd.crosstab(bank_calls.marital, bank_calls.y)
t.div(t.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.title("Marital VS Purchase")
plt.xlabel("Marital status")
plt.ylabel("Proportion of Customers")
[Figure: proportion of customers purchasing, by marital status (stacked bars)]

Insights:

Marital status does not seem to be a strong predictor of the outcome variable.

t = pd.crosstab(bank_calls.education, bank_calls.y)
t.div(t.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.title("Education VS Purchase")
plt.xlabel("Education")
plt.ylabel("Proportion of Customers")
plt.savefig("education_purchase")
[Figure: proportion of customers purchasing, by education (stacked bars)]

Insights

  • Education seems a good predictor of the outcome variable

pd.crosstab(bank_calls.day_of_week, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Day of the Week")
plt.xlabel("Day of week")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by day of week]
  • Day of week is not a good predictor

pd.crosstab(bank_calls.month, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Month")
plt.xlabel("Month")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by month]
bank_calls.age.hist(stacked=True)
plt.title("Age")
plt.xlabel("age")
plt.ylabel("frequency")
[Figure: histogram of age]
bank_calls["age"].plot.hist(color='r', alpha=0.7, bins=50)
[Figure: histogram of age, 50 bins]
sns.barplot(x="poutcome", y="y", data=bank_calls, ci=False)
[Figure: mean subscription rate by poutcome]
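On seaborn 0.12 and later, the ci= parameter is deprecated in favour of errorbar=; an equivalent call for newer versions (assuming such an environment):

# Same plot on seaborn >= 0.12, with error bars disabled
sns.barplot(x="poutcome", y="y", data=bank_calls, errorbar=None)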
pd.crosstab(bank_calls.poutcome, bank_calls.y).plot(kind="bar")
plt.title("Purchase frequency for Poutcome")
plt.xlabel("Poutcome")
plt.ylabel("Purchase frequency")
[Figure: purchase frequency by poutcome]
  • Poutcome seems to be a good predictor of the outcome variable

# One-hot encoding: a categorical column F1 with values A and B becomes
# indicator columns F1.A and F1.B:
#
#   F1    F1.A  F1.B
#   A      1     0
#   B      0     1
#
# (Label encoding would instead map the categories to integers, e.g. A -> 0, B -> 1, C -> 2.)
new_d = pd.get_dummies(bank_calls)
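To see the effect of the encoding, one can list the dummy columns produced for education; the exact names depend on the data and on what pd.get_dummies generates:

# Inspect the indicator columns created from the 'education' variable
print([c for c in new_d.columns if c.startswith('education')])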
# Define feature matrix and target vector
X = new_d.drop('y', axis=1)
y = new_d.y

# Make train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=2)

# Initialize the logistic regression model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=2500)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)  # Predictions
y_true = y_test  # True values

# Measure accuracy
from sklearn.metrics import accuracy_score
import numpy as np
print("Train accuracy:", np.round(accuracy_score(y_train, clf.predict(X_train)), 2))
print("Test accuracy:", np.round(accuracy_score(y_true, y_pred), 2))

# Make the confusion matrix
from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_true, y_pred)
print("\nTest confusion_matrix")
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
Train accuracy: 0.91
Test accuracy: 0.9

Test confusion_matrix
[Figure: confusion-matrix heatmap of test predictions]
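Given the class imbalance noted earlier, overall accuracy can be misleading: a model that predicts "no" for everyone would already score about 0.89. A follow-up sketch reusing y_true and y_pred from the cell above; class_weight='balanced' is an illustrative option, not part of the original run:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 show how the minority class (y=1) is handled
print(classification_report(y_true, y_pred))

# Optional: re-weight classes inversely to their frequency instead of resampling
clf_w = LogisticRegression(max_iter=2500, class_weight='balanced')
clf_w.fit(X_train, y_train)
print("Balanced-weight test accuracy:", accuracy_score(y_test, clf_w.predict(X_test)))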

Dimensionality reduction

Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional one while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.

  • Multicollinearity occurs when features (input variables) are highly correlated with one or more of the other features in the dataset. It degrades the performance of regression and classification models. PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables.

  • PCA is a linear dimensionality-reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while retaining as much of the variation in the original dataset as possible.
    - If the variables are not measured on a similar scale, we need to apply feature scaling before PCA, because the PCA directions are highly sensitive to the scale of the data.
    - The most important part of PCA is selecting the best number of components for the given dataset. A minimal sketch follows this list.
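A minimal sketch of scaling followed by PCA on the encoded bank data, assuming new_d from the modelling section above; StandardScaler and the 95%-variance threshold are illustrative choices, not from the original notebook:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize first: PCA directions are highly sensitive to feature scale
X_scaled = StandardScaler().fit_transform(new_d.drop('y', axis=1))

# Keep the smallest number of components that explains 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)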