Implementation
This notebook explains how to use logistic regression to predict a binary outcome.
The data used here is from the UCI Machine Learning repository. It relates to direct marketing campaigns, based on phone calls, of a banking institution.
The goal is to predict whether or not the client will subscribe to a term deposit.
About Data
Input variables
1. age (numeric)
2. job: type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
3. marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)
4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
8. contact: contact communication type (categorical: “cellular”, “telephone”)
9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). The duration is not known before a call is performed; after the end of the call, y is obviously known. Thus, this input should only be included for benchmarking purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
16. emp.var.rate: employment variation rate (numeric)
17. cons.price.idx: consumer price index (numeric)
18. cons.conf.idx: consumer confidence index (numeric)
19. euribor3m: euribor 3 month rate (numeric)
20. nr.employed: number of employees (numeric)
Predicted variable (desired target):
y: has the client subscribed to a term deposit? (binary: 1 means “yes”, 0 means “no”)
Importing required modules
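The original import cell is not reproduced here; a minimal sketch might look like the following. The file name `banking.csv` is an assumption (the raw UCI file is typically distributed as `bank-additional-full.csv` and uses `;` as the field separator), so adjust the `read_csv` call to your copy of the data. Later sketches in this notebook reuse the `data` frame and imports from this cell.

```python
# Minimal import sketch; 'banking.csv' is an assumed file name for the
# UCI bank-marketing data (the raw UCI file uses ';' as the separator).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('banking.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))
```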
The prediction variable is "y": 1 means yes, 0 means no.
Our data contains a few variables with many categories, for example education. For better modelling we need to group these categories, as in the sketch below.
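As a minimal sketch of such a grouping, assuming the education levels listed in the data description above, the three `basic.*` levels can be collapsed into a single `Basic` group:

```python
# Collapse the three "basic.*" education levels into one "Basic" category
for level in ['basic.4y', 'basic.6y', 'basic.9y']:
    data['education'] = np.where(data['education'] == level, 'Basic', data['education'])
print(data['education'].unique())
```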
Understanding Data Relationships for Feature Selection Using Data Exploration
Our outcome variable is imbalanced: the number of no-subscription records is much higher than the number of subscriptions, as the quick check below shows.
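A quick way to quantify the imbalance (a sketch reusing `data` from the import cell, and assuming `y` is coded 0/1 as described above):

```python
# Class balance of the target: raw counts and the subscription rate
print(data['y'].value_counts())
pct_yes = data['y'].mean()  # with y coded 0/1, the mean is the subscription rate
print('subscription rate: {:.1%}'.format(pct_yes))
```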
Insights from above:
The average age of customers who bought the term deposit is higher than that of customers who didn't (see the groupby sketch below).
The number of campaign contacts (calls) is lower for customers who bought the term deposit.
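A minimal sketch of how such group means are computed, using the numeric columns from the data description:

```python
# Compare numeric feature means between non-subscribers (y=0) and subscribers (y=1)
print(data.groupby('y')[['age', 'campaign', 'pdays', 'previous']].mean())
```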
Insights:
The frequency of purchase of the deposit depends a great deal on the job title. Thus, job can be a good predictor of the outcome variable (see the crosstab sketch below).
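One way to see this is a crosstab bar chart of job against the outcome; a sketch:

```python
# Purchase frequency per job title via a crosstab bar chart
pd.crosstab(data['job'], data['y']).plot(kind='bar')
plt.title('Purchase Frequency per Job Title')
plt.xlabel('Job')
plt.ylabel('Frequency of Purchase')
plt.show()
```

The same crosstab pattern applies to marital status, education, day of week, and poutcome below; only the column name changes.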
Insights:
Marital status does not seem to be a strong predictor of the outcome variable.
Insights:
Education seems to be a good predictor of the outcome variable.
Day of the week is not a good predictor.
Poutcome seems to be a good predictor of the outcome variable.
Dimensionality reduction
A dimensionality-reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional one while ensuring that it provides similar information." These techniques are widely used in machine learning to obtain a better-fitting predictive model when solving classification and regression problems.
Multicollinearity occurs when features (input variables) are highly correlated with one or more of the other features in the dataset. It affects the performance of regression and classification models. PCA (Principal Component Analysis) takes advantage of multicollinearity and combines the highly correlated variables into a set of uncorrelated variables. The correlation check below shows this directly for our numeric features.
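Before applying PCA it is worth inspecting the correlations directly; in this dataset the macroeconomic indicators (emp.var.rate, euribor3m, nr.employed) tend to be strongly correlated. A minimal sketch, assuming the dotted UCI column spelling (some copies of the file use underscores instead):

```python
# Correlation heatmap of the numeric features to spot multicollinearity.
# Column names use the dotted UCI spelling; adjust if your copy uses underscores.
num_cols = ['age', 'campaign', 'pdays', 'previous',
            'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
            'euribor3m', 'nr.employed']
corr = data[num_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between numeric features')
plt.show()
```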
PCA
PCA is a linear dimensionality-reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while retaining as much of the variation in the original dataset as possible.
- If the variables are not measured on a similar scale, we need to do feature scaling before applying PCA, because the PCA directions are highly sensitive to the scale of the data.
- The most important part of PCA is selecting the best number of components for the given dataset (see the sketch below).
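A minimal PCA sketch under these caveats: standardize first, then inspect the cumulative explained variance to choose k. It reuses `num_cols` from the correlation check above, and the 95% variance threshold is an illustrative choice, not a rule.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA directions are highly sensitive to feature scale
X_scaled = StandardScaler().fit_transform(data[num_cols])

# Fit PCA with all components and inspect the cumulative explained variance
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)

# Keep the smallest k that explains, say, 95% of the variance (threshold is a choice)
k = int(np.argmax(cum_var >= 0.95)) + 1
X_pca = PCA(n_components=k).fit_transform(X_scaled)
print('components kept:', k, '-> transformed shape:', X_pca.shape)
```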