Handling imbalanced data in customer churn prediction
Customer churn prediction means identifying which customers are leaving a business and why. In this tutorial we will look at customer churn in the telecom business. We will build a deep learning model to predict churn and use precision, recall and f1-score to measure its performance. We will then handle the class imbalance in the data using various techniques and improve the f1-score.
Load the data
First of all, drop the customerID column: it is a unique identifier with no predictive value.
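The two steps above can be sketched as follows. A tiny inline DataFrame stands in for the Telco churn CSV here (the real notebook loads it with `pd.read_csv`; the exact filename is not shown in this excerpt):

```python
import pandas as pd

# Stand-in rows mimicking the Telco churn dataset's schema
df = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "tenure": [1, 34],
    "TotalCharges": ["29.85", "1889.5"],
    "Churn": ["No", "No"],
})

# customerID is a unique identifier with no predictive value, so drop it
df = df.drop("customerID", axis="columns")
print(df.columns.tolist())  # → ['tenure', 'TotalCharges', 'Churn']
```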
A quick glance at the above makes me realize that TotalCharges should be a float, but it is an object. Let's check what's going on with this column.
Ahh... it is a string. Let's convert it to numbers.
Remove rows with space in TotalCharges
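A minimal sketch of both steps, using a toy column that reproduces the issue (in the real data a handful of TotalCharges cells contain a single space):

```python
import pandas as pd

# TotalCharges stored as strings; one row holds just a space instead of a number
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Rows whose value is just a space cannot be converted, so drop them first
df = df[df["TotalCharges"] != " "]

# The remaining strings now convert cleanly to floats
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])
print(df["TotalCharges"].dtype)  # float64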
Data Visualization
Many of the columns contain values like Yes, No, etc. Let's print the unique values in the object columns to see the data values.
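One way to do this is to loop over only the string-typed columns, as sketched below on a few stand-in columns:

```python
import pandas as pd

df = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No"],
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "tenure": [1, 34, 2],  # numeric, so it is skipped below
})

# select_dtypes(include="object") picks only the string columns
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].unique()}")
```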
Some of the columns have "No internet service" or "No phone service"; those can be replaced with a simple "No".
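A one-line `replace` handles this across the whole frame, sketched here on toy columns:

```python
import pandas as pd

df = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# Both variants carry the same information as a plain "No"
df = df.replace({"No internet service": "No", "No phone service": "No"})
print(df.OnlineSecurity.unique())
```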
Convert Yes and No to 1 and 0.
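A sketch of the conversion; in the notebook the column list covers every Yes/No column in the dataset, here it is just two stand-ins:

```python
import pandas as pd

df = pd.DataFrame({
    "Partner": ["Yes", "No"],
    "Churn": ["No", "Yes"],
})

yes_no_columns = ["Partner", "Churn"]  # stand-in for the full list of Yes/No columns
for col in yes_no_columns:
    # map() turns each Yes/No string into a 1/0 integer
    df[col] = df[col].map({"Yes": 1, "No": 0})
print(df)
```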
One hot encoding for categorical columns
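`pd.get_dummies` expands each multi-valued categorical column into 0/1 indicator columns; a minimal sketch on two stand-in columns:

```python
import pandas as pd

df = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No"],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

# Each category value becomes its own indicator column, e.g. InternetService_DSL
df2 = pd.get_dummies(df, columns=["InternetService", "Contract"])
print(df2.columns.tolist())
```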
Train test split
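A sketch of the split, on a toy feature matrix standing in for the preprocessed churn frame. Scaling the numeric columns into [0, 1] and the `random_state` value are assumptions; `stratify=y` keeps the class ratio the same in both splits, which matters for imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data: 10 samples, 2 features, imbalanced 7:3 labels
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Scale numeric features (e.g. tenure, MonthlyCharges, TotalCharges) into [0, 1]
X = MinMaxScaler().fit_transform(X)

# stratify=y preserves the 7:3 class ratio in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=15, stratify=y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```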
Build a model (ANN) in tensorflow/keras
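A sketch of such a network on synthetic data. The layer sizes, epoch count, and the 26-feature input width are illustrative assumptions, not the notebook's exact architecture; the shape of the model (dense ReLU layers feeding a sigmoid output trained with binary cross-entropy) is the standard setup for binary churn prediction:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for the preprocessed churn data: 100 samples, 26 features
X = np.random.rand(100, 26).astype("float32")
y = np.random.randint(0, 2, 100)

# Small fully connected network; sigmoid output gives P(churn)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(26,)),
    tf.keras.layers.Dense(26, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, verbose=0)

# Threshold the sigmoid output at 0.5 to get hard 0/1 predictions
preds = (model.predict(X, verbose=0) > 0.5).astype(int)
```

With hard predictions in hand, `sklearn.metrics.classification_report(y, preds)` prints the per-class precision, recall, and f1-score used throughout this tutorial.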
Mitigating Skewness of the Data
Method 1: Undersampling
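Undersampling randomly discards majority-class rows until both classes are the same size. A minimal sketch on a toy frame (the `random_state` is an assumption):

```python
import pandas as pd

# Toy imbalanced frame: 7 majority (class 0) vs 3 minority (class 1) rows
df = pd.DataFrame({"feature": range(10), "Churn": [0] * 7 + [1] * 3})

df_majority = df[df.Churn == 0]
df_minority = df[df.Churn == 1]

# Randomly sample the majority class down to the minority class size
df_majority_under = df_majority.sample(len(df_minority), random_state=42)
df_balanced = pd.concat([df_majority_under, df_minority])
print(df_balanced.Churn.value_counts())
```

The model is then retrained on `df_balanced`; the cost is that most majority-class rows are thrown away.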
The classification report is printed at the end; scroll down past the last epoch to see it.
Check the classification report above. The f1-score for minority class 1 improved from 0.57 to 0.76. The score for class 0 dropped from 0.85 to 0.75, but that's OK: we now have a more generalized classifier that predicts both classes with similar scores.
Method 2: Oversampling
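Oversampling goes the other way: it duplicates minority-class rows (sampling with replacement) until both classes are the same size, so no majority data is discarded. A sketch on the same toy frame:

```python
import pandas as pd

# Toy imbalanced frame: 7 majority (class 0) vs 3 minority (class 1) rows
df = pd.DataFrame({"feature": range(10), "Churn": [0] * 7 + [1] * 3})

df_majority = df[df.Churn == 0]
df_minority = df[df.Churn == 1]

# Sample the minority class WITH replacement up to the majority class size
df_minority_over = df_minority.sample(len(df_majority), replace=True, random_state=42)
df_balanced = pd.concat([df_majority, df_minority_over])
print(df_balanced.Churn.value_counts())
```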
Check the classification report above: the f1-score for minority class 1 again improved from 0.57 to 0.76, with the class 0 score dropping from 0.85 to 0.75. As with undersampling, the classifier is more balanced across the two classes.
Method 3: SMOTE
To install the imbalanced-learn library, run pip install imbalanced-learn.
SMOTE oversampling increases the f1-score of minority class 1 from 0.57 to 0.81 (a huge improvement). The overall accuracy also improves, from 0.78 to 0.80.
Method 4: Ensemble with undersampling
model1 --> class1(1495) + class0(0, 1495)
model2 --> class1(1495) + class0(1496, 2990)
model3 --> class1(1495) + class0(2990, 4130)
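The idea above: split the majority class into three batches, train one model per batch paired with all minority samples, then combine the three models by majority vote. A sketch on synthetic data, using `LogisticRegression` as a stand-in for the notebook's ANN so the structure stays short:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Toy imbalanced data: 90 majority (class 0) vs 30 minority (class 1) samples
X0 = rng.normal(0.0, 1.0, size=(90, 3))
X1 = rng.normal(1.5, 1.0, size=(30, 3))

# Split the majority class into 3 batches (the notebook slices class 0
# into rows 0-1495, 1496-2990, and 2990-4130)
batches = np.array_split(X0, 3)

# Train one model per batch, each paired with ALL minority samples
models = []
for X0_batch in batches:
    X_bal = np.vstack([X0_batch, X1])
    y_bal = np.array([0] * len(X0_batch) + [1] * len(X1))
    models.append(LogisticRegression().fit(X_bal, y_bal))

# Final prediction = majority vote (at least 2 of 3 models say churn)
votes = np.stack([m.predict(X1) for m in models])
y_pred = (votes.sum(axis=0) >= 2).astype(int)
```

Each model sees balanced data, yet across the ensemble no majority-class row is wasted.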
The f1-score for minority class 1 improved from 0.57 to 0.62. The score for majority class 0 suffers, dropping from 0.85 to 0.80, but at least there is some balance in prediction accuracy across the two classes.