YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/knn_classification_imputation-lab.ipynb
1904 views
Kernel: Python 3

KNN Classification and Imputation: Cell Phone Churn Data

Author: Kiefer Katovich (SF)


In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset covers "churn" in cell phone plans. It has information on account holders' phone usage and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.
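The idea behind KNN imputation can be sketched with scikit-learn's built-in `KNNImputer` (the lab builds a classifier-based version by hand later; the tiny array below is made up purely for illustration). Each missing value is filled in from the rows most similar on the features that are present:

```python
# Minimal sketch of KNN imputation on a toy array (not the churn data).
# The missing value in row 2 is filled from the 2 nearest complete rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [1.2, 2.2],
              [1.1, np.nan],
              [9.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 0 and 1 are the nearest neighbors of row 2 on the first feature,
# so the missing value becomes the mean of their second feature: 2.1.
print(X_filled)
```

Because distances here are dominated by the first column, the far-away row `[9.0, 8.0]` has no influence on the imputed value.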

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

1. Load the cell phone "churn" data containing some missing values.

churn = pd.read_csv('../assets/data/churn_missing.csv')

2. Examine the data. What columns have missing values?

Remember to use our standard 4 or 5 commands (head, describe, info, isnull, dtypes....)

# A:
churn.head()
churn.describe()
churn.isnull().sum()
state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64
churn.dtypes
state              object
account_length      int64
area_code           int64
intl_plan          object
vmail_plan         object
vmail_message     float64
day_mins          float64
day_calls           int64
day_charge        float64
eve_mins          float64
eve_calls           int64
eve_charge        float64
night_mins        float64
night_calls         int64
night_charge      float64
intl_mins         float64
intl_calls          int64
intl_charge       float64
custserv_calls      int64
churn                bool
dtype: object

3. Convert the vmail_plan and intl_plan columns to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.

# A: Some code to help you - turns vmail_plan into a 0/1 column,
# leaving missing values untouched.
churn.loc[:, 'vmail_plan'] = churn.vmail_plan.map(
    lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn.loc[:, 'intl_plan'] = churn.intl_plan.map(
    lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn['intl_plan'].dtypes
dtype('int64')
churn.state.value_counts()
WV    106
MN     84
NY     83
AL     80
OR     78
WI     78
OH     78
WY     77
VA     77
CT     74
ID     73
MI     73
VT     73
TX     72
UT     72
IN     71
KS     70
MD     70
NJ     68
MT     68
NC     68
NV     66
WA     66
CO     66
RI     65
MS     65
MA     65
AZ     64
MO     63
FL     63
ND     62
NM     62
ME     62
NE     61
OK     61
DE     61
SC     60
SD     60
KY     59
IL     58
NH     56
AR     55
GA     54
DC     54
TN     53
HI     53
AK     52
LA     51
PA     45
IA     44
CA     34
Name: state, dtype: int64

4. Create dummy coded columns for state and concatenate it to the churn dataset.

Remember: You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

use pd.get_dummies(..., drop_first = True)

# A:
states = pd.get_dummies(churn.state, drop_first=True)
states.head(3)  # drop_first=True drops the first category alphabetically, in this case AK

states.shape  # 50 states plus DC = 51 categories; one is dropped as the reference, leaving 50 columns
(3333, 50)

## Concatenate back to a single dataset
churn = pd.concat([churn, states], axis=1)

5. Create a version of the churn data that has no missing values.

Use dropna(). Calculate the shape

# A:
churn_nona = churn.dropna()
churn_nona.info()
churn_nona.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2933 entries, 0 to 3332
Columns: 120 entries, state to WY
dtypes: bool(1), float64(10), int64(8), object(1), uint8(100)
memory usage: 747.6+ KB
(2933, 120)

6. Create a target vector and predictor matrix.

  • Target should be the churn column.

  • Predictor matrix should be all columns except area_code, state, and churn.

# A:
X = churn_nona.drop(['area_code', 'state', 'churn'], axis=1)
y = churn_nona.churn.values

7. Calculate the baseline accuracy for churn.

What percent of the churn target values (y) == 1? (this is just the average value of the column. Why is that?)

# A:
y.mean()
# y.sum()
# y.sum() / len(y)
churn_nona.churn.mean()  # less than 0.5
0.14353903852710534
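The "why is that?" above comes down to arithmetic: summing a 0/1 vector counts the ones, so dividing by the length gives the fraction of ones, which is exactly the mean. A toy vector makes it concrete:

```python
# For a binary (0/1) vector, mean == proportion of ones.
import numpy as np

y = np.array([0, 0, 1, 0, 1])
rate = y.mean()           # (0 + 0 + 1 + 0 + 1) / 5
print(rate)               # fraction of ones in the vector
```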

8. Cross-validate a KNN model predicting churn.

  • Number of neighbors should be 5.

  • Make sure to standardize the predictor matrix.

  • Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

# A:
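One way to sketch this step, shown on synthetic data from `make_classification` so the cell runs on its own (swap in the `X` and `y` built in step 6):

```python
# Sketch of step 8: standardize, then 10-fold cross-validate a 5-NN classifier.
# Synthetic data stands in for the churn predictors here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardize the predictors: KNN is distance-based, so features on large
# scales would otherwise dominate the distance calculation.
Xs = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, Xs, y, cv=10)
print(scores.mean())
```

Standardizing before KNN matters because Euclidean distance treats every unit of every feature equally; minutes-of-use columns in the hundreds would swamp 0/1 plan flags without it.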

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each score. What is the best accuracy?

# A:
k_values = list(range(1, 50, 2))
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, Xs, y, cv=10)  # Xs: the standardized predictor matrix from step 8
    accs.append(np.mean(scores))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_values, accs, lw=3)
plt.show()
print(np.max(accs))
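For reference, a self-contained version of the k sweep, run on synthetic data (`Xs` here stands in for the standardized churn predictors from step 8):

```python
# Sweep odd k from 1 to 49, cross-validating a KNN classifier at each k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
Xs = StandardScaler().fit_transform(X)

k_values = list(range(1, 50, 2))
accs = [cross_val_score(KNeighborsClassifier(n_neighbors=k), Xs, y, cv=10).mean()
        for k in k_values]

best_k = k_values[int(np.argmax(accs))]
print(best_k, max(accs))
```

Only odd k are used so that, in a two-class problem, the neighbor vote can never tie.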