GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/solution-code/knn_classification_imputation-lab-solutions - (done).ipynb
¹⁹⁰⁴ views

Kernel: Python 3

KNN Classification and Imputation: Cell Phone Churn Data

Authors: Kiefer Katovich (SF)

In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset is one on "churn" in cell phone plans. It has information on the usage of the phones by different account holders and whether or not they churned or not.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.

In [2]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier

1. Load the cell phone "churn" data containing some missing values.

In [3]:

churn = pd.read_csv('../../assets/data/churn_missing.csv')

2. Examine the data. What columns have missing values?

In [4]:

churn.head(3)

Out[4]:

In [5]:

churn.shape

Out[5]:

(3333, 20)

In [6]:

churn.isnull().sum()

Out[6]:

state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64

In [7]:

# about 12% ofvmail plan & message are null

In [8]:

churn.dtypes

Out[8]:

state              object
account_length      int64
area_code           int64
intl_plan          object
vmail_plan         object
vmail_message     float64
day_mins          float64
day_calls           int64
day_charge        float64
eve_mins          float64
eve_calls           int64
eve_charge        float64
night_mins        float64
night_calls         int64
night_charge      float64
intl_mins         float64
intl_calls          int64
intl_charge       float64
custserv_calls      int64
churn                bool
dtype: object

In [9]:

churn.intl_plan.value_counts(dropna=False)

Out[9]:

no     3010
yes     323
Name: intl_plan, dtype: int64

In [11]:

churn.vmail_plan.value_counts(dropna=False)    #dropna=True will drop na's

Out[11]:

no     2130
yes     803
NaN     400
Name: vmail_plan, dtype: int64

In [12]:

churn['state'].value_counts()

Out[12]:

WV    106
MN     84
NY     83
AL     80
OR     78
WI     78
OH     78
WY     77
VA     77
CT     74
MI     73
ID     73
VT     73
UT     72
TX     72
IN     71
MD     70
KS     70
MT     68
NJ     68
NC     68
WA     66
NV     66
CO     66
RI     65
MS     65
MA     65
AZ     64
FL     63
MO     63
NM     62
ND     62
ME     62
NE     61
OK     61
DE     61
SC     60
SD     60
KY     59
IL     58
NH     56
AR     55
DC     54
GA     54
TN     53
HI     53
AK     52
LA     51
PA     45
IA     44
CA     34
Name: state, dtype: int64

In [13]:

#Note: DC is being counted as a state

3. Convert the `vmail_plan` and `intl_plan` colums to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.

In [14]:

churn.loc[:,'vmail_plan'] = churn.vmail_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn.loc[:,'intl_plan'] = churn.intl_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)

4. Create dummy coded columns for state and concatenate it to the churn dataset.

Remember: You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

In [15]:

states = pd.get_dummies(churn.state, drop_first=True)
states.head(3)
# drop_first removed Alaska(AK) from being converted to a dummy column

Out[15]:

In [20]:

states.shape

Out[20]:

(3333, 50)

In [21]:

print(len(churn.state.unique()))

Out[21]:

51

In [16]:

churn = pd.concat([churn, states], axis=1) #smooshed side by side, otherwise 0 would be top and bottom

5. Create a version of the churn data that has no missing values.

Calculate the shape

In [17]:

churn_nona = churn.dropna()
churn_nona.shape

Out[17]:

(2933, 70)

6. Create a target vector and predictor matrix.

Target should be the churn column.
Predictor matrix should be all columns except area_code, state, and churn.

In [19]:

X = churn_nona.drop(['area_code','state','churn'], axis =1)
y = churn_nona.churn.values

7. Calculate the baseline accuracy for `churn`.

In [20]:

churn_nona.churn.mean()
# Less than 0.5

Out[20]:

0.14353903852710534

In [22]:

baseline = 1. - churn_nona.churn.mean() #if you entered zero as your prediction, you would hit 85% accuracy
print(baseline)

Out[22]:

0.8564609614728946

8. Cross-validate a KNN model predicting `churn`.

Number of neighbors should be 5.
Make sure to standardize the predictor matrix.
Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

In [23]:

knn = KNeighborsClassifier(n_neighbors=5)

In [24]:

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

In [28]:

ss = StandardScaler() #this like a z score, this is a class
Xs = ss.fit_transform(X) #fit_transform is a function within the stabdardscaler class, is an operation that fits the scaler, and transforms the results on the x values

In [27]:

scores = cross_val_score(knn, Xs, y, cv=10) #knn=model, Xs=x values transformed, y=actual y values, cv=how many folds you want
print(scores)
print(np.mean(scores))

Out[27]:

[0.84745763 0.86054422 0.86006826 0.85665529 0.87030717 0.85665529
 0.85665529 0.85324232 0.85665529 0.85665529]
0.8574896042757937

note that the knn model produced a result that was not necessarily better than the baseline

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each score. What is the best accuracy?

In [30]:

k_values = list(range(1,50,2)) #adds 2 every time, so at 1 it will equal 3, and at 3 it wil equal 5 etc.
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k) #this initiates the knn model
    scores = cross_val_score(knn, Xs, y, cv=10) #this runs the knn model 10 times
    accs.append(np.mean(scores)) #this gives the mea of the 10 models we just ran

In [33]:

fig, ax = plt.subplots(figsize=(8,5))
ax.plot(k_values, accs, lw=3) #lw is line width, can play with # to increase or decrease
plt.show()

print(np.max(accs))

Out[33]:

0.8581698921253003

In [34]:

# it looks like there is a peak at about k=7 and then levels out after k=17, optimal k value = 7

10. Imputing with KNN

K-Nearest Neighbors can be used to impute missing values in datasets. What we will do is estimate the most likely value for the missing data based on a KNN model.

We have two columns with missing data:

vmail_plan
vmail_message

10.A: Create two subsets of the churn dataset: one without missing values for vmail_plan and vmail_message, and one with the missing values.

In [35]:

from sklearn.neighbors import KNeighborsRegressor
missing_cols = ['vmail_plan','vmail_message']

In [36]:

impute_missing = churn.loc[churn.vmail_plan.isnull(), :] #this gives all the rows where vmail_plain is null and all the columns
impute_valid = churn.loc[~churn.vmail_plan.isnull(), :] #this is the inverse

First we will impute values for vmail_plan. This is a categorical column and so we will impute using classification (predicting whether the plan is yes or no, 1 vs. 0).

10.B: Create a target that is vmail_plan and predictor matrix that is all columns except state, area_code, churn, vmail_plan, and vmail_message.

Create a target (prediction or y value) and predictor matrix (x values).

Note: We don't include the churn variable in the model to impute. Why? We are imputing these missing values so that we can use the rows to predict churn with more data afterwards. If we imputed with churn as a predictor then we would be cheating.

this is aka data leakage, bc if the model knows what churn is you will be performing very well, simply bc the model already knows what to predict - collinearity

In [40]:

impute_cols = [c for c in impute_valid.columns if not c in ['state','area_code','churn']+missing_cols]
#columns to keep, for column in .... if not column in ... ... ... + ... ..., go to front and keep column
y = impute_valid.vmail_plan.values
X = impute_valid[impute_cols]

10.C: Standardize the predictor matrix.

In [42]:

ss = StandardScaler()
Xs = ss.fit_transform(X)

In [43]:

X.columns

Out[43]:

Index(['account_length', 'intl_plan', 'day_mins', 'day_calls', 'day_charge',
       'eve_mins', 'eve_calls', 'eve_charge', 'night_mins', 'night_calls',
       'night_charge', 'intl_mins', 'intl_calls', 'intl_charge',
       'custserv_calls', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
       'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV',
       'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
       'VT', 'WA', 'WI', 'WV', 'WY'],
      dtype='object')

10.D: Find the best K for predicting vmail_plan.

You may want to write a function for this. What is the accuracy for predicting vmail_plan at the best K? What is the baseline accuracy for vmail_plan?

In [48]:

def find_best_k_cls(X, y, k_min=1, k_max=51, step=2, cv=5):
    k_range = list(range(k_min, k_max+1, step))
    accs = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=cv)
        accs.append(np.mean(scores))
    print(np.max(accs), np.argmax(k_range)) #prints the max and argmax (look into what argmax is)
    return np.argmax(k_range)

In [49]:

find_best_k_cls(Xs, y)

Out[49]:

0.7262193952009117 25

25

In [50]:

impute_valid.vmail_plan.mean() #if you predicted all ones for vmail_plan, you would be right 27% of the time
# less than 0.5

Out[50]:

0.27378111148994205

In [53]:

vmail_plan_baseline = 1. - impute_valid.vmail_plan.mean() #if you predicted all zeros for vmail_plan, you would be right 73% of the time
print(vmail_plan_baseline)

Out[53]:

0.726218888510058

In [54]:

# we can already see that our calculated baseline is practically
# identical to the best CV score which leads us down the path
# to believe that there would be very little knowledge gain from
# said model

#AKA shitty model

10.E: Fit a KNeighborsClassifier with the best number of neighbors.

In [56]:

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(Xs, y)

Out[56]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=25, p=2,
           weights='uniform')

10.F: Predict the missing vmail_plan values using the subset of the data where it is misssing.

You will need to:

Create a new predictor matrix using the same predictors but from the missing subset of data.

Standardize this predictor matrix using the StandardScaler object fit on the non-missing data. This means you will just use the .transform() function. It is important to standardize the new predictors the same way we standardized the original predictors if we want the predictions to make sense. Calling .fit_transform() will reset the standardized scale.
Predict what the missing vmail plan values should be.
Replace the missing values in the original with the predicted values.

Note: It may predict all 0's. This is OK. If you want to see the predicted probabilities of vmail_plan for each row you can use the .predict_proba() function instead of .predict(). You can use these probabilities to manually set the criteria threshold.

In [59]:

X_miss = impute_missing[impute_cols]
X_miss_s = ss.transform(X_miss)
# you don't fit again, bc you want to keep the mean the same as when you ran your first model, 
# and that is true for train test split too, so you just apply the same transformations
# if you don't do this you will get an inaccurate fit as the model criteria have shifted

In [61]:

vmail_plan_impute = knn.predict(X_miss_s) #this is when you output your y hat values

In [62]:

vmail_plan_impute

Out[62]:

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [63]:

# creating a DF copy to use to imputed missing values
churn_imputed = churn.copy()
# filling missing vmail_plan values with those predicted by KNN model
churn_imputed.loc[churn.vmail_plan.isnull(), 'vmail_plan'] = vmail_plan_impute

11. Impute the missing values for `vmail_message` using the same process.

Since vmail_message is essentially a continuous measure, you need to use KNeighborsRegressor instead of the KNeighborsClassifier.

KNN can do both regression and classification! Instead of "voting" on the class like in classification, the neighbors will average their value for the target in regression.

In [64]:

def find_best_k_reg(X, y, k_min=1, k_max=51, step=2, cv=10):
    k_range = list(range(k_min, k_max+1, step))
    r2s = []
    for k in k_range:
        knn = KNeighborsRegressor(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=cv)
        r2s.append(np.mean(scores))
    print(np.max(r2s), np.argmax(k_range))
    return np.argmax(k_range)

In [71]:

# x-true & y-true
y = impute_valid.vmail_message.values
X = impute_valid[impute_cols]

# set and fit the scaler
ss = StandardScaler()
Xs = ss.fit_transform(X)

# call/use find k-best function on known data
best_k = find_best_k_reg(Xs, y)

# apply k-best to fit model
knn = KNeighborsRegressor(n_neighbors=best_k)
knn.fit(Xs, y)

# prepair rows with missing target values
X_miss = impute_missing[impute_cols]
X_miss_s = ss.transform(X_miss)

# use model to predict unknown values
vmail_message_impute = knn.predict(X_miss_s)
vmail_message_impute

# this r2 is not very good

#do you mean the fit of the model? why is r2 brought up here? If it is accuracy, how can you have negative accuracy?

Out[71]:

-0.034936418201589305 25

array([ 7.68,  8.28,  5.36,  8.48, 10.68,  6.  ,  8.68,  8.08,  9.32,
        7.8 ,  9.72,  3.8 ,  8.88,  4.2 ,  2.44, 11.72,  8.04,  6.88,
        8.08,  7.84, 11.76,  7.56,  8.12, 11.84,  5.16,  9.88, 15.48,
       10.04,  6.76, 11.72,  7.96,  7.68,  8.92,  8.88,  6.4 ,  7.36,
        9.72, 10.04, 11.  ,  5.16, 10.96,  7.4 ,  4.92, 11.4 , 10.36,
       10.56,  7.08,  8.72,  9.28,  6.24,  8.92,  7.84, 12.24,  6.84,
        4.24, 10.8 ,  5.72,  5.92,  4.76,  8.72,  6.88,  8.32,  6.92,
        9.04,  7.36,  9.2 ,  5.48,  2.92,  4.  , 11.2 ,  9.36,  7.84,
        7.76,  3.6 ,  8.2 ,  9.32, 10.2 ,  8.08,  7.8 , 10.12,  7.64,
       12.08,  8.52,  3.28,  8.72,  7.  ,  2.8 ,  9.12,  6.36,  9.2 ,
       11.88,  9.  ,  7.84, 10.8 ,  9.44,  5.72, 10.36, 10.52,  3.72,
        8.8 ,  4.72,  6.32,  8.36,  5.44,  6.8 ,  4.04,  3.88,  3.4 ,
        4.88,  5.28,  1.68,  9.28,  9.24,  9.28,  6.28,  9.8 ,  5.68,
        4.32,  3.8 ,  8.4 ,  9.12,  9.04,  4.2 ,  9.88, 11.32,  4.84,
        8.64, 11.  , 11.2 ,  9.6 ,  6.12,  8.6 ,  8.56,  4.24,  9.12,
        7.24,  6.92,  7.44,  4.4 ,  9.44,  5.8 ,  7.52,  8.2 ,  6.16,
        8.44, 11.44,  7.52,  8.92, 10.08,  6.24,  9.76,  9.12,  7.56,
       10.24,  8.04,  7.4 ,  7.28,  4.12, 10.88,  4.32,  9.32,  4.64,
        6.48,  4.48, 10.8 ,  6.52, 10.44, 11.2 ,  9.4 ,  5.92,  8.36,
        9.04, 10.68, 11.44, 10.88,  5.6 ,  5.64,  7.12,  8.36,  7.64,
        7.24,  7.24,  8.  ,  7.84,  9.12, 12.48,  3.52,  8.96,  6.88,
        8.6 ,  7.12,  8.04,  9.24,  7.48,  5.12,  5.08,  6.84,  6.36,
        6.72,  9.24, 12.24,  7.08,  9.76, 11.08,  1.04,  7.48,  6.4 ,
        4.04, 10.08, 10.6 ,  8.8 ,  7.88,  7.4 ,  6.  ,  5.52,  8.72,
        7.68,  2.72,  8.12,  9.28,  8.8 ,  4.4 ,  7.64,  9.12,  8.2 ,
        6.28,  5.76,  8.2 ,  9.68,  8.24,  8.28,  0.64,  7.56,  8.08,
        4.8 , 11.36,  9.72,  7.28,  7.08,  8.52,  8.72,  7.92,  6.08,
        9.2 ,  9.36,  9.72,  7.2 ,  9.56,  8.6 , 10.56,  6.88,  6.84,
        8.6 ,  5.08,  8.96, 10.8 ,  6.08, 11.32,  8.28,  6.72,  5.16,
       11.8 ,  6.6 , 10.  ,  9.04,  9.28,  8.6 ,  4.36,  4.96,  7.4 ,
        5.12, 12.72,  9.56,  8.96,  9.08,  6.56,  7.12,  6.8 ,  7.96,
        7.12, 11.48,  6.56,  8.4 ,  9.64,  5.08,  3.48,  9.  ,  7.28,
        7.36,  8.56,  8.56,  9.56,  7.88,  6.72,  9.64,  5.68,  8.76,
        9.24, 12.28,  7.6 , 11.12, 10.56,  8.32,  9.24,  6.32,  7.68,
        8.84,  1.64,  9.68,  7.96, 13.8 ,  4.56,  8.04,  4.32, 13.04,
        7.92,  4.36,  8.8 ,  7.4 ,  8.52, 13.72,  8.92,  8.16,  8.16,
        6.08,  7.28,  7.16,  3.12,  7.08,  8.08, 10.04,  4.88,  6.2 ,
       10.24,  7.76,  5.96,  8.64,  7.52,  7.08,  8.76,  7.24,  4.92,
        7.96,  7.36,  9.44,  5.32,  6.52,  7.48,  7.4 ,  5.4 ,  8.76,
        5.76,  8.  ,  8.08,  8.2 ,  8.76, 10.36, 12.92,  6.92,  5.04,
        5.64, 12.24,  8.16,  8.88,  8.44,  3.56,  8.28,  9.72,  9.96,
        8.36,  9.  ,  9.48,  7.72,  5.32,  5.72,  7.24,  9.8 ,  6.16,
        6.28,  9.96,  7.92,  7.32,  7.72,  7.92,  8.72,  7.32,  4.96,
       13.04,  2.28,  5.52,  6.56,  6.72,  8.12,  7.2 ,  6.2 , 12.68,
        9.96,  8.76,  7.72, 11.48])

In [70]:

# assign predicted values to missing values in dataframe
churn_imputed.loc[churn.vmail_message.isnull(), 'vmail_message'] = vmail_message_impute

12. Given the accuracy (and $R^2$ ) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

In [73]:

# The accuracy and R2 are very bad. Thus our imputed values are most likely wrong with these models.
# This doesn't necessarily mean that imputation is a bad idea, but we may want to consider
# using a different method.

13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

In [75]:

X = churn_imputed[[c for c in churn_nona.columns if not c in ['area_code','state','churn']]]
y = churn_imputed.churn.values

ss = StandardScaler()
Xs = ss.fit_transform(X)

k_values = list(range(1,50,2))
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, Xs, y, cv=10)
    accs.append(np.mean(scores))
    
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(k_values, accs, lw=3)
plt.show()

Out[75]:

In [76]:

print(np.max(accs))

Out[76]:

0.8580892269515022

In [55]:

# Its basically exactly the same. 
# However the peak performance comes at about k=9
# Given that our model using imputed data scores almost exactly the same
# and requires more neighbors to come reach optimal accuracy
# we would either want to investigate other means of imputation
# or
# utilize the original model as it is computationally more efficient.

In [ ]:

KNN Classification and Imputation: Cell Phone Churn Data

1. Load the cell phone "churn" data containing some missing values.

2. Examine the data. What columns have missing values?

3. Convert the `vmail_plan` and `intl_plan` colums to binary integer columns.

4. Create dummy coded columns for state and concatenate it to the churn dataset.

5. Create a version of the churn data that has no missing values.

6. Create a target vector and predictor matrix.

7. Calculate the baseline accuracy for `churn`.

8. Cross-validate a KNN model predicting `churn`.

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

10. Imputing with KNN

11. Impute the missing values for `vmail_message` using the same process.

12. Given the accuracy (and $R^2$ ) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

Product

Resources

Company

KNN Classification and Imputation: Cell Phone Churn Data

1. Load the cell phone "churn" data containing some missing values.

2. Examine the data. What columns have missing values?

3. Convert the vmail_plan and intl_plan colums to binary integer columns.

4. Create dummy coded columns for state and concatenate it to the churn dataset.

5. Create a version of the churn data that has no missing values.

6. Create a target vector and predictor matrix.

7. Calculate the baseline accuracy for churn.

8. Cross-validate a KNN model predicting churn.

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

10. Imputing with KNN

11. Impute the missing values for vmail_message using the same process.

12. Given the accuracy (and R2R^2R2) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

3. Convert the `vmail_plan` and `intl_plan` colums to binary integer columns.

7. Calculate the baseline accuracy for `churn`.

8. Cross-validate a KNN model predicting `churn`.

11. Impute the missing values for `vmail_message` using the same process.

12. Given the accuracy (and $R^2$ ) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?