GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/solution-code/knn_classification_imputation-lab-solutions - (done).ipynb
Kernel: Python 3

KNN Classification and Imputation: Cell Phone Churn Data

Authors: Kiefer Katovich (SF)


In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset covers "churn" in cell phone plans. It has information on how different account holders use their phones and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier

1. Load the cell phone "churn" data containing some missing values.

churn = pd.read_csv('../../assets/data/churn_missing.csv')

2. Examine the data. What columns have missing values?

churn.head(3)
churn.shape
(3333, 20)
churn.isnull().sum()
state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64
# about 12% of the vmail_plan and vmail_message values are null (400 of 3333)
churn.dtypes
state              object
account_length      int64
area_code           int64
intl_plan          object
vmail_plan         object
vmail_message     float64
day_mins          float64
day_calls           int64
day_charge        float64
eve_mins          float64
eve_calls           int64
eve_charge        float64
night_mins        float64
night_calls         int64
night_charge      float64
intl_mins         float64
intl_calls          int64
intl_charge       float64
custserv_calls      int64
churn                bool
dtype: object
churn.intl_plan.value_counts(dropna=False)
no     3010
yes     323
Name: intl_plan, dtype: int64
churn.vmail_plan.value_counts(dropna=False)  # dropna=False counts NaN as its own category; the default dropna=True leaves it out
no     2130
yes     803
NaN     400
Name: vmail_plan, dtype: int64
churn['state'].value_counts()
WV 106 MN 84 NY 83 AL 80 OR 78 WI 78 OH 78 WY 77 VA 77 CT 74 MI 73 ID 73 VT 73 UT 72 TX 72 IN 71 MD 70 KS 70 MT 68 NJ 68 NC 68 WA 66 NV 66 CO 66 RI 65 MS 65 MA 65 AZ 64 FL 63 MO 63 NM 62 ND 62 ME 62 NE 61 OK 61 DE 61 SC 60 SD 60 KY 59 IL 58 NH 56 AR 55 DC 54 GA 54 TN 53 HI 53 AK 52 LA 51 PA 45 IA 44 CA 34 Name: state, dtype: int64
#Note: DC is being counted as a state

3. Convert the vmail_plan and intl_plan columns to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.
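One way to see what "preserve the missing values" means in practice is a toy column. The sketch below is illustrative only: the `plan` Series is made up, and it uses a dictionary with `Series.map` rather than a lambda. Values that are not keys of the dictionary, including NaN, come out as NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vmail_plan / intl_plan (values are made up).
plan = pd.Series(['yes', 'no', np.nan, 'yes'])

# Mapping with a dict converts 'yes'/'no' and leaves NaN as NaN,
# because any value that is not a key of the dict maps to NaN.
plan_binary = plan.map({'yes': 1, 'no': 0})

print(plan_binary.tolist())
print(plan_binary.isnull().sum())
```

A caveat with the dictionary form: an unexpected spelling such as 'Yes' would also become NaN, so it is worth checking the value counts first.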

# Map 'yes' to 1 and 'no' to 0; any other value (the NaNs) is passed through unchanged.
churn.loc[:,'vmail_plan'] = churn.vmail_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn.loc[:,'intl_plan'] = churn.intl_plan.map(lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)

4. Create dummy coded columns for state and concatenate it to the churn dataset.

Remember: You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

states = pd.get_dummies(churn.state, drop_first=True)
states.head(3)
# drop_first=True left out Alaska (AK), the first state alphabetically, so it serves as the reference category
states.shape
(3333, 50)
print(len(churn.state.unique()))
51
churn = pd.concat([churn, states], axis=1)  # axis=1 joins the frames side by side (column-wise); axis=0 would stack them top to bottom
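The reference-column idea can be checked on a tiny example. This is a sketch on a made-up `state` Series, not the lab data: with `drop_first=True` the alphabetically first category gets no column, and rows belonging to it are the ones where every dummy is zero.

```python
import pandas as pd

# Made-up state values for illustration.
state = pd.Series(['AK', 'AL', 'CA', 'AK'], name='state')

# Categories are sorted, and drop_first=True drops the first one (AK).
dummies = pd.get_dummies(state, drop_first=True)
print(dummies.columns.tolist())

# Rows from the reference category have a zero in every dummy column.
reference_rows = dummies.sum(axis=1) == 0
print(reference_rows.tolist())
```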

5. Create a version of the churn data that has no missing values.

Calculate the shape

churn_nona = churn.dropna()
churn_nona.shape
(2933, 70)

6. Create a target vector and predictor matrix.

  • Target should be the churn column.

  • Predictor matrix should be all columns except area_code, state, and churn.

X = churn_nona.drop(['area_code','state','churn'], axis=1)
y = churn_nona.churn.values

7. Calculate the baseline accuracy for churn.

churn_nona.churn.mean()  # share of accounts that churned; below 0.5, so "did not churn" is the majority class
0.14353903852710534
# If you predicted "no churn" (0) for every row, you would already be about 85% accurate.
baseline = 1. - churn_nona.churn.mean()
print(baseline)
0.8564609614728946
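The baseline is simply the share of the most common class. As a small sketch on made-up labels (17 non-churners and 3 churners, chosen only for illustration), the same number can be read off `value_counts(normalize=True)` without deciding in advance which class is the majority:

```python
import pandas as pd

# Made-up labels: 17 accounts that stayed, 3 that churned.
churned = pd.Series([False] * 17 + [True] * 3)

# Share of each class; the largest share is the majority-class baseline.
class_shares = churned.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_demo = class_shares.max()

print(majority_class, baseline_demo)
```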

8. Cross-validate a KNN model predicting churn.

  • Number of neighbors should be 5.

  • Make sure to standardize the predictor matrix.

  • Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

knn = KNeighborsClassifier(n_neighbors=5)
from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler
# StandardScaler is a class that converts each column to z-scores (mean 0, standard deviation 1).
ss = StandardScaler()
# fit_transform is a StandardScaler method: it learns each column's mean and standard
# deviation from X, then returns the standardized values.
Xs = ss.fit_transform(X)
# knn = the model, Xs = standardized predictors, y = true labels, cv = number of folds
scores = cross_val_score(knn, Xs, y, cv=10)
print(scores)
print(np.mean(scores))
[0.84745763 0.86054422 0.86006826 0.85665529 0.87030717 0.85665529 0.85665529 0.85324232 0.85665529 0.85665529]
0.8574896042757937

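Note that the KNN model's mean cross-validated accuracy (about 0.857) is essentially the same as the baseline (about 0.856), so the model is not meaningfully better than always predicting "no churn".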
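The solution standardizes the full predictor matrix once and then cross-validates, which means each held-out fold has contributed to the scaler's mean and standard deviation. A common alternative, not used in the original lab, is to wrap the scaler and the model in a pipeline so the scaler is refit on the training folds only. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn predictors; the names `X_demo`, `y_demo`, and `pipe` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn predictors and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is fit inside each CV split, on that split's training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=10)

print(scores_demo.mean())
```

For a simple scaler the difference in scores is usually small, but the pipeline form keeps the held-out rows fully unseen.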

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each score. What is the best accuracy?

k_values = list(range(1,50,2))  # odd values only: 1, 3, 5, ..., 49
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)    # instantiate a KNN model with k neighbors
    scores = cross_val_score(knn, Xs, y, cv=10)  # fit and score the model on each of the 10 folds
    accs.append(np.mean(scores))                 # mean accuracy across the 10 folds
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(k_values, accs, lw=3)  # lw is the line width
plt.show()
print(np.max(accs))
[Figure: mean cross-validated accuracy plotted against k]
0.8581698921253003
# It looks like accuracy peaks at about k=7 and levels out after k=17, so the optimal k is about 7.
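The manual loop over k can also be expressed with scikit-learn's `GridSearchCV`, which runs the same cross-validation for each candidate and records the best one. This is a sketch on synthetic data rather than the churn set, and the step names `'scale'` and `'knn'` are arbitrary labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn predictors and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])

# Same grid as the loop: odd k from 1 to 49.
param_grid = {'knn__n_neighbors': list(range(1, 50, 2))}
search = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The per-k mean scores are available in `search.cv_results_['mean_test_score']` if you still want to plot the curve.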

10. Imputing with KNN

K-Nearest Neighbors can be used to impute missing values in datasets. What we will do is estimate the most likely value for the missing data based on a KNN model.

We have two columns with missing data:

  • vmail_plan

  • vmail_message

10.A: Create two subsets of the churn dataset: one without missing values for vmail_plan and vmail_message, and one with the missing values.

from sklearn.neighbors import KNeighborsRegressor

missing_cols = ['vmail_plan','vmail_message']
impute_missing = churn.loc[churn.vmail_plan.isnull(), :]   # all columns, for the rows where vmail_plan is null
impute_valid = churn.loc[~churn.vmail_plan.isnull(), :]    # the complement: rows where vmail_plan is present

First we will impute values for vmail_plan. This is a categorical column and so we will impute using classification (predicting whether the plan is yes or no, 1 vs. 0).

10.B: Create a target that is vmail_plan and predictor matrix that is all columns except state, area_code, churn, vmail_plan, and vmail_message.

Create a target (prediction or y value) and predictor matrix (x values).

Note: We don't include the churn variable in the model to impute. Why? We are imputing these missing values so that we can use the rows to predict churn with more data afterwards. If we imputed with churn as a predictor then we would be cheating.

This is known as data leakage. If churn were used to impute these columns, the imputed values would carry information about the target, and the later churn model would score well simply because it had indirectly been told what to predict.

# Keep every column except state, area_code, churn, and the two columns being imputed.
impute_cols = [c for c in impute_valid.columns if c not in ['state','area_code','churn'] + missing_cols]

y = impute_valid.vmail_plan.values
X = impute_valid[impute_cols]

10.C: Standardize the predictor matrix.

ss = StandardScaler()
Xs = ss.fit_transform(X)
X.columns
Index(['account_length', 'intl_plan', 'day_mins', 'day_calls', 'day_charge', 'eve_mins', 'eve_calls', 'eve_charge', 'night_mins', 'night_calls', 'night_charge', 'intl_mins', 'intl_calls', 'intl_charge', 'custserv_calls', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'], dtype='object')

10.D: Find the best K for predicting vmail_plan.

You may want to write a function for this. What is the accuracy for predicting vmail_plan at the best K? What is the baseline accuracy for vmail_plan?

def find_best_k_cls(X, y, k_min=1, k_max=51, step=2, cv=5):
    k_range = list(range(k_min, k_max+1, step))
    accs = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=cv)
        accs.append(np.mean(scores))
    # np.argmax(accs) is the position of the best mean accuracy; indexing k_range with it gives the matching k.
    # (np.argmax(k_range) would always return the last position, 25 with these defaults, whatever the scores are.)
    best_k = k_range[int(np.argmax(accs))]
    print(np.max(accs), best_k)
    return best_k
find_best_k_cls(Xs, y)
0.7262193952009117 25
25
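A helper like this has to translate the position of the best score back into a value of k, and it matters which list is passed to `np.argmax`. The sketch below uses made-up accuracies (flat at 0.80 with a single peak at k=7) to show the difference between taking the argmax of the k grid and the argmax of the scores.

```python
import numpy as np

k_range = list(range(1, 52, 2))   # 1, 3, ..., 51 (26 candidate values)

# Made-up scores for illustration: flat, with one peak at the 4th entry (k=7).
accs = [0.80] * len(k_range)
accs[3] = 0.86

# argmax of the grid itself: always the last position, regardless of the scores.
idx_of_largest_k = int(np.argmax(k_range))

# argmax of the scores: the position of the best-performing k.
idx_of_best_score = int(np.argmax(accs))
best_k = k_range[idx_of_best_score]

print(idx_of_largest_k, k_range[idx_of_largest_k])
print(idx_of_best_score, best_k)
```

With a grid of 26 odd values, the first call always yields 25, which is an index (pointing at k=51) rather than a tuned number of neighbors.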
impute_valid.vmail_plan.mean()  # below 0.5: predicting all ones for vmail_plan would be right only about 27% of the time
0.27378111148994205
# If you predicted all zeros for vmail_plan, you would be right about 73% of the time.
vmail_plan_baseline = 1. - impute_valid.vmail_plan.mean()
print(vmail_plan_baseline)
0.726218888510058
# The baseline is practically identical to the best CV score, which
# suggests that this model adds very little information beyond always
# predicting the majority class. In other words, it is a poor model.

10.E: Fit a KNeighborsClassifier with the best number of neighbors.

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(Xs, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=25, p=2, weights='uniform')

10.F: Predict the missing vmail_plan values using the subset of the data where it is missing.

You will need to:

  1. Create a new predictor matrix using the same predictors but from the missing subset of data.

  2. Standardize this predictor matrix using the StandardScaler object fit on the non-missing data. This means you will just use the .transform() function. It is important to standardize the new predictors the same way we standardized the original predictors if we want the predictions to make sense. Calling .fit_transform() will reset the standardized scale.

  3. Predict what the missing vmail plan values should be.

  4. Replace the missing values in the original with the predicted values.
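The difference between `.transform()` and `.fit_transform()` on the new rows can be seen on a one-column toy example. The arrays below are made up: `X_valid` plays the role of the rows the model was trained on and `X_missing` the rows to impute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_valid = np.array([[0.0], [10.0], [20.0]])   # rows with a known target
X_missing = np.array([[10.0], [30.0]])        # rows whose target will be imputed

# Fit on the valid rows only, then reuse that mean and scale for the new rows.
scaler = StandardScaler().fit(X_valid)
X_missing_s = scaler.transform(X_missing)

# Refitting on the new rows puts them on a different scale.
X_refit = StandardScaler().fit_transform(X_missing)

print(X_missing_s.ravel())
print(X_refit.ravel())
```

The value 10.0 sits exactly at the training mean, so it maps to 0 under `.transform()`, while refitting relabels it as one standard deviation below the mean of the new rows. The model would then look for neighbors in the wrong place.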

Note: It may predict all 0's. This is OK. If you want to see the predicted probabilities of vmail_plan for each row you can use the .predict_proba() function instead of .predict(). You can use these probabilities to manually set the criteria threshold.
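Setting the threshold manually could look something like the sketch below. It uses synthetic, imbalanced data from `make_classification` instead of the churn rows, and the cutoff of 0.3 is an arbitrary value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with roughly 25% positives.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.75, 0.25], random_state=0
)

knn_demo = KNeighborsClassifier(n_neighbors=25).fit(X_demo, y_demo)

# Column 1 of predict_proba is the share of the 25 neighbors that belong to class 1.
proba_pos = knn_demo.predict_proba(X_demo)[:, 1]

threshold = 0.3  # hypothetical cutoff; .predict() effectively uses a majority vote
pred_custom = (proba_pos >= threshold).astype(int)
pred_default = knn_demo.predict(X_demo)

print(pred_default.sum(), pred_custom.sum())
```

Lowering the cutoff below a majority vote can only add predicted ones, which is one way to avoid an all-zero imputation when the positive class is rare.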

X_miss = impute_missing[impute_cols]
# Use transform only, not fit_transform: the rows being imputed must be scaled with the
# means and standard deviations learned from the non-missing rows. The same rule applies
# to a train/test split. Refitting the scaler would shift the scale the model was trained
# on and make the predictions unreliable.
X_miss_s = ss.transform(X_miss)
vmail_plan_impute = knn.predict(X_miss_s)  # the predicted (y-hat) values
vmail_plan_impute
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# Create a copy of the DataFrame to hold the imputed values.
churn_imputed = churn.copy()

# Fill the missing vmail_plan values with those predicted by the KNN model.
churn_imputed.loc[churn.vmail_plan.isnull(), 'vmail_plan'] = vmail_plan_impute

11. Impute the missing values for vmail_message using the same process.

Since vmail_message is essentially a continuous measure, you need to use KNeighborsRegressor instead of the KNeighborsClassifier.

KNN can do both regression and classification! Instead of "voting" on the class like in classification, the neighbors will average their value for the target in regression.
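The averaging behavior is easy to verify by hand on a toy problem. In this sketch the four training points and their targets are made up; with three neighbors and the default uniform weights, the prediction at x = 1 should equal the mean of the three closest targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-dimensional training data.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 3.0, 6.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)

# The three nearest neighbors of x = 1 are x = 0, 1, 2.
pred = reg.predict([[1.0]])[0]
manual = y_toy[:3].mean()

print(pred, manual)
```

The distant point at x = 10 does not enter the average, which is why the prediction ignores its target of 100.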

def find_best_k_reg(X, y, k_min=1, k_max=51, step=2, cv=10):
    k_range = list(range(k_min, k_max+1, step))
    r2s = []
    for k in k_range:
        knn = KNeighborsRegressor(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=cv)
        r2s.append(np.mean(scores))
    # Index k_range with the position of the best mean R^2 to get the matching k.
    # (np.argmax(k_range) would always return the last position, 25 with these defaults.)
    best_k = k_range[int(np.argmax(r2s))]
    print(np.max(r2s), best_k)
    return best_k
# Target and predictors from the rows with known values
y = impute_valid.vmail_message.values
X = impute_valid[impute_cols]

# Fit the scaler on the known data
ss = StandardScaler()
Xs = ss.fit_transform(X)

# Find the best k on the known data
best_k = find_best_k_reg(Xs, y)

# Fit the model with that k
knn = KNeighborsRegressor(n_neighbors=best_k)
knn.fit(Xs, y)

# Prepare the rows with missing target values
X_miss = impute_missing[impute_cols]
X_miss_s = ss.transform(X_miss)

# Predict the unknown values
vmail_message_impute = knn.predict(X_miss_s)
vmail_message_impute

# For a regressor, cross_val_score reports R^2 by default, not accuracy. R^2 is negative
# when the model predicts worse than always guessing the mean, so this fit is poor.
-0.034936418201589305 25
array([ 7.68, 8.28, 5.36, 8.48, 10.68, 6. , 8.68, 8.08, 9.32, 7.8 , 9.72, 3.8 , 8.88, 4.2 , 2.44, 11.72, 8.04, 6.88, 8.08, 7.84, 11.76, 7.56, 8.12, 11.84, 5.16, 9.88, 15.48, 10.04, 6.76, 11.72, 7.96, 7.68, 8.92, 8.88, 6.4 , 7.36, 9.72, 10.04, 11. , 5.16, 10.96, 7.4 , 4.92, 11.4 , 10.36, 10.56, 7.08, 8.72, 9.28, 6.24, 8.92, 7.84, 12.24, 6.84, 4.24, 10.8 , 5.72, 5.92, 4.76, 8.72, 6.88, 8.32, 6.92, 9.04, 7.36, 9.2 , 5.48, 2.92, 4. , 11.2 , 9.36, 7.84, 7.76, 3.6 , 8.2 , 9.32, 10.2 , 8.08, 7.8 , 10.12, 7.64, 12.08, 8.52, 3.28, 8.72, 7. , 2.8 , 9.12, 6.36, 9.2 , 11.88, 9. , 7.84, 10.8 , 9.44, 5.72, 10.36, 10.52, 3.72, 8.8 , 4.72, 6.32, 8.36, 5.44, 6.8 , 4.04, 3.88, 3.4 , 4.88, 5.28, 1.68, 9.28, 9.24, 9.28, 6.28, 9.8 , 5.68, 4.32, 3.8 , 8.4 , 9.12, 9.04, 4.2 , 9.88, 11.32, 4.84, 8.64, 11. , 11.2 , 9.6 , 6.12, 8.6 , 8.56, 4.24, 9.12, 7.24, 6.92, 7.44, 4.4 , 9.44, 5.8 , 7.52, 8.2 , 6.16, 8.44, 11.44, 7.52, 8.92, 10.08, 6.24, 9.76, 9.12, 7.56, 10.24, 8.04, 7.4 , 7.28, 4.12, 10.88, 4.32, 9.32, 4.64, 6.48, 4.48, 10.8 , 6.52, 10.44, 11.2 , 9.4 , 5.92, 8.36, 9.04, 10.68, 11.44, 10.88, 5.6 , 5.64, 7.12, 8.36, 7.64, 7.24, 7.24, 8. , 7.84, 9.12, 12.48, 3.52, 8.96, 6.88, 8.6 , 7.12, 8.04, 9.24, 7.48, 5.12, 5.08, 6.84, 6.36, 6.72, 9.24, 12.24, 7.08, 9.76, 11.08, 1.04, 7.48, 6.4 , 4.04, 10.08, 10.6 , 8.8 , 7.88, 7.4 , 6. , 5.52, 8.72, 7.68, 2.72, 8.12, 9.28, 8.8 , 4.4 , 7.64, 9.12, 8.2 , 6.28, 5.76, 8.2 , 9.68, 8.24, 8.28, 0.64, 7.56, 8.08, 4.8 , 11.36, 9.72, 7.28, 7.08, 8.52, 8.72, 7.92, 6.08, 9.2 , 9.36, 9.72, 7.2 , 9.56, 8.6 , 10.56, 6.88, 6.84, 8.6 , 5.08, 8.96, 10.8 , 6.08, 11.32, 8.28, 6.72, 5.16, 11.8 , 6.6 , 10. , 9.04, 9.28, 8.6 , 4.36, 4.96, 7.4 , 5.12, 12.72, 9.56, 8.96, 9.08, 6.56, 7.12, 6.8 , 7.96, 7.12, 11.48, 6.56, 8.4 , 9.64, 5.08, 3.48, 9. 
, 7.28, 7.36, 8.56, 8.56, 9.56, 7.88, 6.72, 9.64, 5.68, 8.76, 9.24, 12.28, 7.6 , 11.12, 10.56, 8.32, 9.24, 6.32, 7.68, 8.84, 1.64, 9.68, 7.96, 13.8 , 4.56, 8.04, 4.32, 13.04, 7.92, 4.36, 8.8 , 7.4 , 8.52, 13.72, 8.92, 8.16, 8.16, 6.08, 7.28, 7.16, 3.12, 7.08, 8.08, 10.04, 4.88, 6.2 , 10.24, 7.76, 5.96, 8.64, 7.52, 7.08, 8.76, 7.24, 4.92, 7.96, 7.36, 9.44, 5.32, 6.52, 7.48, 7.4 , 5.4 , 8.76, 5.76, 8. , 8.08, 8.2 , 8.76, 10.36, 12.92, 6.92, 5.04, 5.64, 12.24, 8.16, 8.88, 8.44, 3.56, 8.28, 9.72, 9.96, 8.36, 9. , 9.48, 7.72, 5.32, 5.72, 7.24, 9.8 , 6.16, 6.28, 9.96, 7.92, 7.32, 7.72, 7.92, 8.72, 7.32, 4.96, 13.04, 2.28, 5.52, 6.56, 6.72, 8.12, 7.2 , 6.2 , 12.68, 9.96, 8.76, 7.72, 11.48])
# Assign the predicted values to the missing entries in the DataFrame.
churn_imputed.loc[churn.vmail_message.isnull(), 'vmail_message'] = vmail_message_impute

12. Given the accuracy (and R^2) of your best imputation models when finding the best K neighbors, do you think imputing is a good idea?

# The accuracy and R^2 are very bad, so the values imputed by these models are most likely wrong.
# This doesn't necessarily mean that imputation is a bad idea, but we may want to consider
# using a different method.
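A negative score from the regression helper can look odd next to accuracy, which never drops below zero. The sketch below uses three made-up target values to show how R^2, computed with scikit-learn's `r2_score`, is 0 for a model that always predicts the mean and negative for one that does worse than that.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Always predicting the mean gives R^2 = 0.
r2_mean = r2_score(y_true, np.full(3, y_true.mean()))

# Predictions that are worse than the mean give a negative R^2:
# residual sum of squares = 8, total sum of squares = 2, so R^2 = 1 - 8/2.
r2_bad = r2_score(y_true, np.array([3.0, 2.0, 1.0]))

print(r2_mean, r2_bad)
```

A cross-validated R^2 slightly below zero therefore means the KNN regressor is doing no better than filling in the mean of vmail_message.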

13. With the imputed dataset, cross-validate the accuracy predicting churn. Is it better? Worse? The same?

X = churn_imputed[[c for c in churn_nona.columns if c not in ['area_code','state','churn']]]
y = churn_imputed.churn.values

ss = StandardScaler()
Xs = ss.fit_transform(X)

k_values = list(range(1,50,2))
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, Xs, y, cv=10)
    accs.append(np.mean(scores))

fig, ax = plt.subplots(figsize=(8,5))
ax.plot(k_values, accs, lw=3)
plt.show()
[Figure: mean cross-validated accuracy plotted against k, using the imputed dataset]
print(np.max(accs))
0.8580892269515022
# It's basically exactly the same.
# However, the peak performance comes at about k=9.
# Given that the model using imputed data scores almost exactly the same
# and requires more neighbors to reach optimal accuracy,
# we would either want to investigate other means of imputation
# or use the original model, as it is computationally more efficient.
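As one example of another means of imputation, scikit-learn (version 0.22 and later) ships a `KNNImputer` that fills each missing entry with the average of that feature across the nearest rows, measuring distance on the features both rows have observed. This is a different technique from the per-column KNN models built in this lab, and the small array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up matrix with two missing entries.
X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_demo)

print(X_filled)
```

Because the imputer averages neighbor values, a binary column such as vmail_plan could come back with fractional values and would need rounding or a threshold afterwards.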