Path: blob/master/lessons/lesson_07/code/solution-code/knn_classification_imputation-lab-solutions - (done).ipynb
KNN Classification and Imputation: Cell Phone Churn Data
Authors: Kiefer Katovich (SF)
In this lab you will practice using KNN for classification (and a little bit for regression as well).
The dataset concerns "churn" in cell phone plans. It has information on phone usage by different account holders and whether or not they churned.
Our goal is to predict whether a user will churn or not based on the other features.
We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.
1. Load the cell phone "churn" data containing some missing values.
2. Examine the data. What columns have missing values?
3. Convert the `vmail_plan` and `intl_plan` columns to binary integer columns.
Make sure that if a value is missing you don't fill it in with a new value! Preserve the missing values.
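A minimal sketch of this conversion, using a toy DataFrame in place of the real churn data (column names taken from the lab). `Series.map` leaves any value not in the mapping dict — including `NaN` — as `NaN`, so missing entries are preserved automatically:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the churn data (real data has many more columns/rows).
churn = pd.DataFrame({
    'vmail_plan': ['yes', 'no', np.nan, 'yes'],
    'intl_plan':  ['no', np.nan, 'no', 'yes'],
})

# .map() returns NaN for anything not in the dict, so missing values survive.
binary_map = {'yes': 1, 'no': 0}
for col in ['vmail_plan', 'intl_plan']:
    churn[col] = churn[col].map(binary_map)

print(churn)
```

The columns come out as floats (not ints) because `NaN` forces a float dtype; that is expected until the missing values are imputed later.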
4. Create dummy-coded columns for `state` and concatenate them to the churn dataset.
Remember: You will need to leave out one of the state dummy-coded columns to serve as the "reference" column, since we will be using these for modeling.
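A sketch of the dummy coding on toy data: `pd.get_dummies` with `drop_first=True` drops the first category (alphabetically) so it serves as the reference column.

```python
import pandas as pd

# Toy stand-in; the real data has 50+ states.
churn = pd.DataFrame({
    'state': ['CA', 'NY', 'TX', 'CA'],
    'churn': [0, 1, 0, 0],
})

# drop_first=True leaves out one state (here 'CA') as the reference category.
state_dummies = pd.get_dummies(churn['state'], prefix='state', drop_first=True)
churn = pd.concat([churn, state_dummies], axis=1)

print(churn.columns.tolist())
```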
5. Create a version of the churn data that has no missing values. Calculate the shape.
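This step can be sketched with `dropna()`, again on a toy frame standing in for the churn data:

```python
import numpy as np
import pandas as pd

churn = pd.DataFrame({
    'vmail_plan':    [1, np.nan, 0, 1],
    'vmail_message': [10, 5, np.nan, 0],
    'churn':         [0, 1, 0, 1],
})

# Drop any row containing at least one missing value.
churn_nona = churn.dropna()
print(churn.shape, churn_nona.shape)
```

Comparing the two shapes shows how many rows were lost to missing data.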
6. Create a target vector and predictor matrix.
Target should be the `churn` column.
Predictor matrix should be all columns except `area_code`, `state`, and `churn`.
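A minimal sketch, with a toy frame whose columns are assumed from the lab description:

```python
import pandas as pd

churn_nona = pd.DataFrame({
    'state': ['CA', 'NY'],
    'area_code': [415, 510],
    'day_mins': [180.0, 200.5],
    'intl_plan': [0, 1],
    'churn': [0, 1],
})

y = churn_nona['churn']
X = churn_nona.drop(columns=['area_code', 'state', 'churn'])
print(X.columns.tolist())
```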
7. Calculate the baseline accuracy for `churn`.
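The baseline accuracy is what you'd score by always predicting the majority class. With a toy target vector:

```python
import pandas as pd

# Toy target: 6 of 8 users did not churn.
y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Baseline = proportion of the majority class.
baseline = y.value_counts(normalize=True).max()
print(baseline)
```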
8. Cross-validate a KNN model predicting `churn`.
Number of neighbors should be 5.
Make sure to standardize the predictor matrix.
Set cross-validation folds to 10.
Report the mean cross-validated accuracy.
Note that the KNN model's cross-validated accuracy is not necessarily better than the baseline.
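A sketch of the cross-validation step using synthetic data in place of the churn predictors. Wrapping the scaler and classifier in a `Pipeline` means standardization is re-fit inside each training fold, which avoids leaking test-fold statistics into the scaler:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn predictor matrix and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())
```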
9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.
Plot the cross-validated mean accuracy for each score. What is the best accuracy?
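The k-sweep can be sketched as below (synthetic data again; the plotting call is left as a comment since the numbers are what the test cares about):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

k_values = list(range(1, 50, 2))   # odd k only: 1, 3, ..., 49
mean_accs = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_accs.append(cross_val_score(model, X, y, cv=10).mean())

best_k = k_values[int(np.argmax(mean_accs))]
print(best_k, max(mean_accs))
# To plot: import matplotlib.pyplot as plt; plt.plot(k_values, mean_accs)
```

Odd k avoids ties in the neighbor vote for a binary target.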
10. Imputing with KNN
K-Nearest Neighbors can be used to impute missing values in datasets. What we will do is estimate the most likely value for the missing data based on a KNN model.
We have two columns with missing data: `vmail_plan` and `vmail_message`.
10.A: Create two subsets of the churn dataset: one without missing values for `vmail_plan` and `vmail_message`, and one with the missing values.
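The split can be sketched with a boolean mask over the two columns (toy data, column names from the lab):

```python
import numpy as np
import pandas as pd

churn = pd.DataFrame({
    'vmail_plan':    [1, np.nan, 0, 1],
    'vmail_message': [10, np.nan, 5, 3],
    'day_mins':      [180.0, 200.5, 150.0, 120.0],
})

# True where either imputation target is missing.
missing_mask = churn['vmail_plan'].isnull() | churn['vmail_message'].isnull()
churn_missing = churn[missing_mask]
churn_complete = churn[~missing_mask]
print(churn_complete.shape, churn_missing.shape)
```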
First we will impute values for `vmail_plan`. This is a categorical column, so we will impute using classification (predicting whether the plan is yes or no, 1 vs. 0).
10.B: Create a target that is `vmail_plan` and a predictor matrix that is all columns except `state`, `area_code`, `churn`, `vmail_plan`, and `vmail_message`.
Create a target (prediction or y value) and predictor matrix (x values).
Note: We don't include the `churn` variable in the imputation model. Why? We are imputing these missing values so that we can use those rows to predict churn with more data afterwards. If we imputed with churn as a predictor, we would be cheating.
This is also known as data leakage: if the imputation model is given the churn values, the imputed features encode information about the target, and the later churn model will appear to perform better than it genuinely would.
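A minimal sketch of building the imputation target and predictors, on a toy frame with the column names assumed from the lab:

```python
import pandas as pd

# Toy stand-in for the no-missing-values subset of the churn data.
churn_nona = pd.DataFrame({
    'state': ['CA', 'NY'],
    'area_code': [415, 510],
    'day_mins': [180.0, 200.5],
    'churn': [0, 1],
    'vmail_plan': [1, 0],
    'vmail_message': [10, 0],
})

drop_cols = ['state', 'area_code', 'churn', 'vmail_plan', 'vmail_message']
y_impute = churn_nona['vmail_plan']
X_impute = churn_nona.drop(columns=drop_cols)
print(X_impute.columns.tolist())
```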
10.C: Standardize the predictor matrix.
10.D: Find the best K for predicting `vmail_plan`.
You may want to write a function for this. What is the accuracy for predicting `vmail_plan` at the best K? What is the baseline accuracy for `vmail_plan`?
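One way to write such a function (a sketch: `best_k_for` is a hypothetical helper name, and it assumes `X` has already been standardized):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k_for(X, y, k_values, cv=5):
    """Return (best_k, best_mean_accuracy) over the candidate ks."""
    means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
             for k in k_values]
    best = int(np.argmax(means))
    return k_values[best], means[best]

# Synthetic, pre-scaled stand-in for the standardized predictors.
X, y = make_classification(n_samples=150, n_features=4, random_state=1)
k, acc = best_k_for(X, y, list(range(1, 30, 2)))
print(k, acc)
```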
10.E: Fit a `KNeighborsClassifier` with the best number of neighbors.
10.F: Predict the missing `vmail_plan` values using the subset of the data where it is missing.
You will need to:
- Create a new predictor matrix using the same predictors but from the missing subset of data.
- Standardize this predictor matrix using the StandardScaler object fit on the non-missing data. This means you will just use the `.transform()` function. It is important to standardize the new predictors the same way we standardized the original predictors if we want the predictions to make sense. Calling `.fit_transform()` would reset the standardized scale.
- Predict what the missing vmail plan values should be.
- Replace the missing values in the original with the predicted values.

Note: It may predict all 0's. This is OK. If you want to see the predicted probabilities of `vmail_plan` for each row, you can use the `.predict_proba()` function instead of `.predict()`. You can use these probabilities to manually set the decision threshold.
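The steps above can be sketched end to end on toy data with a single predictor (`day_mins` is an assumed column name):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

churn = pd.DataFrame({
    'day_mins':   [180., 200., 150., 120., 210., 90.],
    'vmail_plan': [1, 0, 1, 0, np.nan, np.nan],
})

complete = churn[churn['vmail_plan'].notnull()]
missing = churn[churn['vmail_plan'].isnull()]

# Fit the scaler ONLY on the non-missing subset...
scaler = StandardScaler().fit(complete[['day_mins']])
Xc = scaler.transform(complete[['day_mins']])
# ...then reuse it with .transform() (NOT .fit_transform()) on the missing rows.
Xm = scaler.transform(missing[['day_mins']])

knn = KNeighborsClassifier(n_neighbors=3).fit(Xc, complete['vmail_plan'])
churn.loc[churn['vmail_plan'].isnull(), 'vmail_plan'] = knn.predict(Xm)
print(churn['vmail_plan'].tolist())
```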
11. Impute the missing values for `vmail_message` using the same process.
Since `vmail_message` is essentially a continuous measure, you need to use `KNeighborsRegressor` instead of `KNeighborsClassifier`.
KNN can do both regression and classification! Instead of "voting" on the class as in classification, in regression the neighbors average their values of the target.
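The regression version mirrors the classification sketch above, swapping in `KNeighborsRegressor` (toy data, assumed column names):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

churn = pd.DataFrame({
    'day_mins':      [180., 200., 150., 120., 210., 90.],
    'vmail_message': [10., 0., 8., 2., np.nan, np.nan],
})

complete = churn[churn['vmail_message'].notnull()]
missing = churn[churn['vmail_message'].isnull()]

scaler = StandardScaler().fit(complete[['day_mins']])
Xc = scaler.transform(complete[['day_mins']])
Xm = scaler.transform(missing[['day_mins']])

# The regressor averages the neighbors' vmail_message values.
reg = KNeighborsRegressor(n_neighbors=3).fit(Xc, complete['vmail_message'])
churn.loc[churn['vmail_message'].isnull(), 'vmail_message'] = reg.predict(Xm)
print(churn['vmail_message'].tolist())
```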