Path: blob/master/lessons/lesson_07/code/knn_classification_imputation-lab.ipynb
KNN Classification and Imputation: Cell Phone Churn Data
Authors: Kiefer Katovich (SF)
In this lab you will practice using KNN for classification (and a little bit for regression as well).
The dataset is one on "churn" in cell phone plans. It has information on phone usage by different account holders and whether or not they churned.
Our goal is to predict whether a user will churn or not based on the other features.
We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.
1. Load the cell phone "churn" data containing some missing values.
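A minimal sketch of loading the data with pandas. The filename and path here are placeholders; point `read_csv` at wherever the churn CSV actually lives in your lesson folder.

```python
import pandas as pd

# Hypothetical path -- adjust to the actual location of the churn dataset.
churn = pd.read_csv('./datasets/churn_missing.csv')
```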
2. Examine the data. What columns have missing values?
Remember to use our standard exploratory commands (head, describe, info, isnull, dtypes, ...).
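For example, assuming the DataFrame from step 1 is named `churn`:

```python
churn.head()           # first few rows
churn.info()           # dtypes and non-null counts per column
churn.describe()       # numeric summary statistics
churn.isnull().sum()   # count of missing values per column
```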
3. Convert the vmail_plan and intl_plan columns to binary integer columns.
Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.
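One way to do this, assuming the columns hold 'yes'/'no' strings: `.map()` with a dictionary leaves any value not in the dictionary (including NaN) as NaN, so the missing entries are preserved.

```python
binary_map = {'yes': 1, 'no': 0}

# Values not found in binary_map (e.g. NaN) stay missing.
churn['vmail_plan'] = churn['vmail_plan'].map(binary_map)
churn['intl_plan'] = churn['intl_plan'].map(binary_map)
```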
4. Create dummy-coded columns for state and concatenate them to the churn dataset.
Remember: You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.
Use pd.get_dummies(..., drop_first=True).
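A sketch of one way to do this; the `prefix` argument is optional but keeps the new column names readable.

```python
# Dummy-code state, dropping the first level as the reference category.
state_dummies = pd.get_dummies(churn['state'], prefix='state', drop_first=True)

# Attach the dummy columns to the churn DataFrame.
churn = pd.concat([churn, state_dummies], axis=1)
```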
5. Create a version of the churn data that has no missing values.
Use dropna(), then check the shape of the resulting DataFrame.
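For example:

```python
# Drop any row with at least one missing value.
churn_nona = churn.dropna()

# Compare row counts before and after dropping.
print(churn.shape, churn_nona.shape)
```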
6. Create a target vector and predictor matrix.
Target should be the churn column. The predictor matrix should be all columns except area_code, state, and churn.
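A sketch using the no-missing-values DataFrame from step 5 (assumed to be named `churn_nona`):

```python
# Target vector: the churn column.
y = churn_nona['churn'].values

# Predictor matrix: everything except area_code, state, and churn.
# The state dummy columns created in step 4 remain in X.
X = churn_nona.drop(columns=['area_code', 'state', 'churn'])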
7. Calculate the baseline accuracy for churn.
What percent of the churn target values (y) == 1? (this is just the average value of the column. Why is that?)
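A short sketch, assuming churn is coded 0/1 (or True/False). Because the target is binary, its mean is the proportion of 1s; the baseline accuracy is what you would get by always predicting the majority class.

```python
churn_rate = y.mean()                      # proportion of accounts that churned
baseline = max(churn_rate, 1 - churn_rate) # accuracy of always guessing the majority class
print(churn_rate, baseline)
```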
8. Cross-validate a KNN model predicting churn.
Number of neighbors should be 5.
Make sure to standardize the predictor matrix.
Set cross-validation folds to 10.
Report the mean cross-validated accuracy.
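A minimal sketch with scikit-learn, reusing `X` and `y` from step 6. Note that scaling the full matrix before cross-validation leaks a little information across folds; wrapping the scaler and model in a `Pipeline` avoids that, but the simple version below is fine for this lab.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Standardize the predictors so distances are not dominated by large-scale features.
Xs = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, Xs, y, cv=10)
print(scores.mean())
```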
9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.
Plot the cross-validated mean accuracy for each k. What is the best accuracy?
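A sketch of the loop and plot, reusing the standardized matrix `Xs` and target `y` from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Odd values of k from 1 to 49.
k_values = list(range(1, 50, 2))
mean_accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_accs.append(cross_val_score(knn, Xs, y, cv=10).mean())

plt.plot(k_values, mean_accs, marker='o')
plt.xlabel('k (number of neighbors)')
plt.ylabel('mean cross-validated accuracy')
plt.show()

# Best accuracy and the k that achieved it.
print(max(mean_accs), k_values[int(np.argmax(mean_accs))])
```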