YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/knn_classification_imputation-lab.ipynb
1904 views
Kernel: Python 3

KNN Classification and Imputation: Cell Phone Churn Data

Author: Kiefer Katovich (SF)


In this lab you will practice using KNN for classification (and a little bit for regression as well).

The dataset covers "churn" in cell phone plans. It has information on account holders' phone usage and whether or not they churned.

Our goal is to predict whether a user will churn or not based on the other features.

We will also be using the KNN model to impute missing data. There are a couple of columns in the dataset with missing values, and we can build KNN models to predict what those missing values will most likely be. This is a more advanced imputation method than just filling in the mean or median.
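The idea behind KNN imputation can be sketched with scikit-learn's built-in `KNNImputer` (the lab builds a classifier-based version by hand later; the tiny array below is made up purely for illustration). Each missing value is filled in from the rows most similar on the features that are present:

```python
# Minimal sketch of KNN imputation on a toy array (not the churn data).
# The missing value in row 2 is filled from the 2 nearest complete rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [1.2, 2.2],
              [1.1, np.nan],
              [9.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Rows 0 and 1 are the nearest neighbors of row 2 on the first feature,
# so the missing value becomes the mean of their second feature: 2.1.
print(X_filled)
```

Because distances here are dominated by the first column, the far-away row `[9.0, 8.0]` has no influence on the imputed value.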

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

1. Load the cell phone "churn" data containing some missing values.

churn = pd.read_csv('../assets/data/churn_missing.csv')

2. Examine the data. What columns have missing values?

Remember to use our standard 4 or 5 commands (head, describe, info, isnull, dtypes....)

# A:
churn.head()
churn.describe()
churn.isnull().sum()
state               0
account_length      0
area_code           0
intl_plan           0
vmail_plan        400
vmail_message     400
day_mins            0
day_calls           0
day_charge          0
eve_mins            0
eve_calls           0
eve_charge          0
night_mins          0
night_calls         0
night_charge        0
intl_mins           0
intl_calls          0
intl_charge         0
custserv_calls      0
churn               0
dtype: int64
churn.dtypes
state              object
account_length      int64
area_code           int64
intl_plan          object
vmail_plan         object
vmail_message     float64
day_mins          float64
day_calls           int64
day_charge        float64
eve_mins          float64
eve_calls           int64
eve_charge        float64
night_mins        float64
night_calls         int64
night_charge      float64
intl_mins         float64
intl_calls          int64
intl_charge       float64
custserv_calls      int64
churn                bool
dtype: object

3. Convert the vmail_plan and intl_plan columns to binary integer columns.

Make sure that if a value is missing that you don't fill it in with a new value! Preserve the missing values.

# A: Some code to help you - turns vmail_plan into a 0/1 column,
# leaving missing values untouched.
churn.loc[:, 'vmail_plan'] = churn.vmail_plan.map(
    lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn.loc[:, 'intl_plan'] = churn.intl_plan.map(
    lambda x: 1 if x == 'yes' else 0 if x == 'no' else x)
churn['intl_plan'].dtypes
dtype('int64')
churn.state.value_counts()
WV    106
MN     84
NY     83
AL     80
OR     78
WI     78
OH     78
WY     77
VA     77
CT     74
ID     73
MI     73
VT     73
TX     72
UT     72
IN     71
KS     70
MD     70
NJ     68
MT     68
NC     68
NV     66
WA     66
CO     66
RI     65
MS     65
MA     65
AZ     64
MO     63
FL     63
ND     62
NM     62
ME     62
NE     61
OK     61
DE     61
SC     60
SD     60
KY     59
IL     58
NH     56
AR     55
GA     54
DC     54
TN     53
HI     53
AK     52
LA     51
PA     45
IA     44
CA     34
Name: state, dtype: int64

4. Create dummy coded columns for state and concatenate it to the churn dataset.

Remember: You will need to leave out one of the state dummy coded columns to serve as the "reference" column since we will be using these for modeling.

use pd.get_dummies(..., drop_first = True)

# A:
states = pd.get_dummies(churn.state, drop_first=True)
states.head(3)  # drop_first=True drops the first category alphabetically, in this case AK

states.shape  # 50 states plus DC = 51 categories; one is dropped as the reference, leaving 50 columns
(3333, 50)

## Concatenate back to a single dataset
churn = pd.concat([churn, states], axis=1)

5. Create a version of the churn data that has no missing values.

Use dropna(). Calculate the shape

# A:
churn_nona = churn.dropna()
churn_nona.info()
churn_nona.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2933 entries, 0 to 3332
Columns: 120 entries, state to WY
dtypes: bool(1), float64(10), int64(8), object(1), uint8(100)
memory usage: 747.6+ KB
(2933, 120)

6. Create a target vector and predictor matrix.

  • Target should be the churn column.

  • Predictor matrix should be all columns except area_code, state, and churn.

# A:
X = churn_nona.drop(['area_code', 'state', 'churn'], axis=1)
y = churn_nona.churn.values

7. Calculate the baseline accuracy for churn.

What percent of the churn target values (y) == 1? (this is just the average value of the column. Why is that?)

# A:
y.mean()
# y.sum()
# y.sum() / len(y)
churn_nona.churn.mean()  # less than 0.5
0.14353903852710534
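The "why is that?" above comes down to arithmetic: summing a 0/1 vector counts the ones, so dividing by the length gives the fraction of ones, which is exactly the mean. A toy vector makes it concrete:

```python
# For a binary (0/1) vector, mean == proportion of ones.
import numpy as np

y = np.array([0, 0, 1, 0, 1])
rate = y.mean()           # (0 + 0 + 1 + 0 + 1) / 5
print(rate)               # fraction of ones in the vector
```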

8. Cross-validate a KNN model predicting churn.

  • Number of neighbors should be 5.

  • Make sure to standardize the predictor matrix.

  • Set cross-validation folds to 10.

Report the mean cross-validated accuracy.

# A:
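One way to sketch this step, shown on synthetic data from `make_classification` so the cell runs on its own (swap in the `X` and `y` built in step 6):

```python
# Sketch of step 8: standardize, then 10-fold cross-validate a 5-NN classifier.
# Synthetic data stands in for the churn predictors here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardize the predictors: KNN is distance-based, so features on large
# scales would otherwise dominate the distance calculation.
Xs = StandardScaler().fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, Xs, y, cv=10)
print(scores.mean())
```

Standardizing before KNN matters because Euclidean distance treats every unit of every feature equally; minutes-of-use columns in the hundreds would swamp 0/1 plan flags without it.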

9. Iterate from k=1 to k=49 (only odd k) and cross-validate the accuracy of the model for each.

Plot the cross-validated mean accuracy for each score. What is the best accuracy?

# A:
k_values = list(range(1, 50, 2))
accs = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, Xs, y, cv=10)  # Xs: the standardized predictor matrix from step 8
    accs.append(np.mean(scores))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_values, accs, lw=3)
plt.show()
print(np.max(accs))
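For reference, a self-contained version of the k sweep, run on synthetic data (`Xs` here stands in for the standardized churn predictors from step 8):

```python
# Sweep odd k from 1 to 49, cross-validating a KNN classifier at each k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
Xs = StandardScaler().fit_transform(X)

k_values = list(range(1, 50, 2))
accs = [cross_val_score(KNeighborsClassifier(n_neighbors=k), Xs, y, cv=10).mean()
        for k in k_values]

best_k = k_values[int(np.argmax(accs))]
print(best_k, max(accs))
```

Only odd k are used so that, in a two-class problem, the neighbor vote can never tie.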