GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/07_classification_continued.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 7 - Classification II: Evaluating & Tuning

Housekeeping

  • Thanks for filling out the mid-course survey! We're collecting and processing responses.

  • Grades are posted for Tutorial 04, Quiz 1, and Worksheet 06

Continuing with the classification problem

Recall: cancer tumour cell data, with "benign" and "malignant" labels

Today: unanswered questions from last week

  1. Is our model any good? How do we evaluate it?

  2. How do we choose K in K-nearest neighbours classification?

Evaluating the Model

To add evaluation into our classification pipeline, we:

  1. Split our data into two subsets: training data and testing data.
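A minimal sketch of this split using caret (assuming a data frame `cancer` whose factor column `Class` holds the "benign"/"malignant" labels; these names are placeholders, not the actual worksheet code):

```r
library(caret)

set.seed(1234)  # make the random split reproducible

# createDataPartition shuffles and stratifies by Class,
# putting roughly 75% of the rows into the training set
train_rows <- createDataPartition(cancer$Class, p = 0.75, list = FALSE)

cancer_train <- cancer[train_rows, ]
cancer_test  <- cancer[-train_rows, ]
```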

Evaluating the Model

  2. Build the model & choose K using training data only

  3. Compute accuracy by predicting labels on testing data only
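A rough sketch of these two steps with caret, continuing the split above (K is fixed at 5 purely for illustration; how to choose it comes later):

```r
library(caret)

# fit a K-nearest neighbours classifier on the training data only,
# centering and scaling the predictors as part of the pipeline
knn_fit <- train(Class ~ ., data = cancer_train,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = 5))

# predict labels for the held-out test set and summarize accuracy
# (Class is assumed to be a factor)
test_pred <- predict(knn_fit, newdata = cancer_test)
confusionMatrix(test_pred, cancer_test$Class)
```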

Why?

Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is

Golden Rule of Machine Learning / Statistics: Don't use your testing data to train your model!


Splitting Data

There are two important things to do when splitting data.

  1. Shuffling: randomly reorder the data before splitting

  2. Stratification: make sure the two split subsets of data have roughly equal proportions of the different labels

Why? Discuss in your groups for 1 minute!

(thankfully, caret does both of these automatically)
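A quick sanity check, continuing the split sketch above: the label proportions in the two subsets should come out roughly equal.

```r
# compare the class proportions in the training and testing sets
prop.table(table(cancer_train$Class))
prop.table(table(cancer_test$Class))
```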

Choosing K (or, "tuning" the model)

Want to choose K to maximize accuracy, but:

  • we can't use training data to evaluate accuracy (cheating!)

  • we can't use test data to evaluate accuracy (choosing K is part of training!)

Solution: Split the training data further into training data and validation data


2a. Choose some candidate values of K
2b. Train the model for each using training data only
2c. Evaluate accuracy for each using validation data only
2d. Pick the K that maximizes validation accuracy
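A hand-rolled sketch of steps 2a-2d, using the same placeholder `cancer_train` data frame as above (in practice caret can automate this via cross-validation, shown next):

```r
library(caret)

set.seed(5678)

# 2a. candidate values of K
ks <- c(1, 3, 5, 7, 9, 11)

# split the *training* data again: part for fitting, part for validation
sub_rows          <- createDataPartition(cancer_train$Class, p = 0.75, list = FALSE)
cancer_subtrain   <- cancer_train[sub_rows, ]
cancer_validation <- cancer_train[-sub_rows, ]

# 2b + 2c. fit on the sub-training set, then score on the validation set, for each K
accuracies <- sapply(ks, function(k) {
  fit  <- train(Class ~ ., data = cancer_subtrain,
                method     = "knn",
                preProcess = c("center", "scale"),
                trControl  = trainControl(method = "none"),  # skip caret's internal resampling
                tuneGrid   = data.frame(k = k))
  pred <- predict(fit, newdata = cancer_validation)
  mean(pred == cancer_validation$Class)
})

# 2d. pick the K with the highest validation accuracy
best_k <- ks[which.max(accuracies)]
best_k
```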

Cross-Validation

We can get a better estimate of accuracy by splitting the training data multiple ways and averaging the validation accuracies
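In caret this is handled by trainControl; a sketch with 5-fold cross-validation over a grid of candidate K values (again on the placeholder `cancer_train`):

```r
library(caret)

set.seed(2019)

# 5-fold cross-validation: split the training data five ways,
# fit on four folds, evaluate on the fifth, and average the accuracies
cv_control <- trainControl(method = "cv", number = 5)

knn_cv <- train(Class ~ ., data = cancer_train,
                method     = "knn",
                trControl  = cv_control,
                preProcess = c("center", "scale"),
                tuneGrid   = data.frame(k = seq(1, 21, by = 2)))

knn_cv$bestTune  # the K with the highest cross-validated accuracy
```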

Underfitting & Overfitting

Overfitting: when your model is too sensitive to your training data; noise can influence predictions!

Underfitting: when your model isn't sensitive enough to training data; useful information is ignored!

Which of these are under-, over-, and good fits? Discuss in your groups for 1 minute!

Underfitting & Overfitting

For KNN: small K overfits, large K underfits, and both lead to poor accuracy on new data
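One way to see this, assuming the cross-validated `knn_cv` object from the sketch above: look at how the estimated accuracy changes with K and find the peak between the two extremes.

```r
# estimated accuracy for each candidate K;
# very small K chases noise (overfits), very large K washes out structure (underfits)
knn_cv$results[, c("k", "Accuracy")]
plot(knn_cv)  # accuracy vs. K
```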

The Big Picture

Worksheet Time!

...and if we've learned anything from last time,



Class Activity

In your group, discuss the following prompts. Post your group's answer on Piazza:

  • Explain what test, validation, and training data sets are, in your own words

  • Explain cross-validation in your own words

  • Imagine if we train and evaluate accuracy on all of the data. How can I always get 100% accuracy?

  • Why can't I use cross-validation when testing?