GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/07_classification_continued.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 7 - Classification II: Evaluating & Tuning

Housekeeping

  • Thanks for filling out the mid-course survey! We're collecting and processing responses.

  • Grades are posted for Tutorial 04, Quiz 1, and Worksheet 06

Continuing with the classification problem

Recall: cancer tumour cell data, with "benign" and "malignant" labels

Today: unanswered questions from last week

  1. Is our model any good? How do we evaluate it?

  2. How do we choose K in K-nearest neighbours classification?

Evaluating the Model

To add evaluation into our classification pipeline, we:

  1. Split our data into two subsets: training data and testing data.
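A minimal sketch of this split using caret (assuming a data frame `cancer` whose factor column `Class` holds the "benign"/"malignant" labels; these names are placeholders, not the actual worksheet code):

```r
library(caret)

set.seed(1234)  # make the random split reproducible

# createDataPartition shuffles and stratifies by Class,
# putting roughly 75% of the rows into the training set
train_rows <- createDataPartition(cancer$Class, p = 0.75, list = FALSE)

cancer_train <- cancer[train_rows, ]
cancer_test  <- cancer[-train_rows, ]
```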

Evaluating the Model

  2. Build the model & choose K using training data only

  3. Compute accuracy by predicting labels on testing data only
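A rough sketch of these two steps with caret, continuing the split above (K is fixed at 5 purely for illustration; how to choose it comes later):

```r
library(caret)

# fit a K-nearest neighbours classifier on the training data only,
# centering and scaling the predictors as part of the pipeline
knn_fit <- train(Class ~ ., data = cancer_train,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = 5))

# predict labels for the held-out test set and summarize accuracy
# (Class is assumed to be a factor)
test_pred <- predict(knn_fit, newdata = cancer_test)
confusionMatrix(test_pred, cancer_test$Class)
```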

Why?

Showing your classifier the labels of evaluation data is like cheating on a test; it'll look more accurate than it really is

Golden Rule of Machine Learning / Statistics: Don't use your testing data to train your model!


Splitting Data

There are two important things to do when splitting data.

  1. Shuffling: randomly reorder the data before splitting

  2. Stratification: make sure the two split subsets of data have roughly equal proportions of the different labels

Why? Discuss in your groups for 1 minute!

(thankfully, caret does both of these automatically)
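A quick sanity check, continuing the split sketch above: the label proportions in the two subsets should come out roughly equal.

```r
# compare the class proportions in the training and testing sets
prop.table(table(cancer_train$Class))
prop.table(table(cancer_test$Class))
```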

Choosing K (or, "tuning" the model)

Want to choose K to maximize accuracy, but:

  • we can't use training data to evaluate accuracy (cheating!)

  • we can't use test data to evaluate accuracy (choosing K is part of training!)

Solution: Split the training data further into training data and validation data


2a. Choose some candidate values of K
2b. Train the model for each using training data only
2c. Evaluate accuracy for each using validation data only
2d. Pick the K that maximizes validation accuracy
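A hand-rolled sketch of steps 2a-2d, using the same placeholder `cancer_train` data frame as above (in practice caret can automate this via cross-validation, shown next):

```r
library(caret)

set.seed(5678)

# 2a. candidate values of K
ks <- c(1, 3, 5, 7, 9, 11)

# split the *training* data again: part for fitting, part for validation
sub_rows          <- createDataPartition(cancer_train$Class, p = 0.75, list = FALSE)
cancer_subtrain   <- cancer_train[sub_rows, ]
cancer_validation <- cancer_train[-sub_rows, ]

# 2b + 2c. fit on the sub-training set, then score on the validation set, for each K
accuracies <- sapply(ks, function(k) {
  fit  <- train(Class ~ ., data = cancer_subtrain,
                method     = "knn",
                preProcess = c("center", "scale"),
                trControl  = trainControl(method = "none"),  # skip caret's internal resampling
                tuneGrid   = data.frame(k = k))
  pred <- predict(fit, newdata = cancer_validation)
  mean(pred == cancer_validation$Class)
})

# 2d. pick the K with the highest validation accuracy
best_k <- ks[which.max(accuracies)]
best_k
```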

Cross-Validation

We can get a better estimate of accuracy by splitting the training data multiple ways and averaging the validation accuracies
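In caret this is handled by trainControl; a sketch with 5-fold cross-validation over a grid of candidate K values (again on the placeholder `cancer_train`):

```r
library(caret)

set.seed(2019)

# 5-fold cross-validation: split the training data five ways,
# fit on four folds, evaluate on the fifth, and average the accuracies
cv_control <- trainControl(method = "cv", number = 5)

knn_cv <- train(Class ~ ., data = cancer_train,
                method     = "knn",
                trControl  = cv_control,
                preProcess = c("center", "scale"),
                tuneGrid   = data.frame(k = seq(1, 21, by = 2)))

knn_cv$bestTune  # the K with the highest cross-validated accuracy
```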

Underfitting & Overfitting

Overfitting: when your model is too sensitive to your training data; noise can influence predictions!

Underfitting: when your model isn't sensitive enough to training data; useful information is ignored!

Which of these are under-, over-, and good fits? Discuss in your groups for 1 minute!

Underfitting & Overfitting

For KNN: small K overfits, large K underfits, and both lead to poor accuracy on new data
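One way to see this, assuming the cross-validated `knn_cv` object from the sketch above: look at how the estimated accuracy changes with K and find the peak between the two extremes.

```r
# estimated accuracy for each candidate K;
# very small K chases noise (overfits), very large K washes out structure (underfits)
knn_cv$results[, c("k", "Accuracy")]
plot(knn_cv)  # accuracy vs. K
```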

The Big Picture

Worksheet Time!

...and if we've learned anything from last time,



Class Activity

In your group, discuss the following prompts. Post your group's answer on Piazza:

  • Explain what test, validation, and training data sets are, in your own words

  • Explain cross-validation in your own words

  • Imagine if we train and evaluate accuracy on all of the data. How can I always get 100% accuracy?

  • Why can't I use cross-validation when testing?