CoCalc -- 11_clustering.ipynb

GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/11_clustering.ipynb
²⁰⁵¹ views

Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 11 - Introduction to Clustering

2019-03-28

In [1]:

library(tidyverse)

Out[1]:

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

A quick refresher of previous topics!

Supervised Learning

A Machine learning task of training a model on labeled data which can be used to predict an output.

Classification

Example Problems:

Predicting type of fruit
Predicting Yes/No to kidney disease
Predicting credit card fraud

DSCI 100 Alogrithms:

KNN classification

Regression

Example Problems:

Predicting house prices
Predicting movie box office sales
Predicting Olympic medal count

DSCI 100 Algorithms:

KNN Regression
Linear Regression

Now let's consider a different type of Machine Learning!

Unsupervised Learning

A Machine learning task of finding structure and patterns in unlabeled data.

Clustering
Dimensionality Reduction

Clustering

Within the Unsupervised world, the task of grouping objects that are similar to each other is called Clustering

Examples

Clustering similar People (Market segmentation) or similar Movies

Examples

Clustering similar products

Quick Test by show of hands! (hands up for clustering problem)

Estimate the score a student will get on the MCAT
Determine if there are patterns in the people who take the MCAT
Separate photos into groups based on the labeled location the photo was taken
Separate photos into 5 distinct piles

Quick Test - Answers

Estimate the score a student will get on the MCAT -Regression
Determine if there are patterns in the people who take the MCAT -Clustering
Separate photos into groups based on the labeled location the photo was taken -Classification
Separate photos into 5 distinct piles -Clustering

K - Means Clustering algorithm

Specify k number of clusters to group the observations
Initialize cluster assignments by randomization
Calculate average coordinates for each cluster center
Label clusters based on distances from the cluster center
Recalcualte average coordinates for each cluster center given the new labels
Repeat steps 4 and 5 until there is no change in the labels

Model Selection

So far, to measure a models goodness of fit we have looked at accuracy/error rate for classification, and RMSE/RMSPE for regression. For clustering, we have to use other measures because there is not a clearly defined answer/label we can compare to. (Unsupervised)

Within-cluster sum of squares

The sum of the squared deviations between each observation and the cluster centers
The larger values suggest a spread out cluster, compared to a small value which suggests a more compact cluster
A value for each cluster

Total within-cluster sum of squares

The total sum of the within-cluster SS
Used to evaluate the overall model variability and to compare this to other models
One value for each model

Model Selection

Other model parameters of K-means and their interpretations

Total Sum of Squares

The total sum of squares between each point and the overall center of all the points

Between Sum of Squares

The sum of squared distances between each cluster center and the overall center of the points
If there are no very clear discernible groups/clusters the between sum of squares will be a very small fraction of the Total Sum of Squares

Model Selection

Randomization

The initial randomization of cluster labels is relevant to the final clustering.
The K means algorithm doesnt not converge to a single identical and best answer everytime you run it
It is important to set.seed() and utilize the nstart argument in R's implementation of the algorithm.

kmeans(data, centers, **nstart**, ...)

Multivariate K Means Clustering

We considered K Means in the case of two features for ease of visualization. But this can be extended to many features. (Remeber Netflix and Amazon Examples)

Advantages and Disadvantages

Advantages

Easy to implement and interpret
Simple to understand
Computationally more efficient than other clustering algorithms

Disadvantages

Need to specify K
No Optimal Solution
Dependant on Initialization and Lacks Consistency
Sensitive to scale of features (Need to normalize/standardize)

What did we learn today?

In [ ]:

DSCI 100 - Introduction to Data Science

Lecture 11 - Introduction to Clustering

2019-03-28

A quick refresher of previous topics!

Supervised Learning

Classification

Regression

Now let's consider a different type of Machine Learning!

Unsupervised Learning

Clustering

Examples

Examples

Quick Test by show of hands! (hands up for clustering problem)

Quick Test - Answers

K - Means Clustering algorithm

Model Selection

Model Selection

Model Selection

Model Selection

Model Selection

Randomization

Multivariate K Means Clustering

Advantages and Disadvantages

What did we learn today?

Product

Resources

Company