Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/11_clustering.ipynb
2051 views
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 11 - Introduction to Clustering

2019-03-28

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ── ✔ ggplot2 3.1.0 ✔ purrr 0.3.1 ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1 ✔ tidyr 0.8.3 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.4.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()

A quick refresher of previous topics!

Supervised Learning

  • A Machine learning task of training a model on labeled data which can be used to predict an output.

Classification

Example Problems:

  • Predicting type of fruit

  • Predicting Yes/No to kidney disease

  • Predicting credit card fraud

DSCI 100 Alogrithms:

  • KNN classification

Regression

Example Problems:

  • Predicting house prices

  • Predicting movie box office sales

  • Predicting Olympic medal count

DSCI 100 Algorithms:

  • KNN Regression

  • Linear Regression

Now let's consider a different type of Machine Learning!

Unsupervised Learning

A Machine learning task of finding structure and patterns in unlabeled data.

  • Clustering

  • Dimensionality Reduction

Clustering

  • Within the Unsupervised world, the task of grouping objects that are similar to each other is called Clustering

Examples

Clustering similar People (Market segmentation) or similar Movies

Examples

Clustering similar products

Quick Test by show of hands! (hands up for clustering problem)

  1. Estimate the score a student will get on the MCAT

  2. Determine if there are patterns in the people who take the MCAT

  3. Separate photos into groups based on the labeled location the photo was taken

  4. Separate photos into 5 distinct piles

Quick Test - Answers

  1. Estimate the score a student will get on the MCAT -Regression

  2. Determine if there are patterns in the people who take the MCAT -Clustering

  3. Separate photos into groups based on the labeled location the photo was taken -Classification

  4. Separate photos into 5 distinct piles -Clustering

K - Means Clustering algorithm

  1. Specify k number of clusters to group the observations

  2. Initialize cluster assignments by randomization

  3. Calculate average coordinates for each cluster center

  4. Label clusters based on distances from the cluster center

  5. Recalcualte average coordinates for each cluster center given the new labels

  6. Repeat steps 4 and 5 until there is no change in the labels

Model Selection

So far, to measure a models goodness of fit we have looked at accuracy/error rate for classification, and RMSE/RMSPE for regression. For clustering, we have to use other measures because there is not a clearly defined answer/label we can compare to. (Unsupervised)

Within-cluster sum of squares

  • The sum of the squared deviations between each observation and the cluster centers

  • The larger values suggest a spread out cluster, compared to a small value which suggests a more compact cluster

  • A value for each cluster

Total within-cluster sum of squares

  • The total sum of the within-cluster SS

  • Used to evaluate the overall model variability and to compare this to other models

  • One value for each model

Model Selection

Other model parameters of K-means and their interpretations

Total Sum of Squares

  • The total sum of squares between each point and the overall center of all the points

Between Sum of Squares

  • The sum of squared distances between each cluster center and the overall center of the points

  • If there are no very clear discernible groups/clusters the between sum of squares will be a very small fraction of the Total Sum of Squares

Model Selection

Model Selection

Model Selection

Randomization

  • The initial randomization of cluster labels is relevant to the final clustering.

  • The K means algorithm doesnt not converge to a single identical and best answer everytime you run it

  • It is important to set.seed() and utilize the nstart argument in R's implementation of the algorithm.

kmeans(data, centers, **nstart**, ...)

Multivariate K Means Clustering

  • We considered K Means in the case of two features for ease of visualization. But this can be extended to many features. (Remeber Netflix and Amazon Examples)

Advantages and Disadvantages

Advantages

  • Easy to implement and interpret

  • Simple to understand

  • Computationally more efficient than other clustering algorithms

Disadvantages

  • Need to specify K

  • No Optimal Solution

  • Dependant on Initialization and Lacks Consistency

  • Sensitive to scale of features (Need to normalize/standardize)

What did we learn today?