DSCI 100 - Introduction to Data Science
Lecture 11 - Introduction to Clustering
2019-03-28
A quick refresher of previous topics!
Supervised Learning
A machine learning task of training a model on labeled data, which can then be used to predict outputs for new observations.
Classification
Example Problems:
Predicting type of fruit
Predicting Yes/No to kidney disease
Predicting credit card fraud
DSCI 100 Algorithms:
KNN classification
Regression
Example Problems:
Predicting house prices
Predicting movie box office sales
Predicting Olympic medal count
DSCI 100 Algorithms:
KNN Regression
Linear Regression
Now let's consider a different type of Machine Learning!
Unsupervised Learning
A machine learning task of finding structure and patterns in unlabeled data.
Clustering
Dimensionality Reduction
Clustering
Within the unsupervised world, the task of grouping similar objects together is called clustering.
Examples
Clustering similar people (market segmentation) or similar movies
Examples
Clustering similar products
Quick Test by show of hands! (Hands up if it's a clustering problem.)
Estimate the score a student will get on the MCAT
Determine if there are patterns in the people who take the MCAT
Separate photos into groups based on the labeled location where each photo was taken
Separate photos into 5 distinct piles
Quick Test - Answers
Estimate the score a student will get on the MCAT: Regression
Determine if there are patterns in the people who take the MCAT: Clustering
Separate photos into groups based on the labeled location where each photo was taken: Classification
Separate photos into 5 distinct piles: Clustering
K-means Clustering Algorithm
1. Specify k, the number of clusters to group the observations into
2. Initialize cluster assignments by randomization
3. Calculate the average coordinates (center) of each cluster
4. Relabel each observation based on its distance to each cluster center
5. Recalculate the average coordinates for each cluster center given the new labels
6. Repeat steps 4 and 5 until the labels no longer change
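The steps above can be sketched as a toy implementation. (This is illustrative only; in this course we use R's built-in `kmeans`. The function name and the balanced random start below are our own simplifications.)

```python
import numpy as np

def toy_kmeans(points, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 2: random initial cluster assignments
    # (balanced so no cluster starts out empty -- a simplification;
    # real implementations handle empty clusters explicitly)
    labels = rng.permutation(np.arange(len(points)) % k)
    for _ in range(max_iter):
        # Steps 3/5: average coordinates of each cluster's points
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: relabel each point by its nearest cluster center
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop once the labels no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers
```

On well-separated data this converges to the obvious grouping in a few iterations.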
Model Selection
So far, to measure a model's goodness of fit we have used accuracy/error rate for classification and RMSE/RMSPE for regression. For clustering we need other measures, because there is no label or "right answer" to compare against (the data are unlabeled).
Within-cluster sum of squares
The sum of the squared deviations between each observation and its cluster center
A larger value suggests a more spread-out cluster; a smaller value suggests a more compact one
One value per cluster
Total within-cluster sum of squares
The sum of the within-cluster sums of squares across all clusters
Used to evaluate overall model variability and to compare models
One value per model
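As a sketch (in Python for illustration; the function name is our own, not part of R's `kmeans` output), the per-cluster and total within-cluster sums of squares can be computed as:

```python
import numpy as np

def within_cluster_ss(points, labels, centers):
    k = len(centers)
    # one value per cluster: squared deviations from that cluster's center
    wss = np.array([((points[labels == j] - centers[j]) ** 2).sum()
                    for j in range(k)])
    # one value per model: the total within-cluster sum of squares
    return wss, wss.sum()
```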
Model Selection
Other model parameters of K-means and their interpretations
Total Sum of Squares
The sum of squared deviations between each point and the overall center (grand mean) of all the points
Between Sum of Squares
The sum of squared deviations between each cluster center and the overall center of the points, weighted by cluster size
If there are no clearly discernible groups/clusters, the between sum of squares will be a small fraction of the total sum of squares
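These quantities decompose: total SS = total within-cluster SS + between SS. A sketch of that decomposition (the function name is our own, and we assume labels are integer cluster indices):

```python
import numpy as np

def ss_decomposition(points, labels, centers):
    grand = points.mean(axis=0)            # overall center of all the points
    totss = ((points - grand) ** 2).sum()  # total sum of squares
    # within SS: each point against its own cluster's center
    withinss = ((points - centers[labels]) ** 2).sum()
    # between SS: center-to-grand-center deviations, weighted by cluster size
    sizes = np.array([(labels == j).sum() for j in range(len(centers))])
    betweenss = (sizes[:, None] * (centers - grand) ** 2).sum()
    return totss, withinss, betweenss
```

With clear clusters, the between SS is close to the total SS; with no structure it is a small fraction of it.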
Model Selection
Randomization
The initial randomization of cluster labels affects the final clustering.
The K-means algorithm does not converge to a single, identical, best answer every time you run it.
It is therefore important to
set.seed()
for reproducibility, and to use the
nstart
argument in R's implementation of the algorithm.
kmeans(data, centers, nstart, ...)
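The effect of nstart can be mimicked in a sketch: run the algorithm from several random starts and keep the run with the smallest total within-cluster sum of squares. (A toy Python illustration with our own function names; R's kmeans does this for you when nstart > 1.)

```python
import numpy as np

def kmeans_once(points, k, seed, max_iter=100):
    """One run of the K-means steps from a random balanced start."""
    rng = np.random.default_rng(seed)
    labels = rng.permutation(np.arange(len(points)) % k)
    for _ in range(max_iter):
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        new = ((points[:, None] - centers) ** 2).sum(axis=2).argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    wss = ((points - centers[labels]) ** 2).sum()
    return labels, wss

def kmeans_restarts(points, k, nstart=10):
    # like R's nstart: keep the restart with the lowest total within-cluster SS
    runs = [kmeans_once(points, k, seed) for seed in range(nstart)]
    return min(runs, key=lambda run: run[1])
```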
Multivariate K Means Clustering
We considered K-means with two features for ease of visualization, but the algorithm extends to any number of features. (Remember the Netflix and Amazon examples.)
Advantages and Disadvantages
Advantages
Easy to implement and interpret
Simple to understand
Computationally efficient compared to many other clustering algorithms
Disadvantages
Need to specify K
No guarantee of finding the optimal solution
Dependent on initialization, so results can vary between runs
Sensitive to scale of features (Need to normalize/standardize)