To follow along, all the code for this documentation can be found here.
When working with k-means and hierarchical clustering (this documentation assumes that you're familiar with these two algorithms), the Calinski-Harabasz index and bootstrap evaluation are two useful tools. The former is an estimate that can help us choose a proper number of clusters before running the algorithm. The latter is used for evaluating the stability of the clustering result.
Diving In
Here we use the built-in mtcars dataset as an example. Note that for clustering algorithms, the units of the variables do matter. When the columns come in different magnitudes (e.g. weight is recorded in kilograms, height is recorded in centimeters), a common approach to resolve this issue is to normalize the dataset. Here we use the scale function to do this, which transforms all the variables to have a mean of 0 and a standard deviation of 1.
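A minimal sketch of this step (the object name mtcars_scaled matches the one referenced later in the text):

```r
# standardize every column of mtcars so each variable has
# mean 0 and standard deviation 1 before clustering
mtcars_scaled <- scale(mtcars)
apply(mtcars_scaled, 2, sd)  # every column's standard deviation should now be 1
```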
Choosing the Right Number of Clusters
We start off by using hierarchical clustering in R to cluster the scaled data. Given that the dataset only contains `r nrow(mtcars_scaled)` rows, we can use the dendrogram to visualize the clustering result, and from the plot below, clustering this dataset into four groups seems quite reasonable and straightforward.
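A sketch of this step; the choice of euclidean distance and Ward linkage here is an assumption for illustration, not necessarily the exact call behind the plot:

```r
# hierarchical clustering on the scaled data and a quick dendrogram
d  <- dist(mtcars_scaled, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc)                # dendrogram of the cars
rect.hclust(hc, k = 4)  # outline a four-group solution
groups <- cutree(hc, k = 4)
```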
Now what if we have, say, one thousand rows of data or more? Then the dendrogram will probably be a blur and will most likely be useless in helping us choose a suitable number of clusters (k) to group the data. Not to mention that with k-means you have to specify k before you run the algorithm.
So here we'll be using two measures, the Calinski-Harabasz index, also known as the variance ratio criterion, and the total within sum of squares, for choosing a suitable k. Let's use the function and visualize both measures first, so they'll be easier to explain.
CHCriterion Function that calculates both measures for different values of k. Input parameters:
data : Quite straightforward, your dataset. Both data.frame and matrix type data will work.
kmax : Maximum number of clusters. Both measures are calculated from 2 clusters up to kmax clusters (clustering the data into 1 cluster is the same as performing no clustering, therefore its measures are omitted).
clustermethod : Specify either "hclust" or "kmeanspp" to calculate the measures for hierarchical or k-means++ clustering. For those of you who are not familiar with k-means++, it is a simple algorithm built on top of k-means so that it can generate better initial random centers. You can refer to this link for an informal explanation of how it works and the simple code for the algorithm. If it's still unclear then perhaps google is your best friend. Note that under the hood, the helper function that we've sourced also sources the Kmeanspp function from another script.
... : After specifying the clustermethod, you can pass other parameters to the R hclust and kmeans functions. Please refer to the help pages (?hclust or ?kmeans in R) for more information.

The function returns a list consisting of:
data : Data frame of three columns, consisting of the cluster numbers from 2 to the specified kmax and the corresponding values of both measures, each in its own column.
plot : A side-by-side facet plot visualizing both measures' values.
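A usage sketch based on the parameters described above, assuming the helper script defining CHCriterion has already been sourced (the extra method argument forwarded through ... is just an illustration):

```r
# compute both measures for k = 2 to 10 using hierarchical clustering
criteria <- CHCriterion(data = mtcars_scaled, kmax = 10,
                        clustermethod = "hclust", method = "ward.D2")
head(criteria$data)  # cluster number and the two measures, one column each
criteria$plot        # side-by-side facet plot of both measures
```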
Total within Sum of Squares : The math formula to the measure.

$$\text{TWSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \bar{x}_i \rVert^2$$

Where $k$ denotes the number of clusters, $x$ is a data point, $C_i$ is the ith cluster, $\bar{x}_i$ is the centroid of cluster $i$, and $\lVert \cdot \rVert$ is the L2 norm (Euclidean distance) between the two vectors.
The calculation of the formula can be divided into two small parts. The within sum of squares for a single cluster is the squared distance (note that it is the "squared" distance, do not square root it) of each point in the cluster from that cluster's centroid (the centroid is obtained by taking the mean of all the points in that cluster). The total within sum of squares is then the sum of the within sum of squares of all the clusters. This measure will decrease as the number of clusters increases, because each cluster will be smaller and tighter. So what we're hoping for is that the measure will keep on decreasing up until the optimal cluster number, and that the decrease will start to flatten out after that. Looking back at the plot on the right, if you look closely there's an "elbow" at cluster 4, where the magnitude of the decrease starts dropping. This "elbow", however, can sometimes be hard to see and can be quite subjective.
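An illustrative sketch of the calculation, reusing the four-group assignment from the hierarchical clustering sketch above:

```r
# total within sum of squares computed by hand:
# for each cluster, sum the squared distances of its members to the centroid
twss <- sum(sapply(unique(groups), function(g) {
    members  <- mtcars_scaled[groups == g, , drop = FALSE]
    centroid <- colMeans(members)
    sum(sweep(members, 2, centroid)^2)  # squared, not square-rooted, distances
}))
twss
```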
Calinski-Harabasz Index : The math formula to the measure.

$$CH(k) = \frac{B(k) / (k - 1)}{W(k) / (N - k)}$$

Where $k$ is the number of clusters, $N$ is the total number of observations (data points), $W(k)$ is the overall within-cluster variance (equivalent to the total within sum of squares calculated above), and $B(k)$ is the overall between-cluster variance.
You can calculate $B(k)$ by using the total sum of squares (tss) minus $W(k)$. For a given dataset, the total sum of squares (tss) is the squared distance of all the data points from the dataset's centroid; this measure is independent of the number of clusters.
For clarity, $B(k)$ measures the variance of all the cluster centroids from the dataset's grand centroid (a big value means that the centroids of the clusters are spread out and not too close to each other), and we already know that $W(k)$ will keep on decreasing as the number of clusters goes up. Therefore, for the Calinski-Harabasz index, the ratio $B(k) / W(k)$ should be the biggest at the optimal clustering size.
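An illustrative sketch of the index for a single k, using the sums of squares that kmeans already reports (the choice of k = 4 and nstart here is just for demonstration):

```r
# Calinski-Harabasz index for one value of k
k  <- 4
km <- kmeans(mtcars_scaled, centers = k, nstart = 10)
N  <- nrow(mtcars_scaled)
W  <- km$tot.withinss             # overall within-cluster variance
B  <- km$totss - km$tot.withinss  # between-cluster variance = tss - W
(B / (k - 1)) / (W / (N - k))     # the index itself
```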
Looking back at the plot on the left, you can clearly see that this measure is the largest at a cluster size of 4. One important thing when using this measure is that sometimes it reaches its optimum at 2 clusters; however, grouping the data points into 2 clusters might not be ideal for you. When that happens, the local maximum of the measure (where it drops and then rises again) should be your ideal cluster number.
For the scaled mtcars dataset, the total within sum of squares, the Calinski-Harabasz index and the straightforward way of looking at the dendrogram all suggest a cluster size of 4. But this probably won't be true for every dataset. When the suggestions disagree, it is up to you to decide which heuristic's suggestion on the cluster size is more reasonable.
Bootstrap Evaluation
Now that we've decided the suitable number of clusters (k) for our clustering algorithm, we'll use the bootstrap method to evaluate the stability of the clustering result. To be specific about why we're doing this: often, clustering algorithms will produce several clusters that represent actual groupings of the data, and then one or two clusters that represent "others", meaning that they're made up of data points that have no relationship with each other; they just don't fit anywhere else. So here we'll be using the bootstrap to detect which cluster is that "other" cluster. The steps are listed in the following:
Cluster the original data.
Draw a new dataset of the same size as the original by resampling the original dataset with replacement, so some data points may show up more than once, while others may not show up at all. Cluster this new dataset.
For every cluster in the original clustering, find the most similar cluster in the new clustering, which is the one with the maximum jaccard similarity (given two sets, the jaccard similarity is the size of their intersection divided by the size of their union; see the short sketch after these steps).
Repeat steps 2 and 3 for a user-specified number of bootstrap iterations.
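A minimal sketch of the jaccard similarity used in step 3, with clusters represented as sets of row indices (the function name is just for illustration):

```r
# jaccard similarity between two clusters given as vectors of row indices
jaccard <- function(cluster1, cluster2) {
    length(intersect(cluster1, cluster2)) / length(union(cluster1, cluster2))
}
jaccard(c(1, 2, 3, 4), c(2, 3, 4, 5))  # 3 shared out of 5 distinct points = 0.6
```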
ClusterBootstrap Function that implements the bootstrap. Input parameters:
data : Your data.frame or matrix type data. data.frame is preferred; the function will convert matrix type data to data.frame under the hood.
k : Specify the number of clusters for your clustering algorithm.
noise.cut : If specified, resulting clusters whose size is smaller than this number will be considered noise, and all of these noise clusters will be grouped together as one whole cluster. Defaults to 0. (Not used in this documentation.)
bootstrap : Number of bootstrap iterations. Defaults to 100.
dissolve : If the jaccard similarity is smaller than this number, the cluster is considered to be "dissolved", that is, it did not show up in the new clustering. A cluster that dissolves too often is most likely not a real cluster. Defaults to 0.5.
clustermethod : Same as the CHCriterion function above; specify either "hclust" or "kmeanspp" for hierarchical or k-means++ clustering.
... : Same as the CHCriterion function above; other parameters that can be passed to the R hclust and kmeans functions.

The function returns a list consisting of:
result : The original clustering result object.
bootmean : Mean of the jaccard similarity for each cluster across the specified bootstrap iterations.
partition : The original clustering result, a vector specifying which group each data point belongs to.
clusternum : Final number of clusters; if you specified noise.cut it might be different from k.
bootdissolved : Number of times each cluster's jaccard similarity is smaller than the specified dissolve value.
Here we'll use "kmeanspp" as our clustermethod so we can demonstrate some ... parameters you can pass to the kmeans function.
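A usage sketch based on the parameters described above, again assuming the helper script has been sourced; iter.max and nstart are examples of parameters forwarded to kmeans through ..., and the object name matches the one referenced below:

```r
# bootstrap evaluation of a k = 4 kmeans++ clustering
boot_clust <- ClusterBootstrap(data = mtcars_scaled, k = 4, bootstrap = 100,
                               clustermethod = "kmeanspp",
                               iter.max = 100, nstart = 10)
boot_clust$bootmean        # mean jaccard similarity per cluster
boot_clust$bootdissolved   # how often each cluster dissolved
```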
From the values of bootdissolved (the number of times each cluster dissolved) and the bootmean values, we can infer that, having a low bootmean and a high bootdissolved value, cluster `r which.max(boot_clust$bootdissolved)` has the characteristics of what we've been calling the "other" cluster. Therefore, it is quite likely that it is not an actual cluster; its members simply don't belong anywhere else.