Path: blob/main/lab5/lab5-MixtureModels_K-means_Expectation-Maximisation.ipynb
Lab 4: Mixture Models, K-means and Expectation-Maximisation (EM)
In this lab we will focus on the modelling of data generated through a number of distinct distributions (mixture models) and how to extract information from data generated in this way through unsupervised clustering algorithms (k-means, expectation-maximisation).
First, we will import the required packages:
1) Mixture Models
In machine learning it is very common to construct a model of the observed data explicitly through the use of a distribution (e.g. Gaussian, Bernoulli, Categorical etc.). Provided the choice of the distribution type is reasonable, the observed data can often be used to estimate the parameters of the distribution (e.g. mean, variance etc.) that fit the data well. However, in many cases the data distribution is more complex and could potentially be multimodal, which cannot be captured well by a single unimodal distribution. Mixture models provide a conceptually simple solution to this problem by combining multiple distributions to form more complex ones.
What are mixture models? A mixture model is a probabilistic model used to represent variables that have a distribution derived from the combination of a number of different distributions, each corresponding to some subpopulation of the overall population. It is worth noting that within a mixture model it is not required to know which subpopulation each observed data point belongs to.
Formally, each data point can be assumed to come from one of $K$ distributions (subpopulations). A latent (hidden) variable $z$ is used to determine which of the subpopulation distributions is used, and this is modelled by a categorical distribution with $K$ categories, $z \sim \text{Categorical}(\boldsymbol{\pi})$, where $\boldsymbol{\pi}$ is a vector of $K$ elements containing the probabilities of each category. Note that $\sum_{k=1}^{K} \pi_k = 1$.
For simplicity, in this example we will assume that each subpopulation distribution is Gaussian, which makes this an example of a Gaussian Mixture Model (GMM). The parameters of each distribution are its mean $\mu_k$ and covariance matrix $\Sigma_k$.
1.1) Sampling from a Gaussian mixture model
In this section we will go through the key steps involved in sampling from a Gaussian Mixture Model and use this code to generate some example data for us to work with. The code below defines the means and covariance matrices for three Gaussian distributions, in addition to the parameters of the categorical distribution.
Sampling from the mixture model can be achieved by first sampling from the categorical distribution and then sampling from the corresponding Gaussian distribution. Note that this is another example of ancestral sampling seen in the previous lab.
Use the parameters provided to generate N two-dimensional samples from the GMM and store them in a matrix X with shape (N, 2). Hint: use either np.random.choice or np.random.multinomial (with n=1) to choose the categorical indices and then sample using the multivariate_normal function.
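If you get stuck, a minimal sketch of one way to do this is shown below. The parameter values here are placeholders, not the ones defined in the cell above; swap in the means, covariance matrices and category_prob defined earlier.

```python
import numpy as np

# Placeholder parameters -- replace these with the means, covariance matrices
# and category_prob defined in the cells above.
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([0.0, 5.0])]
covariances = [np.eye(2), np.eye(2), np.eye(2)]
category_prob = np.array([0.5, 0.3, 0.2])  # must sum to 1

N = 1000
X = np.zeros((N, 2))
for n in range(N):
    # Ancestral sampling: first draw the latent category, then sample from that Gaussian.
    k = np.random.choice(len(category_prob), p=category_prob)
    X[n] = np.random.multivariate_normal(means[k], covariances[k])
```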
Create a 2-D scatter plot of your samples in the cell below.
Exercises:
Change the values in the category_prob vector (remember the values must sum to 1). What happens to the proportion of samples from each subpopulation?
Keep track of the latent variable for each data point and colour them differently in the scatter plot (hint: the scatter function parameter c is used to colour the points and takes a vector of N integers).
2) K-means Clustering
K-means is a very simple and commonly used example of a clustering algorithm. Clustering is an unsupervised machine learning task in which the goal is to partition points within a dataset into distinct groups (or clusters) such that elements within the same group are similar to each other. K-means is an example of a hard-clustering method which means that each point can only be assigned to a single cluster and there are hard boundaries between clusters.
How does K-means work? The K-means algorithm involves an iterative procedure of multiple steps which can be summarised as follows:
1. Choose a value for the number of clusters, $K$. This is an input to the algorithm and is fixed throughout. There is no single correct way to choose this value; it is typically done by using some domain knowledge of the dataset, through trial and error, or through some other method (e.g. the Elbow Method or the Silhouette Method).
2. Randomly initialise the $K$ cluster centroids. These centroids are points in the data space that represent the mean of a cluster. Typically each of these centroids is initialised at the location of a randomly selected data point.
3. For each data point, compute the Euclidean distance to each cluster centroid. Assign the data point to the cluster corresponding to the closest centroid.
4. For each cluster, update the cluster centroid to be the mean of all of the data points assigned to that cluster.
5. Repeat steps 3 and 4 until the cluster assignments do not change.
For more information on K-means clustering, see Bishop section 9.1.
2.1) Implementing K-means
In this section your task is to implement the key steps of the K-means algorithm. Specifically:
Implement a get_cluster_assignments function that performs step 3 of the above algorithm. This function should return a vector (shape = (N,)) containing the assigned cluster indices corresponding to each data point and should take the parameters:
X - Your data points (shape = (N, 2))
centroid_locations - The location of each cluster centroid (shape = (K, 2))
Hint: use np.linalg.norm to find the Euclidean distances.
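If you want to check your work, here is a minimal sketch of one way this function could be written, assuming X and centroid_locations are NumPy arrays with the shapes described above.

```python
import numpy as np

def get_cluster_assignments(X, centroid_locations):
    # Distance from every point to every centroid: broadcasting
    # (N, 1, 2) - (K, 2) gives (N, K, 2), then take the norm over the last axis.
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroid_locations, axis=2)
    # Index of the closest centroid for each data point: shape (N,).
    return np.argmin(distances, axis=1)
```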
Implement a get_cluster_centroids function that performs step 4 of the above algorithm. This function should return a (K, 2) matrix containing the K cluster centroid locations and take the parameters:
X - Your data points (shape = (N, 2))
cluster_assignments - The current cluster assignments for each data point (shape = (N,)).
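Similarly, a minimal sketch of this function is shown below. Note that it takes the number of clusters K as an extra argument, which is an addition to the signature described above; you may prefer to infer K another way.

```python
import numpy as np

def get_cluster_centroids(X, cluster_assignments, K):
    # New centroid of each cluster = mean of the points currently assigned to it.
    # Note: if a cluster has no assigned points its centroid will be NaN.
    return np.array([X[cluster_assignments == k].mean(axis=0) for k in range(K)])
```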
The cell below sets the value of K (step 1) and performs the initialisation (step 2) of the cluster centroids by selecting K data points at random. It then runs the remaining steps of the K-means algorithm with the functions you implemented. Make sure to read through this code to understand what it's doing and use it to test your implementation.
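That cell is not reproduced here, but a rough sketch of such a driver loop, assuming the data matrix X and the two functions sketched above (including the extra K argument used in the centroid sketch), might look like this:

```python
import numpy as np

K = 3
# Step 2: initialise the centroids at K randomly chosen data points.
initial_indices = np.random.choice(X.shape[0], size=K, replace=False)
centroid_locations = X[initial_indices].copy()

cluster_assignments = get_cluster_assignments(X, centroid_locations)
while True:
    # Steps 3 and 4: recompute the centroids, then reassign points to the nearest centroid.
    centroid_locations = get_cluster_centroids(X, cluster_assignments, K)
    new_assignments = get_cluster_assignments(X, centroid_locations)
    # Step 5: stop once the assignments no longer change.
    if np.array_equal(new_assignments, cluster_assignments):
        break
    cluster_assignments = new_assignments
```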
Exercises:
Run the algorithm 10 or more times. Are the resulting clusters always the same?
Experiment with changing the means, covariance matrices or both parameters when generating the original data. What happens to the consistency of the clustering if you make the cluster distributions overlap more?
2.2) K-means in scikit-learn
Thankfully we do not need to implement K-means from scratch each time and can use scikit-learn's KMeans implementation.
Read through the documentation and run this implementation on the same data. Plot the final result using the plot_k_means function (hint: the centroid locations can be accessed through the cluster_centers_ attribute).
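A minimal sketch of what this might look like is shown below; plot_k_means is the helper defined in the lab, and the call shown here assumes it takes the data, the cluster labels and the centroid locations, which may not match its actual signature.

```python
from sklearn.cluster import KMeans

# Fit scikit-learn's K-means to the same data matrix X.
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)

labels = kmeans.labels_               # cluster index for each data point
centroids = kmeans.cluster_centers_   # final centroid locations, shape (K, 2)

# Assumed signature of the lab's plotting helper -- adjust to match the notebook.
plot_k_means(X, labels, centroids)
```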
2.3) Finding a good K value
As mentioned before, choosing a good K value for your data may not be straightforward. However, for any given clustering it is possible to assign a score based on the sum of squared distances between the data points and their cluster's centroid across the entire dataset. This score gives a measure of how similar data points are to their centroids. The goal is to find the smallest value for K that also gives a reasonably small sum of squared distances.
Use scikit-learn to run K-means with values of K between 1 and 10. Create a plot showing the relationship between the K value and the score (hint: this score can be computed using the score function).
The Elbow Method refers to visually choosing K at the point where there are diminishing returns for increasing K (i.e. the elbow, if you think of the plot as an arm). What value of K would you choose for this data?
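A sketch of one way to produce this plot is shown below. Note that KMeans.score returns the negative of the sum of squared distances, so it is negated here to plot the sum itself (the inertia_ attribute gives the same quantity directly).

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 11)
scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10).fit(X)
    # score(X) is the negative sum of squared distances to the closest centroid.
    scores.append(-kmeans.score(X))

plt.plot(k_values, scores, marker='o')
plt.xlabel('K')
plt.ylabel('Sum of squared distances')
plt.show()
```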
3) Expectation-Maximisation algorithm
One way to think about Expectation-Maximisation (EM) is as a clustering method similar to K-means.
However, there are some key differences:
As we saw, K-means is a hard-clustering method that assigns each point to only one cluster. Expectation-Maximisation can instead be used to perform soft-clustering, in which a belief about each point's cluster assignment is modelled in the form of a distribution.
K-means assigns points to clusters based on the Euclidean distance to the means (centroids) of those clusters. In our case we know that our data is generated from a Gaussian Mixture Model so any given point must come from one of K Gaussian distributions. Using the Euclidean distance to the mean fails to completely capture the likelihood of a point being generated from a given Gaussian because it is possible for points further away to be more likely than points that are closer if there is a larger variance in that direction.
Using a Gaussian Mixture Model, we would like to be able to estimate the model parameters, $\theta = \{\boldsymbol{\pi}, \mu_{1:K}, \Sigma_{1:K}\}$, that maximise the likelihood of seeing the data contained in the dataset, $X$. These parameter estimates can be used to obtain a distribution over the latent cluster assignments, $p(z_n \mid x_n)$, for each data point $x_n$. This set of probabilities (distribution) over the clusters for a given data point is typically referred to as its responsibilities (i.e. how responsible is each cluster for generating this data point). The key problem that EM addresses is that estimating the model parameters directly from the data without knowledge of the cluster assignments is computationally intractable.
Intuitively, the problem of finding both the model parameters and cluster responsibilities can be viewed as a chicken and egg problem. If the parameters for each Gaussian and the mixing probabilities ($\boldsymbol{\pi}$) are known then it is easy to compute the likelihood for each data point and build a belief about which cluster it belongs to. Similarly, if the cluster responsibilities are known then it is easy to fit the parameters of each Gaussian. This can be achieved for each cluster by fitting a Gaussian to the entire data set where each point has been weighted by the cluster's responsibility for that point.
How does the EM algorithm work? Instead of solving the problem all at once, the EM algorithm uses an iterative approach that alternates between two steps: an Expectation step and a Maximisation step.
The steps can be summarised as follows:
1. Initialise all model parameters, $\theta$. These can be initialised completely randomly, or by running K-means first to get a starting point.
2. Expectation Step: using the current $\theta$, compute the responsibilities, $\gamma(z_{nk})$, for each data point, $x_n$.
3. Maximisation Step: using the responsibilities, $\gamma(z_{nk})$, re-estimate the model parameters $\theta$.
4. Repeat steps 2 and 3 until the model parameters and cluster responsibilities converge (i.e. stop when the change between consecutive iterations is below some threshold).
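For reference, the standard GMM update equations used in these two steps (see Bishop, section 9.2) are given below, where $\gamma(z_{nk})$ denotes the responsibility of cluster $k$ for data point $x_n$ and $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.

$$\text{E-step:}\qquad \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

$$\text{M-step:}\qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\top}, \qquad \pi_k^{\text{new}} = \frac{N_k}{N}$$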
For more information on EM, see Bishop section 9.2 - 9.4.
3.1) Performing EM on our example data
Before we look at an implementation of EM, the code below first defines some useful plotting functions. The plot_GMM_density function plots the density of the Gaussian mixture using the estimated parameters, and the plot_GMM_mixture_components function plots each of the mixture components' densities separately.
The code in the cell below performs the EM algorithm for our data using the fit function within scikit-learn's GaussianMixture implementation. Run the code and compare the estimated parameters at each iteration with their known values. Note that when creating the GaussianMixture object we set max_iter=10 so that each iteration of the while True loop below does 10 iterations of the EM algorithm each time the fit method is called. We also set warm_start to True so that each call to fit (apart from the first) starts from the values produced by the previous call to fit. If you want to see the individual EM iterations, alter the code below to use max_iter=1 (but that will produce a lot of plots!).
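The cell itself is not reproduced here, but its core might look roughly like the following minimal sketch (the plotting calls and stopping condition are simplified and may differ from the actual lab code):

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, max_iter=10, warm_start=True)

while True:
    gmm.fit(X)  # runs up to max_iter EM iterations, continuing from the previous call
    print(gmm.weights_)  # estimated mixing probabilities
    print(gmm.means_)    # estimated component means
    # plot_GMM_density(...) / plot_GMM_mixture_components(...) could be called here.
    if gmm.converged_:
        break
```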
Uncomment the different plotting functions and observe how the parameter estimates change after each iteration.
3.2) Using the model to generate new data
We now have estimates for the parameters of our data distribution that we can use for a number of purposes. Example use cases include generating new samples or filling in missing data. Remember that in practice we will almost never have access to the underlying distribution of our data. These are some of the many advantages of EM over something like K-means.
Use the sample function to generate 10,000 new samples and create a scatter plot of the result:
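A minimal sketch, assuming the fitted model from the previous section is stored in a variable called gmm:

```python
import matplotlib.pyplot as plt

# sample() returns the generated points and the index of the component each came from.
new_X, new_labels = gmm.sample(10000)

plt.scatter(new_X[:, 0], new_X[:, 1], c=new_labels, s=2)
plt.show()
```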
4) Extension Exercise: Colour Quantisation
In the previous sections we have seen how to apply the K-means and EM algorithms to some simple toy data. In this section we will look at how these techniques can be applied to real data to perform Vector Quantisation (specifically, Colour Quantisation). In computer vision applications, colour quantisation is the task of reducing the colour palette of an image to a fixed number of colours.
The code below loads in an example image for use in this exercise but feel free to use your own image.
Using scikit-learn's K-means and EM implementations, apply both algorithms, with a value of K of your choosing, to the image data such that each pixel location is treated as an individual 3-dimensional data point containing the RGB values. Perform colour quantisation on the image by replacing the original colour values at each pixel location with the colour represented by its corresponding cluster centroid (K-means) or distribution mean (EM). Finally, display the quantised images obtained from each method using plt.imshow.
To display the quantised image correctly you may need to cast the numpy array to an array of type np.uint8. This can be achieved for a given matrix, image_data, by using image_data.astype(np.uint8).
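One possible sketch of the K-means half of this exercise is shown below; it assumes the loaded image is stored in a variable called image with shape (height, width, 3) and picks an arbitrary palette size of 8 colours. The EM version is analogous, using GaussianMixture with its predict method and means_ attribute.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

K = 8  # assumed number of colours in the reduced palette

# Flatten the image so that each pixel becomes a 3-dimensional (R, G, B) data point.
h, w, _ = image.shape
pixels = image.reshape(-1, 3).astype(np.float64)

kmeans = KMeans(n_clusters=K, n_init=10).fit(pixels)

# Replace every pixel with the centroid colour of its assigned cluster.
quantised = kmeans.cluster_centers_[kmeans.labels_].reshape(h, w, 3)

plt.imshow(quantised.astype(np.uint8))
plt.axis('off')
plt.show()
```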
Wrap up
This lab covered a number of topics so let's recap:
We first looked at mixture models and showed how they can be used to combine simple distributions into more complex ones. For the case of a Gaussian Mixture Model (GMM), we saw how to generate samples from the distribution using the idea of ancestral sampling.
Next, we implemented the key steps of the K-means algorithm and saw how this technique can be used to find clusters of data points within a dataset that are separated by hard boundaries.
We then looked at a different approach to clustering through the use of the Expectation Maximisation (EM) algorithm. We saw how this approach differs from K-means by modelling the individual cluster distributions to perform soft clustering.
Finally, we worked through an example use case of these methods by applying them to an image's pixel data to perform Colour Quantisation.
References
COMS30035 Machine Learning lecture notes.
Bishop, Pattern Recognition and Machine Learning: section 9.1 for the K-means algorithm and sections 9.2-9.4 for Gaussian Mixture Models and Expectation-Maximisation.