Copyright 2018 The TensorFlow Probability Authors.
Licensed under the Apache License, Version 2.0 (the "License");
Fitting Dirichlet Process Mixture Model Using Preconditioned Stochastic Gradient Langevin Dynamics
In this notebook, we demonstrate how to cluster a large number of samples and infer the number of clusters simultaneously by fitting a Dirichlet Process Mixture of Gaussian distributions. We use Preconditioned Stochastic Gradient Langevin Dynamics (pSGLD) for inference.
Table of contents
Samples
Model
Optimization
Visualize the result
4.1. Clustered result
4.2. Visualize uncertainty
4.3. Mean and scale of selected mixture component
4.4. Mixture weight of each mixture component
4.5. Convergence of $\alpha$
4.6. Inferred number of clusters over iterations
4.7. Fitting the model using RMSProp
Conclusion
1. Samples
First, we set up a toy dataset. We generate 50,000 random samples from three bivariate Gaussian distributions.
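The snippet below is a minimal sketch of such a toy dataset; the specific cluster means and scales are illustrative assumptions rather than the exact values used in the notebook.

```python
import numpy as np

num_samples = 50000  # total number of samples
dims = 2             # bivariate observations

# Assumed ground-truth parameters of the three Gaussians (illustrative values).
true_loc = np.array([[-4., -4.], [0., 4.], [4., -2.]], dtype=np.float32)
true_scale = np.array([[1., 1.], [1., 1.], [1., 1.]], dtype=np.float32)

rng = np.random.RandomState(seed=42)
true_label = rng.randint(0, len(true_loc), size=num_samples)  # which Gaussian each sample comes from
observations = rng.normal(loc=true_loc[true_label],
                          scale=true_scale[true_label]).astype(np.float32)
print(observations.shape)  # (50000, 2)
```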
2. Model
Here, we define a Dirichlet Process Mixture of Gaussian distributions with a symmetric Dirichlet prior. Throughout the notebook, vector quantities are written in bold. Over $N$ samples $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N$, the model with a mixture of $K$ Gaussian distributions is formulated as follows:

$$p(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N) = \prod_{i=1}^{N} \mathrm{GMM}(\boldsymbol{x}_i), \quad \text{with}\;\; \mathrm{GMM}(\boldsymbol{x}_i) = \sum_{k=1}^{K} \pi_k\, \mathrm{Normal}(\boldsymbol{x}_i \mid \mathrm{loc}=\boldsymbol{\mu}_k,\ \mathrm{scale}=\boldsymbol{\sigma}_k)$$

where:

$$\begin{align*}
\boldsymbol{x}_i &\sim \mathrm{Normal}(\mathrm{loc}=\boldsymbol{\mu}_{z_i},\ \mathrm{scale}=\boldsymbol{\sigma}_{z_i}) \\
z_i &\sim \mathrm{Categorical}(\mathrm{prob}=\boldsymbol{\pi}), \quad \text{with}\;\; \boldsymbol{\pi} = \{\pi_1, \ldots, \pi_K\} \\
\boldsymbol{\pi} &\sim \mathrm{Dirichlet}\bigl(\mathrm{concentration}=\{\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\}\bigr)
\end{align*}$$

Our goal is to assign each $\boldsymbol{x}_i$ to the $j$th cluster through $z_i$, which represents the inferred index of a cluster.

For an ideal Dirichlet Process Mixture Model, $K$ is set to $\infty$. However, it is known that one can approximate a Dirichlet Process Mixture Model with a sufficiently large $K$. Note that although we arbitrarily set an initial value of $K$, an optimal number of clusters is also inferred through optimization, unlike a simple Gaussian Mixture Model.
In this notebook, we use a bivariate Gaussian distribution as the mixture component and set $K$ to 30.
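To make this structure concrete, below is a minimal TensorFlow Probability sketch of the generative model with $K$ truncated components. The priors placed on `alpha`, the component means, and the component scales here are illustrative assumptions for the sketch, not necessarily the ones used in the notebook's own code cells.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

K, dims = 30, 2  # truncation level and dimensionality of each sample

# Assumed priors for this sketch.
alpha = tfd.InverseGamma(concentration=1., scale=1.).sample()              # concentration of the Dirichlet prior
mix_probs = tfd.Dirichlet(concentration=tf.fill([K], alpha / K)).sample()  # symmetric Dirichlet prior on weights
loc = tfd.Normal(loc=0., scale=1.).sample([K, dims])                       # assumed prior on component means
scale_diag = tfd.InverseGamma(concentration=1., scale=1.).sample([K, dims])  # assumed prior on component scales

# Mixture likelihood: x_i ~ Normal(mu_{z_i}, sigma_{z_i}) with z_i ~ Categorical(pi).
gmm = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=mix_probs),
    components_distribution=tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag))

print(gmm.sample(5).shape)  # a few draws from the prior predictive: (5, 2)
```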
3. Optimization
We optimize the model with Preconditioned Stochastic Gradient Langevin Dynamics (pSGLD), which enables us to optimize a model over a large number of samples in a mini-batch gradient descent manner.
To update the parameters $\boldsymbol{\theta}$ at the $t$th iteration with mini-batch size $M$, the update $\Delta\boldsymbol{\theta}_t$ is sampled as:

$$\begin{align*}
\Delta \boldsymbol{\theta}_t &\sim \frac{\epsilon_t}{2} \Bigl[ G(\boldsymbol{\theta}_t) \Bigl( \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta}_t) + \frac{N}{M} \sum_{k=1}^{M} \nabla_{\boldsymbol{\theta}} \log \mathrm{GMM}(\boldsymbol{x}_{t_k}) \Bigr) + \sum_{\boldsymbol{\theta}} \nabla_{\boldsymbol{\theta}} G(\boldsymbol{\theta}_t) \Bigr] \\
&\quad + G^{\frac{1}{2}}(\boldsymbol{\theta}_t)\, \mathrm{Normal}\bigl(\mathrm{loc}=\boldsymbol{0},\ \mathrm{scale}=\sqrt{\epsilon_t}\,\boldsymbol{1}\bigr)
\end{align*}$$

In the above equation, $\epsilon_t$ is the learning rate at the $t$th iteration, $\log p(\boldsymbol{\theta}_t)$ is the sum of the log prior distributions of $\boldsymbol{\theta}$, and $\boldsymbol{x}_{t_1}, \ldots, \boldsymbol{x}_{t_M}$ are the samples in the mini-batch. $G(\boldsymbol{\theta}_t)$ is a preconditioner that adjusts the scale of the gradient of each parameter.
We will use the joint log probability of the likelihood and the prior probabilities as the loss function for pSGLD.
Note that, as specified in the API of pSGLD, we need to divide the sum of the prior probabilities by the sample size $N$.
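As a rough sketch of how this fits together, the loss below combines the mean log-likelihood over a mini-batch with the log priors scaled by $1/N$, and feeds its gradients to `tfp.optimizer.StochasticGradientLangevinDynamics` (TFP's RMSProp-preconditioned SGLD). The hyperparameter values, the unconstrained softmax/softplus parameterization, and the priors are assumptions for this sketch (Jacobian corrections for the reparameterization are omitted for brevity), and the optimizer's constructor arguments can differ across TF/TFP versions.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

N, K, dims = 50000, 30, 2  # data size, truncation level, dimensionality

# Unconstrained trainable variables, mapped to valid parameter values below.
mix_logits = tf.Variable(tf.zeros([K]))
loc = tf.Variable(tf.random.uniform([K, dims], -5., 5.))
raw_scale = tf.Variable(tf.zeros([K, dims]))
raw_alpha = tf.Variable(0.)
variables = [mix_logits, loc, raw_scale, raw_alpha]

# Preconditioned SGLD optimizer; hyperparameter values are placeholders.
optimizer = tfp.optimizer.StochasticGradientLangevinDynamics(
    learning_rate=1e-3, preconditioner_decay_rate=0.99,
    burnin=1500, data_size=N)

def negative_log_posterior(x_batch):
  mix_probs = tf.nn.softmax(mix_logits)
  scale_diag = tf.nn.softplus(raw_scale)
  alpha = tf.nn.softplus(raw_alpha)
  # Sum of the log priors divided by the total sample size N (as the pSGLD API
  # expects); the priors themselves are the assumed ones from the sketch above.
  log_prior = (tfd.InverseGamma(concentration=1., scale=1.).log_prob(alpha)
               + tfd.Dirichlet(tf.fill([K], alpha / K)).log_prob(mix_probs)
               + tf.reduce_sum(tfd.Normal(0., 1.).log_prob(loc))
               + tf.reduce_sum(
                   tfd.InverseGamma(concentration=1., scale=1.).log_prob(scale_diag))) / N
  gmm = tfd.MixtureSameFamily(
      mixture_distribution=tfd.Categorical(probs=mix_probs),
      components_distribution=tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag))
  # Mean log-likelihood over the mini-batch.
  return -(log_prior + tf.reduce_mean(gmm.log_prob(x_batch)))

@tf.function
def train_step(x_batch):
  with tf.GradientTape() as tape:
    loss = negative_log_posterior(x_batch)
  optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
  return loss
```

A training loop would then repeatedly call `train_step` on random mini-batches of, say, 500 observations drawn from the dataset.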
4. Visualize the result
4.1. Clustered result
First, we visualize the result of clustering.
To assign each sample $\boldsymbol{x}_i$ to a cluster $j$, we calculate the posterior of $z_i$ as:

$$j = \underset{z_i}{\arg\max}\; p(z_i \mid \boldsymbol{x}_i,\, \boldsymbol{\theta})$$
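A minimal sketch of this assignment step is below, assuming `mix_probs`, `loc`, and `scale_diag` hold the fitted parameter values from the training sketch above and `observations` is the `(N, 2)` sample array.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def assign_clusters(observations, mix_probs, loc, scale_diag):
  components = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
  # log pi_k + log Normal(x_i | mu_k, sigma_k) for every sample/component pair, shape (N, K).
  log_unnormalized = (tf.math.log(mix_probs)
                      + components.log_prob(observations[:, tf.newaxis, :]))
  posterior = tf.nn.softmax(log_unnormalized, axis=-1)  # p(z_i | x_i, theta)
  return tf.argmax(posterior, axis=-1), posterior       # hard assignment j and full posterior
```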
We can see that a roughly equal number of samples is assigned to each of the appropriate clusters, and that the model has also inferred the correct number of clusters.
4.2. Visualize uncertainty
Here, we look at the uncertainty of the clustering result by visualizing it for each sample.
We quantify uncertainty using the entropy of the cluster-assignment distribution $p(z_i \mid \boldsymbol{x}_i, \boldsymbol{\theta})$.
In pSGLD, we treat the value of a training parameter at each iteration as a sample from its posterior distribution. Thus, we calculate the entropy over the parameter values from a number of iterations, and the final uncertainty value for each sample is obtained by averaging the entropies of its cluster assignments.
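A sketch of this computation, assuming `posteriors` is an array of shape `(L, N, K)` holding $p(z_i \mid \boldsymbol{x}_i, \boldsymbol{\theta}_l)$ for $L$ parameter samples kept from the last pSGLD iterations (e.g., collected with `assign_clusters` above):

```python
import numpy as np

def assignment_uncertainty(posteriors, eps=1e-12):
  # Entropy of each cluster-assignment distribution, shape (L, N).
  entropy = -np.sum(posteriors * np.log(posteriors + eps), axis=-1)
  # Average over the L posterior samples: one uncertainty value per data point.
  return entropy.mean(axis=0)
```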
In the above graph, lower luminance represents higher uncertainty. We can see that the samples near the boundaries of the clusters have especially high uncertainty, which matches the intuition that those samples are difficult to cluster.
4.3. Mean and scale of selected mixture component
Next, we look at the selected clusters' $\boldsymbol{\mu}_k$ and $\boldsymbol{\sigma}_k$.
Again, the $\boldsymbol{\mu}_k$ and $\boldsymbol{\sigma}_k$ are close to the ground truth.
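For instance, one can print the fitted location and scale of the components that carry non-negligible weight; the 1% threshold is an arbitrary choice for this sketch, and `mix_probs`, `loc`, and `scale_diag` are assumed to be the fitted values from above.

```python
import numpy as np

significant = np.where(np.asarray(mix_probs) > 0.01)[0]
for k in significant:
  print('component', k, 'loc:', np.asarray(loc)[k], 'scale:', np.asarray(scale_diag)[k])
```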
4.4. Mixture weight of each mixture component
We also look at inferred mixture weights.
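A quick way to inspect them is a bar plot of the final weight vector (again assuming `mix_probs` holds the fitted values):

```python
import numpy as np
import matplotlib.pyplot as plt

weights = np.asarray(mix_probs)
plt.bar(range(len(weights)), weights)
plt.xlabel('mixture component')
plt.ylabel('mixture weight')
plt.show()
```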
We see that only a few (three) mixture components have significant weights, and the rest are close to zero. This also shows that the model successfully inferred the correct number of mixture components that constitute the distribution of the samples.
4.5. Convergence of $\alpha$
We look at the convergence of the Dirichlet distribution's concentration parameter $\alpha$.
Considering that a smaller $\alpha$ results in a smaller expected number of clusters in a Dirichlet mixture model, the model seems to be learning the optimal number of clusters over the iterations.
4.6. Inferred number of clusters over iterations
We visualize how the inferred number of clusters changes over iterations.
To do so, we infer the number of clusters at each iteration.
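One simple way to do this, sketched below, is to count at each iteration the components whose mixture weight exceeds a small threshold; `mix_probs_history` is assumed to be an array of shape `(num_iterations, K)` recorded during training, and the 1% threshold is an arbitrary choice.

```python
import numpy as np

def inferred_cluster_counts(mix_probs_history, threshold=0.01):
  # Number of "effective" components per recorded iteration.
  return np.sum(np.asarray(mix_probs_history) > threshold, axis=-1)
```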
Over the iterations, the number of clusters gets closer to three. Together with the convergence of $\alpha$ to a smaller value over the iterations, we can see that the model is successfully learning the parameters needed to infer an optimal number of clusters.
Interestingly, the inferred number of clusters converged to the correct value in the early iterations, unlike $\alpha$, which converged much later.
4.7. Fitting the model using RMSProp
In this section, to see the effectiveness of the Monte Carlo sampling scheme of pSGLD, we use RMSProp to fit the model. We choose RMSProp for comparison because it comes without a sampling scheme and pSGLD's preconditioner is based on RMSProp.
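A sketch of the swap, reusing `negative_log_posterior` and `variables` from the training sketch above; the learning rate is a placeholder.

```python
import tensorflow as tf

rmsprop = tf.keras.optimizers.RMSprop(learning_rate=1e-3)

@tf.function
def rmsprop_step(x_batch):
  with tf.GradientTape() as tape:
    loss = negative_log_posterior(x_batch)  # same loss as for pSGLD
  rmsprop.apply_gradients(zip(tape.gradient(loss, variables), variables))
  return loss
```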
Compared to pSGLD, although the number of iterations for RMSProp is larger, optimization by RMSProp is much faster.
Next, we look at the clustering result.
The number of clusters was not correctly inferred by the RMSProp optimization in our experiment. We also look at the mixture weights.
We can see that an incorrect number of components have significant mixture weights.
Although the optimization takes longer, pSGLD, which uses a Monte Carlo sampling scheme, performed better in our experiment.
5. Conclusion
In this notebook, we have described how to cluster a large number of samples and infer the number of clusters simultaneously by fitting a Dirichlet Process Mixture of Gaussian distributions using pSGLD.
The experiment showed that the model successfully clustered the samples and inferred the correct number of clusters. We also showed that the Monte Carlo sampling scheme of pSGLD allows us to visualize the uncertainty in the result. Beyond clustering the samples, the model could also infer the correct parameters of the mixture components. On the relationship between the parameters and the number of inferred clusters, we investigated how the model learns the parameter that controls the number of effective clusters by visualizing the correlation between the convergence of $\alpha$ and the number of inferred clusters. Lastly, we looked at the results of fitting the model using RMSProp. We saw that RMSProp, an optimizer without a Monte Carlo sampling scheme, works considerably faster than pSGLD but produces less accurate clustering.
Although the toy dataset had only 50,000 samples in two dimensions, the mini-batch optimization used here scales to much larger datasets.