Path: blob/master/notebooks/misc/dp_mixgauss_truncated.ipynb
Dirichlet Processes: forward sampling and posterior inference with the truncated stick-breaking construction.
Translated from
https://ericmjl.github.io/dl-workshop/04-gaussian-clustering/02-dirichlet-processes.html
and
https://ericmjl.github.io/dl-workshop/04-gaussian-clustering/03-dirichlet-process-clustering.html
The stick-breaking definition of the Dirichlet process (DP) with concentration parameter $\alpha$ is

$$\beta_k \sim \text{Beta}(1, \alpha), \qquad w_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i), \qquad k = 1, 2, \ldots$$

where the weights $w_1, w_2, \ldots$ define a random probability measure $G = \sum_{k=1}^{\infty} w_k \delta_{\theta_k}$ with atoms $\theta_k$ drawn i.i.d. from the base measure.
The stick-breaking definition of the DP shows that we can sample from a DP sequentially by drawing samples from the beta distribution. Since it is computationally intractable to handle the infinite sequence of weights $w_1, w_2, \ldots$ of a DP sample, in practice we truncate the sequence to a finite length during simulation.
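A minimal sketch of truncated stick-breaking sampling in JAX; the function name `dp_stick_breaking`, the truncation level `K`, and the renormalization step are our own choices rather than the notebook's exact code.

```python
import jax
import jax.numpy as jnp

def dp_stick_breaking(key, alpha, K):
    """Sample K truncated stick-breaking weights from a DP with concentration alpha."""
    betas = jax.random.beta(key, a=1.0, b=alpha, shape=(K,))
    # Stick length remaining before each break: prod_{i<k} (1 - beta_i).
    remaining = jnp.concatenate([jnp.ones(1), jnp.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    # Renormalize so the truncated weights sum to one.
    return weights / weights.sum()

key = jax.random.PRNGKey(0)
weights = dp_stick_breaking(key, alpha=3.0, K=50)
print(weights[:10], weights.sum())
```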
As is visible above, for this value of the concentration parameter most of the probability mass is concentrated in roughly the first 5-8 components.
We explore the effect of the concentration parameter on the DP by plotting samples from DPs with different values of the concentration parameter.
The plots above show that as the concentration parameter increases, the random probability measure sampled from the DP becomes less concentrated, which means:
- the probability mass allocated to the dominant components decreases;
- more components receive a non-negligible share of the probability mass.
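For concreteness, a small usage example reusing the `dp_stick_breaking` sketch above; the alpha values and the 95% coverage summary are illustrative choices of ours.

```python
import jax
import jax.numpy as jnp

for alpha in [0.5, 3.0, 20.0]:
    w = dp_stick_breaking(jax.random.PRNGKey(1), alpha=alpha, K=50)
    # Number of (sorted) components needed to cover 95% of the probability mass.
    n_eff = int(jnp.searchsorted(jnp.cumsum(jnp.sort(w)[::-1]), 0.95)) + 1
    print(f"alpha = {alpha:5.1f}: {n_eff} components hold 95% of the mass")
```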
The stick-breaking construction of the DP also enables us to evaluate the likelihood of a DP sample under the beta distribution. The first step is to reverse the stick-breaking procedure and recover the beta-distributed variables from the DP sample.
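A minimal sketch of this inverse transformation, assuming the truncated weights from the sampling sketch above; the function name `stick_breaking_inverse` is ours.

```python
import jax.numpy as jnp

def stick_breaking_inverse(weights):
    """Recover beta_k = w_k / (1 - sum_{i<k} w_i) from the stick-breaking weights."""
    cum_before = jnp.concatenate([jnp.zeros(1), jnp.cumsum(weights)[:-1]])
    return weights / (1.0 - cum_before)

betas_recovered = stick_breaking_inverse(weights)
```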
We compare the true $\beta$'s used in the stick-breaking construction to the $\beta$'s inferred with the inverse transformation. Notice that the inverse transformation is not exact, because we truncated and renormalized the sequence of weights when sampling from the DP with the stick-breaking procedure, and in practice we do not know the original accumulated probability given only the DP sample.
We can then evaluate the likelihood of the DP sample as the joint likelihood of the $\beta$'s obtained with the inverse transformation.
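A sketch of this joint likelihood under $\text{Beta}(1, \alpha)$, using the inverse transformation above; the clipping guard against the endpoint $\beta = 1$ (an artifact of the renormalization) is our own choice.

```python
import jax.numpy as jnp
from jax.scipy.stats import beta as beta_dist

def dp_weights_log_likelihood(weights, alpha):
    """Joint log likelihood of the recovered betas under Beta(1, alpha)."""
    betas = stick_breaking_inverse(weights)
    # Guard against beta = 1 at the truncation point, where the logpdf can be -inf.
    betas = jnp.clip(betas, 1e-6, 1.0 - 1e-6)
    return jnp.sum(beta_dist.logpdf(betas, a=1.0, b=alpha))
```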
Given the likelihood function, we can obtain the maximum likelihood estimate (MLE) of the concentration parameter of the DP. If we define the loss function for the concentration parameter as the negative log-likelihood, maximizing the likelihood is equivalent to minimizing this loss function.
With JAX, it is straightforward to take a gradient-descent approach in search of the MLE, using the `grad` function to compute the gradient automatically.
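A minimal gradient-descent sketch, assuming the `dp_weights_log_likelihood` function and the `weights` sample from the earlier sketches; the learning rate, step count, and log-parameterization of $\alpha$ are illustrative.

```python
import jax
import jax.numpy as jnp

def loss(log_alpha, weights):
    # Work with log(alpha) so the concentration parameter stays positive.
    return -dp_weights_log_likelihood(weights, jnp.exp(log_alpha))

grad_fn = jax.grad(loss)

log_alpha = jnp.array(0.0)                 # initial guess: alpha = 1
for _ in range(200):
    log_alpha = log_alpha - 0.01 * grad_fn(log_alpha, weights)
print("MLE of alpha:", jnp.exp(log_alpha))
```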
Gaussian mixture models are widely used for density approximation and cluster analysis. However, in practice the number of components of the mixture distribution is usually unknown.
In a low-dimensional space (e.g. 1-d), we can roughly read off the number of components in the mixture distribution from a histogram of the samples:
However, even in 1-d it is not always easy to read off the number of modes of a mixture distribution from a histogram: the information becomes blurred as we increase the number of bins.
When the number of clusters is not clear, we can turn to DP mixture models, which do not impose an exact number of clusters but instead control the number of clusters probabilistically through a single concentration parameter.
Here we fit a DP Gaussian mixture model (DP-GMM) to the above data and learn
- the concentration parameter of the DP,
- the optimal relative weighting of the components, conditioned on the concentration parameter, and
- the distribution parameters of each component, conditioned on the data.
We define the Bayesian model and seek the maximum a posteriori (MAP) estimate of these parameters, which is equivalent to maximizing the joint likelihood of the parameters and the data. Since we do not assign a prior distribution to the concentration parameter of the DP, its estimation can be seen as an empirical Bayes approach.
To fit the DP mixture model with gradient descent, we first define the joint log likelihood of the DP mixture model, whose negative is the loss function to be optimized. The joint log likelihood is the sum of the log likelihood of the data under the Gaussian mixture distribution and the log likelihood of the mixture-component weights. We defined the latter using the beta distribution in the earlier discussion; here we define the function that computes the log likelihood of the data under the Gaussian mixture distribution.
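A sketch of this data term, computed with a log-sum-exp over components; the 1-d parameterization (`mus`, `log_sigmas`) is our own choice.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp
from jax.scipy.stats import norm

def gmm_data_log_likelihood(data, log_weights, mus, log_sigmas):
    """Log likelihood of 1-d data under a Gaussian mixture."""
    # component_logp[n, k] = log w_k + log N(x_n | mu_k, sigma_k)
    component_logp = log_weights[None, :] + norm.logpdf(
        data[:, None], loc=mus[None, :], scale=jnp.exp(log_sigmas)[None, :]
    )
    return jnp.sum(logsumexp(component_logp, axis=1))
```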
We can now start fitting the DP mixture model.
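A rough end-to-end fitting sketch built from the pieces defined above; the sigmoid parameterization of the betas, the synthetic placeholder data, the optimizer settings, and the truncation level are all our own simplifications rather than the notebook's exact setup.

```python
import jax
import jax.numpy as jnp

def joint_loss(params, data):
    log_alpha, beta_logits, mus, log_sigmas = params
    # Map unconstrained logits to betas in (0, 1), then to mixture weights.
    betas = jax.nn.sigmoid(beta_logits)
    remaining = jnp.concatenate([jnp.ones(1), jnp.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    weights = weights / weights.sum()
    data_ll = gmm_data_log_likelihood(data, jnp.log(weights), mus, log_sigmas)
    weights_ll = dp_weights_log_likelihood(weights, jnp.exp(log_alpha))
    return -(data_ll + weights_ll)

K = 25
params = (jnp.array(0.0), jnp.zeros(K), jnp.linspace(-5.0, 5.0, K), jnp.zeros(K))
data = jax.random.normal(jax.random.PRNGKey(2), (500,))    # placeholder data
grad_fn = jax.jit(jax.grad(joint_loss))
for _ in range(500):
    grads = grad_fn(params, data)
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```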