Path: blob/master/notebooks/book2/inf/advi_beta_binom.ipynb
ADVI from scratch in JAX
Authors: karm-patel@, murphyk@
In this notebook we apply ADVI (automatic differentiation variational inference) to the beta-binomial model, using a normal distribution as the variational posterior. This involves a change of variables from the unconstrained $z \in \mathbb{R}$ to the constrained $\theta \in [0, 1]$.
Functions
Helper functions which will be used later
Dataset
Now we will create the dataset. We sample theta_true
(the probability of heads) from the prior distribution, which in this case is a Beta. Then we sample n_samples
coin tosses from the likelihood distribution, which in this case is a Bernoulli.
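As a minimal sketch of this step (the Beta(2, 2) hyperparameters and 50 tosses are illustrative values, not necessarily those used in the notebook):

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
a, b = 2.0, 2.0      # assumed prior hyperparameters
n_samples = 50       # assumed number of coin tosses

key, key_theta, key_tosses = jax.random.split(key, 3)
# Sample the true head probability from the Beta prior.
theta_true = jax.random.beta(key_theta, a, b)
# Sample n_samples coin tosses from the Bernoulli likelihood.
dataset = jax.random.bernoulli(key_tosses, theta_true, shape=(n_samples,)).astype(jnp.float32)
n_heads = dataset.sum()
n_tails = n_samples - n_heads
```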
Prior, Likelihood, and True Posterior
For the coin toss problem the posterior has a closed-form solution (the Beta prior is conjugate to the Bernoulli likelihood), so we can compare the distributions of the prior, likelihood, and exact posterior below.
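For reference, a sketch of the conjugate update, reusing the assumed hyperparameters and counts from the dataset sketch above:

```python
import jax.numpy as jnp
from jax.scipy.stats import beta as beta_dist

a, b = 2.0, 2.0            # assumed prior hyperparameters
n_heads, n_tails = 27, 23  # illustrative counts from the simulated tosses

# The Beta prior is conjugate to the Bernoulli likelihood, so the exact
# posterior is Beta(a + n_heads, b + n_tails).
a_post, b_post = a + n_heads, b + n_tails

thetas = jnp.linspace(0.01, 0.99, 99)
prior_pdf = beta_dist.pdf(thetas, a, b)
posterior_pdf = beta_dist.pdf(thetas, a_post, b_post)
```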
Optimizing the ELBO
In order to minimize the KL divergence between the variational distribution and the true posterior, we need to minimize the negative ELBO, as we describe below.
We start with the ELBO, which is given by
\begin{align}
L(\psi) &= E_{q(z|\psi)}\left[ \log p(\mathcal{D}|\theta) + \log p(\theta) + \log|J_{\boldsymbol{\sigma}}(z)| - \log q(z|\psi) \right]
\end{align}
where $\psi = (\text{loc}, \text{scale})$ are the variational parameters, $p(\mathcal{D}|\theta)$ is the likelihood, and the prior term on the unconstrained variable $z$ is given by the change of variables formula
\begin{align}
p_z(z) &= p_\theta(\boldsymbol{\sigma}(z)) \, |J_{\boldsymbol{\sigma}}(z)|
\end{align}
where $\theta = \boldsymbol{\sigma}(z)$ is the sigmoid mapping and $J_{\boldsymbol{\sigma}}(z)$ is the Jacobian of this mapping. We will use a Monte Carlo approximation of the expectation over $z$. We also apply the reparameterization trick to replace $z \sim q(z|\psi)$ with $z = \text{loc} + \text{scale} \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Putting it all together, our estimate of the negative ELBO (for a single sample of $z$) is
\begin{align}
-L(\psi; z) &= -\left( \log p(\mathcal{D}|\theta) + \log p(\theta) + \log|J_{\boldsymbol{\sigma}}(z)| \right) + \log q(z|\psi)
\end{align}
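A sketch of how this single-sample negative ELBO could be written in JAX, assuming a Beta(a, b) prior, a Bernoulli likelihood, and a log-scale parameterization for positivity; the function name and these choices are illustrative, not the notebook's exact code:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import beta as beta_dist
from jax.scipy.stats import norm, bernoulli

a, b = 2.0, 2.0  # assumed prior hyperparameters


def negative_elbo(params, eps, dataset):
    loc, log_scale = params
    scale = jnp.exp(log_scale)            # keep the scale positive
    z = loc + scale * eps                 # reparameterization trick
    theta = jax.nn.sigmoid(z)             # map R -> (0, 1)
    # log|J_sigma(z)| for the sigmoid is log(theta) + log(1 - theta),
    # written here in a numerically stable form.
    log_jacobian = jax.nn.log_sigmoid(z) + jax.nn.log_sigmoid(-z)
    log_lik = bernoulli.logpmf(dataset, theta).sum()   # dataset: array of 0/1 tosses
    log_prior = beta_dist.logpdf(theta, a, b)
    log_q = norm.logpdf(z, loc, scale)
    return -(log_lik + log_prior + log_jacobian) + log_q
```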
We now apply stochastic gradient descent to minimize the negative ELBO and optimize the variational parameters (loc
and scale
).
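A minimal sketch of such a loop, reusing the hypothetical negative_elbo and dataset from the sketches above and drawing one Monte Carlo sample of $\epsilon$ per step; the learning rate and step count are assumptions:

```python
import jax
import jax.numpy as jnp

learning_rate = 0.05   # assumed
num_steps = 500        # assumed

params = (jnp.array(0.0), jnp.array(0.0))   # initial loc and log-scale
loss_and_grad = jax.jit(jax.value_and_grad(negative_elbo))

key = jax.random.PRNGKey(42)
elbo_trace = []
for step in range(num_steps):
    key, subkey = jax.random.split(key)
    eps = jax.random.normal(subkey)
    loss, grads = loss_and_grad(params, eps, dataset)
    # Vanilla SGD update on (loc, log_scale).
    params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grads)
    elbo_trace.append(-float(loss))
```

A plain Python loop is used here for clarity; the same carry pattern can be written with jax.lax.scan (see the loopy-carry reference below).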
We now plot the ELBO
We can see that after about 200 iterations the ELBO has converged and is no longer changing much.
Samples using Optimized parameters
Now we take 1000 samples from the variational distribution (Normal) and map them to the constrained space of the posterior (Beta) by applying tranform_fn
(sigmoid) to the samples. Then we compare the density of the transformed samples with the exact posterior.
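A sketch of this step, assuming the optimized (loc, log_scale) parameters and the conjugate posterior from the earlier sketches:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import beta as beta_dist

key = jax.random.PRNGKey(7)
loc, log_scale = params                 # optimized variational parameters
scale = jnp.exp(log_scale)

# 1000 samples in the unconstrained space, then map to (0, 1) with the sigmoid.
z_samples = loc + scale * jax.random.normal(key, shape=(1000,))
theta_samples = jax.nn.sigmoid(z_samples)

# Exact posterior for comparison (conjugate Beta update from earlier).
a_post, b_post = a + n_heads, b + n_tails
grid = jnp.linspace(0.01, 0.99, 99)
exact_pdf = beta_dist.pdf(grid, a_post, b_post)
# theta_samples can now be compared to exact_pdf, e.g. with a histogram or KDE.
```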
We can see that the learned q(x)
is a reasonably good approximation to the true posterior. It appears to have support over negative values of theta, but this is an artefact of the KDE.
References:
ADVI paper: https://arxiv.org/abs/1603.00788
Blog: https://code-first-ml.github.io/book2/notebooks/introduction/variational.html
Github issue: https://github.com/pyro-ppl/pyro/issues/3016#:~:text=loc%3D'upper right')-,Bandwidth%20adjustment,-Another%20thing%20to
Blog: https://ericmjl.github.io/dl-workshop/02-jax-idioms/02-loopy-carry.html