Copyright 2019 The TensorFlow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
Bayesian Modeling with Joint Distribution
JointDistributionSequential
is a newly introduced distribution-like class that lets users quickly prototype Bayesian models. It lets you chain multiple distributions together and use lambda functions to introduce dependencies. It is designed to build small- to medium-sized Bayesian models, including many commonly used ones such as GLMs, mixed-effect models, mixture models, and more. It enables all the necessary features of a Bayesian workflow (e.g., prior predictive sampling), and it can be plugged into a larger Bayesian graphical model or neural network. In this Colab, we will show some examples of how to use JointDistributionSequential
to achieve your day-to-day Bayesian workflow.
Dependencies & Prerequisites
Make things Fast!
Before we dive in, let's make sure we're using a GPU for this demo.
To do this, select "Runtime" -> "Change runtime type" -> "Hardware accelerator" -> "GPU".
The following snippet will verify that we have access to a GPU.
Note: if for some reason you cannot access a GPU, this colab will still work. (Training will just take longer.)
JointDistribution
Note: This distribution class is useful when you just have a simple model. "Simple" means chain-like graphs, although the approach technically also works for any PGM with degree at most 255 for a single node (because Python functions can have at most that many arguments).
The basic idea is to have the user specify a list of callable
s which produce tfp.Distribution
instances, one for every vertex in their PGM. The callable
will have at most as many arguments as its index in the list. (For user convenience, arguments will be passed in reverse order of creation.) Internally we'll "walk the graph" simply by passing every previous RV's value into each callable. In so doing we implement the [chain rule of probability](https://en.wikipedia.org/wiki/Chain_rule_(probability)#More_than_two_random_variables): $p(x_d, \dots, x_0) = p(x_d \mid x_{d-1}, \dots, x_0) \cdot p(x_{d-1} \mid x_{d-2}, \dots, x_0) \cdots p(x_1 \mid x_0) \cdot p(x_0)$.
The idea is pretty simple, even as Python code. Here's the gist:
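Below is a minimal sketch of that walk. This is illustrative only, not the actual TFP implementation; the helper name `sample_joint` and the argument-counting trick are assumptions for the sake of the example.

```python
import tensorflow_probability as tfp
tfd = tfp.distributions

def sample_joint(model, seed=None):
  """model: a list of tfd.Distribution instances and/or callables returning one."""
  samples = []
  for d in model:
    if not isinstance(d, tfd.Distribution):
      # A callable sees the previously sampled values, most recently created first.
      num_args = d.__code__.co_argcount
      d = d(*samples[::-1][:num_args])
    samples.append(d.sample(seed=seed))
  return samples
```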
You can find more information from the docstring of JointDistributionSequential
, but the gist is that you pass a list of distributions to initialize the class; if a distribution in the list depends on the output of an upstream distribution/variable, you just wrap it with a lambda function. Now let's see how it works in action!
(Robust) Linear regression
From PyMC3 doc GLM: Robust Regression with Outlier Detection
Conventional OLS Model
Now, let's set up a linear model, a simple intercept + slope regression problem:
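For instance, a sketch of such a model might look like the following; the predictor `X` here is a hypothetical stand-in for the notebook's observed covariate:

```python
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

# Hypothetical predictor standing in for the observed covariate.
X = np.linspace(0., 10., 50).astype(np.float32)

mdl_ols = tfd.JointDistributionSequential([
    # b0 ~ Normal(0, 1): intercept
    tfd.Normal(loc=0., scale=1.),
    # b1 ~ Normal(0, 1): slope
    tfd.Normal(loc=0., scale=1.),
    # y ~ Normal(b0 + b1 * X, 1): likelihood
    # Lambda arguments arrive most-recently-defined first: (b1, b0).
    lambda b1, b0: tfd.Normal(loc=b0 + b1 * X, scale=1.),
])
```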
You can then check the graph of the model to see the dependencies. Note that x
is reserved as the name of the last node, and you cannot use it as a lambda argument in your JointDistributionSequential model.
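With the sketch above, for example, you could inspect the dependencies with `resolve_graph` (the exact formatting of its output may differ):

```python
# Node names are taken from the lambda argument names; the final,
# never-referenced node gets the reserved name `x`.
mdl_ols.resolve_graph()
# e.g. (('b0', ()), ('b1', ()), ('x', ('b1', 'b0')))
```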
Sampling from the model is quite straightforward:
...which gives a list of tf.Tensor. You can immediately plug it into the log_prob function to compute the log_prob of the model:
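Continuing the sketch above:

```python
sample_ = mdl_ols.sample()      # a list of tf.Tensor: [b0, b1, y]
lp = mdl_ols.log_prob(sample_)
print(lp.shape)                 # (50,) in this sketch -- not a scalar!
```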
Hmmm, something is not right here: we should be getting a scalar log_prob! In fact, we can further check to see if something is off by calling the .log_prob_parts
, which gives the log_prob
of each node in the graphical model:
...it turns out the last node is not reduce_sum-ed along the i.i.d. dimension/axis! When we take the sum, the first two variables are thus incorrectly broadcast.
The trick here is to use tfd.Independent
to reinterpret the batch shape (so that the remaining axes will be reduced correctly):
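Continuing the sketch (reusing the hypothetical `X` from above), the fix could look like this:

```python
mdl_ols_fixed = tfd.JointDistributionSequential([
    tfd.Normal(loc=0., scale=1.),
    tfd.Normal(loc=0., scale=1.),
    # Independent treats the trailing i.i.d. axis as part of the event,
    # so log_prob reduce_sums over it.
    lambda b1, b0: tfd.Independent(
        tfd.Normal(loc=b0 + b1 * X, scale=1.),
        reinterpreted_batch_ndims=1),
])

print(mdl_ols_fixed.log_prob(mdl_ols_fixed.sample()).shape)  # () -- a scalar
```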
Now, let's check the last node/distribution of the model; you can see that the event shape is now correctly interpreted. Note that it might take a bit of trial and error to get the reinterpreted_batch_ndims
right, but you can always easily print the distribution or sampled tensor to double check the shape!
Other JointDistribution* API
MLE
And we can now do inference! You can use an optimizer to find the maximum likelihood estimate.
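A sketch of what this could look like with tfp.optimizer.lbfgs_minimize, conditioning the joint on hypothetical observations `y_obs` (since the priors are included in log_prob, this is strictly a MAP estimate; the notebook's actual cells may differ):

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

X = np.linspace(0., 10., 50).astype(np.float32)
y_obs = (1. + 2. * X + np.random.randn(50)).astype(np.float32)  # hypothetical data

mdl_ols_fixed = tfd.JointDistributionSequential([
    tfd.Normal(loc=0., scale=1.),
    tfd.Normal(loc=0., scale=1.),
    lambda b1, b0: tfd.Independent(
        tfd.Normal(loc=b0 + b1 * X, scale=1.), reinterpreted_batch_ndims=1),
])

def neg_log_prob_and_grad(params):
  # params is a length-2 vector [b0, b1]; condition the joint on y_obs.
  return tfp.math.value_and_gradient(
      lambda p: -mdl_ols_fixed.log_prob([p[0], p[1], y_obs]), params)

opt = tfp.optimizer.lbfgs_minimize(
    neg_log_prob_and_grad, initial_position=tf.zeros(2), tolerance=1e-8)
print(opt.converged.numpy(), opt.position.numpy())  # point estimates for [b0, b1]
```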
Batched version of the model and MCMC
In Bayesian inference, we usually want to work with MCMC samples, since when the samples are from the posterior we can plug them into any function to compute expectations. However, the MCMC API requires us to write models that are batch friendly, and we can check that our model is actually not "batchable" by calling sample([...])
In this case, it is relatively straightforward, as we only have a linear function inside our model; expanding the shape should do the trick:
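Here is a sketch of what such shape expansion could look like for the intercept/slope model (reusing the hypothetical `X` and imports from above):

```python
mdl_ols_batch = tfd.JointDistributionSequential([
    # b0 ~ Normal(0, 1)
    tfd.Normal(loc=0., scale=1.),
    # b1 ~ Normal(0, 1)
    tfd.Normal(loc=0., scale=1.),
    # Expand b0, b1 so a leading batch (chain) dimension broadcasts against X.
    # Note: `[:, None]` assumes rank-1 inputs, so .sample() with scalar priors
    # fails -- see the tfd.Sample trick in the side notes below.
    lambda b1, b0: tfd.Independent(
        tfd.Normal(loc=b0[:, None] + b1[:, None] * X[None, :], scale=1.),
        reinterpreted_batch_ndims=1),
])

# Batched evaluation: 4 "chains" of (b0, b1) give 4 log-probs.
b0, b1, y = tf.zeros(4), tf.zeros(4), tf.zeros([4, 50])
print(mdl_ols_batch.log_prob([b0, b1, y]).shape)  # (4,)
```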
We can again sample and evaluate the log_prob_parts to do some checks:
Some side notes:
- We want to work with the batch version of the model because it is the fastest for multi-chain MCMC. In cases where you cannot rewrite the model as a batched version (e.g., ODE models), you can map the log_prob function using tf.map_fn to achieve the same effect.
- Now mdl_ols_batch.sample() might not work, as we have scalar priors and we cannot do scalar_tensor[:, None]. The solution here is to expand the scalar tensor to rank 1 by wrapping it with tfd.Sample(..., sample_shape=1).
- It is good practice to write the model as a function so that you can change setups such as hyperparameters much more easily.
MCMC using the No-U-Turn Sampler
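A sketch of running NUTS against the batched model above (reusing `mdl_ols_batch` and the hypothetical `y_obs`; the notebook's actual step sizes, chain counts, and adaptation settings may differ):

```python
nchains = 4

# Condition the batched joint on the observed responses.
target_log_prob_fn = lambda b0, b1: mdl_ols_batch.log_prob([b0, b1, y_obs])

@tf.function(autograph=False)
def run_nuts(init_state):
  kernel = tfp.mcmc.NoUTurnSampler(target_log_prob_fn, step_size=0.1)
  return tfp.mcmc.sample_chain(
      num_results=500,
      num_burnin_steps=500,
      current_state=init_state,
      kernel=kernel,
      trace_fn=lambda _, pkr: pkr.log_accept_ratio)

states, log_accept_ratio = run_nuts([tf.zeros(nchains), tf.zeros(nchains)])
b0_samples, b1_samples = states  # each of shape [500, nchains]
```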
Student-T Method
Note that from now on we always work with the batch version of a model
Forward sample (prior predictive sampling)
MLE
MCMC
Hierarchical Partial Pooling
From PyMC3 baseball data for 18 players from Efron and Morris (1975)
Forward sample (prior predictive sampling)
Again, notice how if you don't use Independent you will end up with a log_prob that has the wrong batch_shape.
MLE
A pretty amazing feature of tfp.optimizer is that you can optimize in parallel for k batches of starting points and specify the stopping_condition kwarg: you can set it to tfp.optimizer.converged_all to see whether they all find the same minimum, or tfp.optimizer.converged_any to find a local solution fast (see the sketch below).
LBFGS did not converge.
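Here is a self-contained sketch of the batched L-BFGS pattern described above, using a simple quadratic as a stand-in for a model's negative log-density:

```python
import tensorflow as tf
import tensorflow_probability as tfp

k, ndims = 10, 2  # k parallel starting points

def loss_and_grad(params):            # params: [k, ndims]
  return tfp.math.value_and_gradient(
      lambda p: tf.reduce_sum((p - 3.)**2, axis=-1), params)

opt = tfp.optimizer.lbfgs_minimize(
    loss_and_grad,
    initial_position=tf.random.normal([k, ndims]),
    # converged_all: stop only once every start has converged;
    # converged_any: stop as soon as any one has.
    stopping_condition=tfp.optimizer.converged_all,
    tolerance=1e-8)
print(opt.converged.numpy())    # per-start convergence flags
print(opt.position.numpy()[0])  # one of the solutions, here ~[3., 3.]
```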
MCMC
Mixed effect model (Radon)
The last model in the PyMC3 doc: A Primer on Bayesian Methods for Multilevel Modeling
Some changes to the priors (smaller scales, etc.)
For models with complex transformations, implementing them in a functional style makes writing and testing much easier. It also makes it much easier to programmatically generate a log_prob function conditioned on (mini-batches of) input data:
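A sketch of that pattern, with the simple intercept/slope model standing in for the radon model and hypothetical helper names `gen_model` / `gen_target_log_prob`:

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def gen_model(x):
  """Build the joint distribution for whatever (mini-batch of) covariates x."""
  return tfd.JointDistributionSequential([
      tfd.Normal(loc=0., scale=1.),                        # b0
      tfd.Normal(loc=0., scale=1.),                        # b1
      lambda b1, b0: tfd.Independent(
          tfd.Normal(loc=b0[..., tf.newaxis] + b1[..., tf.newaxis] * x,
                     scale=1.),
          reinterpreted_batch_ndims=1),
  ])

def gen_target_log_prob(x, y_observed):
  """Return a log_prob closure conditioned on one (mini-)batch of data."""
  mdl = gen_model(x)
  def target_log_prob(b0, b1):
    return mdl.log_prob([b0, b1, y_observed])
  return target_log_prob

# Conditioning on a different mini-batch just means rebuilding the closure.
```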
Variational Inference
One very powerful feature of JointDistribution*
is that you can easily generate an approximation for VI. For example, to do mean-field ADVI, you simply inspect the graph and replace all the unobserved distributions with a Normal distribution.
Meanfield ADVI
You can also use the experimental features in tensorflow_probability/python/experimental/vi to build the variational approximation; they use essentially the same logic as below (i.e., using JointDistribution to build the approximation), but with the approximation output in the original space instead of the unbounded space.
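A sketch of the mean-field replacement described above, applied to the two unobserved nodes of the batched OLS model (reusing `mdl_ols_batch` and the hypothetical `y_obs` from earlier; tfp.vi.fit_surrogate_posterior handles the ELBO optimization):

```python
tfb = tfp.bijectors

# Each unobserved node is replaced by an independent trainable Normal.
surrogate_posterior = tfd.JointDistributionSequential([
    tfd.Normal(loc=tf.Variable(0., name='q_b0_loc'),
               scale=tfp.util.TransformedVariable(1., tfb.Softplus(),
                                                  name='q_b0_scale')),
    tfd.Normal(loc=tf.Variable(0., name='q_b1_loc'),
               scale=tfp.util.TransformedVariable(1., tfb.Softplus(),
                                                  name='q_b1_scale')),
])

losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=lambda b0, b1: mdl_ols_batch.log_prob([b0, b1, y_obs]),
    surrogate_posterior=surrogate_posterior,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=200)
```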
FullRank ADVI
For full rank ADVI, we want to approximate the posterior with a multivariate Gaussian.
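A sketch of such a full-rank surrogate over the stacked (b0, b1) vector (again reusing `mdl_ols_batch` and `y_obs`; the trainable Cholesky factor is parameterized with a FillScaleTriL bijector):

```python
tfb = tfp.bijectors
ndims = 2

surrogate_posterior = tfd.MultivariateNormalTriL(
    loc=tf.Variable(tf.zeros(ndims)),
    scale_tril=tfp.util.TransformedVariable(tf.eye(ndims),
                                            bijector=tfb.FillScaleTriL()))

losses = tfp.vi.fit_surrogate_posterior(
    # The surrogate sample is a length-2 vector; unpack it into (b0, b1).
    target_log_prob_fn=lambda z: mdl_ols_batch.log_prob(
        [z[..., 0], z[..., 1], y_obs]),
    surrogate_posterior=surrogate_posterior,
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    num_steps=200)
```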
Beta-Bernoulli Mixture Model
A mixture model where multiple reviewers label some items, with unknown (true) latent labels.