You can order print and ebook versions of Think Bayes 2e from Bookshop.org and Amazon.
Grid algorithms for hierarchical models
Copyright 2021 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
It is widely believed that grid algorithms are only practical for models with 1-3 parameters, or maybe 4-5 if you are careful. I've said so myself.
But recently I used a grid algorithm to solve the emitter-detector problem, and along the way I noticed something about the structure of the problem: although the model has two parameters, the data only depend on one of them. That makes it possible to evaluate the likelihood function and update the model very efficiently.
Many hierarchical models have a similar structure: the data depend on a small number of parameters, which depend on a small number of hyperparameters. I wondered whether the same method would generalize to more complex models, and it does.
As an example, in this notebook I'll use a logitnormal-binomial hierarchical model to solve a problem with two hyperparameters and 13 parameters. The grid algorithm is not just practical; it's substantially faster than MCMC.
The following are some utility functions I'll use.
Heart Attack Data
The problem I'll solve is based on Chapter 10 of Probability and Bayesian Modeling; it uses data on death rates due to heart attack for patients treated at various hospitals in New York City.
We can use Pandas to read the data into a DataFrame.
The columns we need are Cases, which is the number of patients treated at each hospital, and Deaths, which is the number of those patients who died.
Solution with PyMC
Here's a hierarchical model that estimates the death rate for each hospital and simultaneously estimates the distribution of rates across hospitals.
To be fair, PyMC doesn't like this parameterization much (although I'm not sure why). One most runs, there are a moderate number of divergences. Even so, the results are good enough.
Here are the posterior distributions of the hyperparameters.
And we can extract the posterior distributions of the xs.
As an example, here's the posterior distribution of x for the first hospital.
The grid priors
Now let's solve the same problem using a grid algorithm. I'll use the same priors for the hyperparameters, approximated by a grid with about 100 elements in each dimension.
The following cells confirm that these priors are consistent with the prior samples from PyMC.
The joint distribution of hyperparameters
I'll use make_joint to make an array that represents the joint prior distribution of the hyperparameters.
Here's what it looks like.
Joint prior of hyperparameters and x
Now we're ready to lay out the grid for x, which is the proportion we'll estimate for each hospital.
For each pair of hyperparameters, we'll compute the distribution of x.
We can speed this up by computing skipping the terms that don't depend on x
The result is a 3-D array with axes for mu, sigma, and x.
Now we need to normalize each distribution of x.
To normalize, we have to use a safe version of divide where 0/0 is 0.
The result is an array that contains the distribution of x for each pair of hyperparameters.
Now, to get the prior distribution, we multiply through by the joint distribution of the hyperparameters.
The result is a 3-D array that represents the joint prior distribution of mu, sigma, and x.
To check that it is correct, I'll extract the marginal distributions and compare them to the priors.
We didn't compute the prior distribution of x explicitly; it follows from the distribution of the hyperparameters. But we can extract the prior marginal of x from the joint prior.
And compare it to the prior sample from PyMC.
The prior distribution of x I get from the grid is a bit different from what I get from PyMC. I'm not sure why, but it doesn't seem to affect the results much.
In addition to the marginals, we'll also find it useful to extract the joint marginal distribution of the hyperparameters.
The Update
The likelihood of the data only depends on x, so we can compute it like this.
And here's the update.
Serial updates
At this point we can do an update based on a single hospital, but how do we update based on all of the hospitals?
As a step toward the right answer, I'll start with a wrong answer, which is to do the updates one at a time.
After each update, we extract the posterior distribution of the hyperparameters and use it to create the prior for the next update.
At the end, the posterior distribution of hyperparameters is correct, and the marginal posterior of x for the last hospital is correct, but the other marginals are wrong because they do not take into account data from subsequent hospitals.
Here are the posterior distributions of the hyperparameters, compared to the results from PyMC.
Parallel updates
Doing updates one at time is not quite right, but it gives us an insight.
Suppose we start with a uniform distribution for the hyperparameters and do an update with data from one hospital. If we extract the posterior joint distribution of the hyperparameters, what we get is the likelihood function associated with one dataset.
The following function computes these likelihood functions and saves them in an array called hyper_likelihood.
We can multiply this out to get the product of the likelihoods.
This is useful because it provides an efficient way to compute the marginal posterior distribution of x for any hospital. Here's an example.
Suppose we did the updates serially and saved this hospital for last. The prior distribution for the final update would reflect the updates from all previous hospitals, which we can compute by dividing out hyper_likelihood[i].
We can use hyper_i to make the prior for the last update.
And then do the update.
And we can confirm that the results are similar to the results from PyMC.
Compute all marginals
The following function computes the marginals for all hospitals and stores the results in an array.
Here's what the results look like, compared to the results from PyMC.
And here are the percentage differences between the results from the grid algorithm and PyMC. Most of them are less than 1%.
The total time to do all of these computations is about 300 ms, compared to more than 10 seconds to make and run the PyMC model. And PyMC used 4 cores; I only used one.
The grid algorithm is easy to parallelize, and it's incremental. If you get data from a new hospital, or new data for an existing one, you can:
Compute the posterior distribution of
xfor the updated hospital, using existinghyper_likelihoodsfor the other hospitals.Update
hyper_likelihoodsfor the other hospitals, and run their updates again.
The total time would be about half of what it takes to start from scratch, and it's easy to parallelize.
One drawback of the grid algorithm is that it generates marginal distributions for each hospital rather than a sample from the joint distribution of all of them. So it's less easy to see the correlations among them.
The other drawback, in general, is that it takes more work to set up the grid algorithm. If we switch to another parameterization, it's easier to change the PyMC model.