Path: blob/master/examples/dirichlet_soln.ipynb
1901 views
Lions and Tigers and Bears
Suppose we visit a wild animal preserve where we know that the only animals are lions and tigers and bears, but we don't know how many of each there are.
During the tour, we see 3 lions, 2 tigers and one bear. Assuming that every animal had an equal chance to appear in our sample, estimate the prevalence of each species.
What is the probability that the next animal we see is a bear?
Grid algorithm
I'll start with a grid algorithm, enumerating the space of prevalences, p1, p2, and p3, that add up to 1, and computing the likelihood of the data for each triple of prevalences.
Here's a simple way to find eligible triplets, but it is inefficient, and it runs into problems with floating-point approximations.
As an exercise, write a better version of enumerate_triples.
Now we can initialize the suite.
Here are functions for displaying the distributions
Here are the prior distributions
Now we can do the update.
And here are the posteriors.
To get the predictive probability of a bear, we can take the mean of the posterior marginal distribution:
Or we can do a pseudo-update and use the total probability of the data.
Using the Dirichlet object
The Dirichlet distribution is the conjugate prior for this likelihood function, so we can use the Dirichlet object to do the update.
The following is a monkey patch that gives Dirichlet objects a Marginal method.
Here are the priors:
Here's the update.
Here are the posterior PDFs.
And the CDFs.
And we can confirm that we get the same results as the grid algorithm.
MCMC
Exercise: Implement this model using MCMC. You might want to start with this example.
Here's the data.
Here's the MCMC model:
Check the traceplot
And let's see the results.
And compare them to what we got with Dirichlet:
Using a Multinomial distribution
Here's another solution that uses a Multinomial distribution instead of a Categorical. In this case, we represent the observed data using just the counts, [3, 2, 1], rather than a specific sequence of observations [0,0,0,1,1,2].
I suspect that this is a better option; because it uses a less specific representation of the data (without losing any information), I would expect the probability space to be easier to search.
This solution is based on this excellent notebook from Will Koehrsen.
The results look good. We can use summary to get the posterior means, and other summary stats.
We can also use plot_posterior to get a better view of the results.