Path: blob/master/examples/elephants_soln.ipynb
1901 views
Cats and rats and elephants
Suppose there are six species that might be in a zoo: lions and tigers and bears, and cats and rats and elephants. Every zoo has a subset of these species, and every subset is equally likely.
One day we visit a zoo and see 3 lions, 2 tigers, and one bear. Assuming that every animal in the zoo has an equal chance to be seen, what is the probability that the next animal we see is an elephant?
Solution
I'll start by enumerating all possible zoos with itertools.
Now we can enumerate only the zoos that are possible, given a set of animals known to be present.
Here are the possible zoos.
To represent the prior and posterior distributions I'll use a hierarchical model with one Dirichlet object for each possible zoo.
At the bottom of the hierarchy, it is easy to update each Dirichlet object just by adding the observed frequencies to the parameters.
In order to update the top of the hierarchy, we need the total probability of the data for each hypothetical zoo. When we do an update using grid algorithms, we get the probability of the data free, since it is the normalizing constant.
But when we do an update using a conjugate distribution, we don't get the total probability of the data, and for a Dirichlet distribution it is not easy to compute.
However, we can estimate it by drawing samples from the Dirichlet distribution, and then computing the probability of the data for each sample.
Here's an example that represents a zoo with 4 animals.
Here's a sample from it.
Now we can compute the probability of the data, given these prevalences, using the multinomial distribution.
Since I only observed 3 species, and my hypothetical zoo has 4, I had to zero-pad the data. Here's a function that makes that easier:
Here's an example:
Let's pull all that together. Here's a function that estimates the total probability of the data by sampling from the dirichlet distribution:
And here's an example:
Now we're ready to solve the problem.
Here's a Suite that represents the set of possible zoos. The likelihood of any zoo is just the total probability of the data.
We can construct the prior by enumerating the possible zoos.
We can update the top level of the hierarchy by calling Update
We have to update the bottom level explicitly.
Here's the posterior for the top level.
Here's how we can get the posterior distribution of n, the number of species.
And here's what it looks like.
Now, to answer the question, we have to compute the posterior distribution of the prevalence of elephants. Here's a function that computes it.
Here are the possible zoos, the posterior probability of each, and the conditional prevalence of elephants for each.
Finally, we can use the law of total probability to compute the probability of seeing an elephant.