Think Bayes
Second Edition
Copyright 2020 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Introduction
Three dimensions!
The Grizzly Bear Problem
In 1996 and 1997 Mowat and Strobeck deployed bear traps in locations in British Columbia and Alberta, in an effort to estimate the population of grizzly bears. They describe the experiment in "Estimating Population Size of Grizzly Bears Using Hair Capture, DNA Profiling, and Mark-Recapture Analysis"
The "trap" consists of a lure and several strands of barbed wire intended to capture samples of hair from bears that visit the lure. Using the hair samples, the researchers use DNA analysis to identify individual bears.
During the first session, on June 29, 1996, the researchers deployed traps at 76 sites. Returning 10 days later, they obtained 1043 hair samples and identified 23 different bears. During a second 10-day session they obtained 1191 samples from 19 different bears, where 4 of the 19 were from bears they had identified in the first batch.
To estimate the population of bears from this data, we need a model for the probability that each bear will be observed during each session. As a starting place, we'll make the simplest assumption, that every bear in the population has the same (unknown) probability of being sampled during each round.
With these assumptions we can compute the probability of the data for a range of possible populations.
As an example, let's suppose that the actual population of bears is 200.
After the first session, 23 of the 200 bears have been identified. During the second session, if we choose 19 bears at random, what is the probability that 4 of them were previously identified?
I'll define
N: actual (unknown) population size, 200.
K: number of bears identified in the first session, 23.
n: number of bears observed in the second session, 19 in the example.
k: the number of bears in the second session that had previously been identified, 4.
For given values of N, K, and n, the distribution of k is described by the hypergeometric distribution:
To understand why, consider:
The denominator, , is the number of subsets of we could choose from a population of bears.
The numerator is the number of subsets that contain bears from the previously identified and from the previously unseen .
SciPy provides hypergeom, which we can use to compute this PMF.
So that's the distribution of k given N, K, and n. Now let's go the other way: given K, n, and k, how can we estimate the total population, N?
As a starting place, let's suppose that, prior to this study, an expert in this domain would have estimated that the population is between 50 and 500, and equally likely to be any value in that range.
So that's our prior.
To compute the likelihood of the data, we can use hypergeom with constants K and n, and a range of values of N.
We can compute the posterior in the usual way.
And here's what it looks like.
The most likely value is 109.
But the distribution is skewed to the right, so the posterior mean is substantially higher.
Two parameter model
Two parameters better than one?
The Lincoln index problem
A few years ago my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.
http://www.johndcook.com/blog/2010/07/13/lincoln-index/
Here's his presentation of the problem:
"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."
Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?