Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/thinkbayes2
Path: blob/master/examples/lincoln.ipynb
1901 views
Kernel: Python 3

The Lincoln Index Problem

# If we're running on Colab, install libraries import sys IN_COLAB = 'google.colab' in sys.modules if IN_COLAB: !pip install arviz==0.6.1 !pip install pymc3==3.8 !pip install Theano==1.0.4

In an excellent blog post, John D. Cook wrote about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.

Here's his presentation of the problem:

"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."

Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?

I'll use the following notation for the data:

  • k11 is the number of bugs found by both testers,

  • k10 is the number of bugs found by the first tester but not the second,

  • k01 is the number of bugs found by the second tester but not the first, and

  • k00 is the unknown number of undiscovered bugs.

Here are the values for all but k00:

k10 = 20 - 3 k01 = 15 - 3 k11 = 3 num_seen = k01 + k10 + k11 num_seen

Here's the model:

import pymc3 as pm with pm.Model() as model5: p0 = pm.Beta('p0', alpha=1, beta=1) p1 = pm.Beta('p1', alpha=1, beta=1) N = pm.DiscreteUniform('N', num_seen, 350) q0 = 1-p0 q1 = 1-p1 ps = [q0*q1, q0*p1, p0*q1, p0*p1] k00 = N - num_seen data = pm.math.stack((k00, k01, k10, k11)) y = pm.Multinomial('y', n=N, p=ps, observed=data) with model5: trace5 = pm.sample(1000) with model5: pm.plot_posterior(trace5) with model5: pm.traceplot(trace5)