Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/thinkbayes2
Path: blob/master/scripts/lincoln.py
1901 views
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3
4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7
8
from __future__ import print_function, division
9
10
import thinkbayes2
11
import thinkplot
12
13
import numpy
14
15
"""
16
Bayesian solution to the Lincoln index, described in a blog
17
article at Probably Overthinking It.
18
19
http://tinyurl.com/lincoln14
20
21
Last year my occasional correspondent John D. Cook wrote an excellent
22
blog post about the Lincoln index, which is a way to estimate the
23
number of errors in a document (or program) by comparing results from
24
two independent testers.
25
26
http://www.johndcook.com/blog/2010/07/13/lincoln-index/
27
28
Here's his presentation of the problem:
29
30
"Suppose you have a tester who finds 20 bugs in your program. You
31
want to estimate how many bugs are really in the program. You know
32
there are at least 20 bugs, and if you have supreme confidence in your
33
tester, you may suppose there are around 20 bugs. But maybe your
34
tester isn't very good. Maybe there are hundreds of bugs. How can you
35
have any idea how many bugs there are? There's no way to know with one
36
tester. But if you have two testers, you can get a good idea, even if
37
you don't know how skilled the testers are."
38
39
Then he presents the Lincoln index, an estimator "described by
40
Frederick Charles Lincoln in 1930," where Wikpedia's use of
41
"described" is a hint that the index is another example of Stigler's
42
law of eponymy.
43
44
"Suppose two testers independently search for bugs. Let k1 be the
45
number of errors the first tester finds and k2 the number of errors
46
the second tester finds. Let c be the number of errors both testers
47
find. The Lincoln Index estimates the total number of errors as k1 k2
48
/ c [I changed his notation to be consistent with mine]."
49
50
So if the first tester finds 20 bugs, the second finds 15, and they
51
find 3 in common, we estimate that there are about 100 bugs.
52
53
Of course, whenever I see something like this, the idea that pops into
54
my head is that there must be a (better) Bayesian solution! And there
55
is.
56
57
"""
58
59
def choose(n, k, d={}):
60
"""The binomial coefficient "n choose k".
61
62
Args:
63
n: number of trials
64
k: number of successes
65
d: map from (n,k) tuples to cached results
66
67
Returns:
68
int
69
"""
70
if k == 0:
71
return 1
72
if n == 0:
73
return 0
74
75
try:
76
return d[n, k]
77
except KeyError:
78
res = choose(n-1, k) + choose(n-1, k-1)
79
d[n, k] = res
80
return res
81
82
def binom(k, n, p):
83
"""Computes the rest of the binomial PMF.
84
85
k: number of hits
86
n: number of attempts
87
p: probability of a hit
88
"""
89
return p**k * (1-p)**(n-k)
90
91
92
class Lincoln(thinkbayes2.Suite, thinkbayes2.Joint):
93
"""Represents hypotheses about the number of errors."""
94
95
def Likelihood(self, data, hypo):
96
"""Computes the likelihood of the data under the hypothesis.
97
98
hypo: n, p1, p2
99
data: k1, k2, c
100
"""
101
n, p1, p2 = hypo
102
k1, k2, c = data
103
104
part1 = choose(n, k1) * binom(k1, n, p1)
105
part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
106
return part1 * part2
107
108
109
def main():
110
111
data = 20, 15, 3
112
probs = numpy.linspace(0, 1, 31)
113
hypos = []
114
for n in range(32, 350):
115
for p1 in probs:
116
for p2 in probs:
117
hypos.append((n, p1, p2))
118
119
suite = Lincoln(hypos)
120
suite.Update(data)
121
122
n_marginal = suite.Marginal(0)
123
124
thinkplot.Pmf(n_marginal, label='n')
125
thinkplot.Save(root='lincoln1',
126
xlabel='number of bugs',
127
ylabel='PMF',
128
formats=['pdf', 'png'])
129
130
print('post mean n', n_marginal.Mean())
131
print('MAP n', n_marginal.MaximumLikelihood())
132
133
134
if __name__ == '__main__':
135
main()
136
137