CoCalc -- lincoln.py

GitHub Repository: allendowney/thinkbayes2
Path: blob/master/scripts/lincoln.py
¹⁹⁰¹ views
1
"""This file contains code used in "Think Stats",
2
by Allen B. Downey, available from greenteapress.com
3

4
Copyright 2014 Allen B. Downey
5
License: GNU GPLv3 http://www.gnu.org/licenses/gpl.html
6
"""
7

8
from __future__ import print_function, division
9

10
import thinkbayes2
11
import thinkplot
12

13
import numpy
14

15
"""
16
Bayesian solution to the Lincoln index, described in a blog
17
article at Probably Overthinking It.
18

19
http://tinyurl.com/lincoln14
20

21
Last year my occasional correspondent John D. Cook wrote an excellent
22
blog post about the Lincoln index, which is a way to estimate the
23
number of errors in a document (or program) by comparing results from
24
two independent testers. 
25

26
http://www.johndcook.com/blog/2010/07/13/lincoln-index/
27

28
Here's his presentation of the problem:
29

30
   "Suppose you have a tester who finds 20 bugs in your program. You
31
want to estimate how many bugs are really in the program. You know
32
there are at least 20 bugs, and if you have supreme confidence in your
33
tester, you may suppose there are around 20 bugs. But maybe your
34
tester isn't very good. Maybe there are hundreds of bugs. How can you
35
have any idea how many bugs there are? There's no way to know with one
36
tester. But if you have two testers, you can get a good idea, even if
37
you don't know how skilled the testers are."
38

39
Then he presents the Lincoln index, an estimator "described by
40
Frederick Charles Lincoln in 1930," where Wikpedia's use of
41
"described" is a hint that the index is another example of Stigler's
42
law of eponymy.
43

44
    "Suppose two testers independently search for bugs. Let k1 be the
45
number of errors the first tester finds and k2 the number of errors
46
the second tester finds. Let c be the number of errors both testers
47
find. The Lincoln Index estimates the total number of errors as k1 k2
48
/ c [I changed his notation to be consistent with mine]."
49

50
So if the first tester finds 20 bugs, the second finds 15, and they
51
find 3 in common, we estimate that there are about 100 bugs.
52

53
Of course, whenever I see something like this, the idea that pops into
54
my head is that there must be a (better) Bayesian solution! And there
55
is.
56

57
"""
58

59
def choose(n, k, d={}):
60
    """The binomial coefficient "n choose k".
61

62
    Args:
63
      n: number of trials
64
      k: number of successes
65
      d: map from (n,k) tuples to cached results
66

67
    Returns:
68
      int
69
    """
70
    if k == 0:
71
        return 1
72
    if n == 0:
73
        return 0
74

75
    try:
76
        return d[n, k]
77
    except KeyError:
78
        res = choose(n-1, k) + choose(n-1, k-1)
79
        d[n, k] = res
80
        return res
81

82
def binom(k, n, p):
83
    """Computes the rest of the binomial PMF.
84

85
    k: number of hits
86
    n: number of attempts
87
    p: probability of a hit
88
    """
89
    return p**k * (1-p)**(n-k)
90

91

92
class Lincoln(thinkbayes2.Suite, thinkbayes2.Joint):
93
    """Represents hypotheses about the number of errors."""
94

95
    def Likelihood(self, data, hypo):
96
        """Computes the likelihood of the data under the hypothesis.
97

98
        hypo: n, p1, p2
99
        data: k1, k2, c
100
        """
101
        n, p1, p2 = hypo
102
        k1, k2, c = data
103

104
        part1 = choose(n, k1) * binom(k1, n, p1)
105
        part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
106
        return part1 * part2
107

108

109
def main():
110

111
    data = 20, 15, 3
112
    probs = numpy.linspace(0, 1, 31)
113
    hypos = []
114
    for n in range(32, 350):
115
        for p1 in probs:
116
            for p2 in probs:
117
                hypos.append((n, p1, p2))
118

119
    suite = Lincoln(hypos)
120
    suite.Update(data)
121

122
    n_marginal = suite.Marginal(0)
123

124
    thinkplot.Pmf(n_marginal, label='n')
125
    thinkplot.Save(root='lincoln1',
126
                   xlabel='number of bugs',
127
                   ylabel='PMF',
128
                   formats=['pdf', 'png'])
129

130
    print('post mean n', n_marginal.Mean())
131
    print('MAP n', n_marginal.MaximumLikelihood())
132
    
133
    
134
if __name__ == '__main__':
135
    main()
136

137
Product

Resources

Company