Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/thinkbayes2
Path: blob/master/soln/combinatorics.ipynb
1901 views
Kernel: Python 3

Think Bayes

Second Edition

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

# If we're running on Colab, install empiricaldist # https://pypi.org/project/empiricaldist/ import sys IN_COLAB = 'google.colab' in sys.modules if IN_COLAB: !pip install empiricaldist
# Get utils.py and create directories import os if not os.path.exists('utils.py'): !wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py if not os.path.exists('figs'): !mkdir figs
import numpy as np import pandas as pd import matplotlib.pyplot as plt from empiricaldist import Pmf, Cdf from utils import decorate, savefig

Introduction

Three dimensions!

The Grizzly Bear Problem

In 1996 and 1997 Mowat and Strobeck deployed bear traps in locations in British Columbia and Alberta, in an effort to estimate the population of grizzly bears. They describe the experiment in "Estimating Population Size of Grizzly Bears Using Hair Capture, DNA Profiling, and Mark-Recapture Analysis"

The "trap" consists of a lure and several strands of barbed wire intended to capture samples of hair from bears that visit the lure. Using the hair samples, the researchers use DNA analysis to identify individual bears.

During the first session, on June 29, 1996, the researchers deployed traps at 76 sites. Returning 10 days later, they obtained 1043 hair samples and identified 23 different bears. During a second 10-day session they obtained 1191 samples from 19 different bears, where 4 of the 19 were from bears they had identified in the first batch.

To estimate the population of bears from this data, we need a model for the probability that each bear will be observed during each session. As a starting place, we'll make the simplest assumption, that every bear in the population has the same (unknown) probability of being sampled during each round.

With these assumptions we can compute the probability of the data for a range of possible populations.

As an example, let's suppose that the actual population of bears is 200.

After the first session, 23 of the 200 bears have been identified. During the second session, if we choose 19 bears at random, what is the probability that 4 of them were previously identified?

I'll define

  • N: actual (unknown) population size, 200.

  • K: number of bears identified in the first session, 23.

  • n: number of bears observed in the second session, 19 in the example.

  • k: the number of bears in the second session that had previously been identified, 4.

For given values of N, K, and n, the distribution of k is described by the hypergeometric distribution:

PMF(k)=(Kk)(NKnk)/(Nn)PMF(k) = {K \choose k}{N-K \choose n-k}/{N \choose n}

To understand why, consider:

  • The denominator, (Nn){ N \choose n}, is the number of subsets of nn we could choose from a population of NN bears.

  • The numerator is the number of subsets that contain kk bears from the previously identified KK and nkn-k from the previously unseen NKN-K.

SciPy provides hypergeom, which we can use to compute this PMF.

from scipy.stats import hypergeom N = 100 K = 23 n = 19 ks = np.arange(12) ps = hypergeom(N, K, n).pmf(ks) plt.bar(ks, ps) decorate(xlabel='Number of bears observed twice', ylabel='PMF', title='Hypergeometric distribution of k (known population 200)')
Image in a Jupyter notebook

So that's the distribution of k given N, K, and n. Now let's go the other way: given K, n, and k, how can we estimate the total population, N?

As a starting place, let's suppose that, prior to this study, an expert in this domain would have estimated that the population is between 50 and 500, and equally likely to be any value in that range.

Ns = np.arange(50, 501) prior_N = Pmf(1, Ns) prior_N.index.name = 'N'

So that's our prior.

To compute the likelihood of the data, we can use hypergeom with constants K and n, and a range of values of N.

K = 23 n = 19 k = 4 likelihood = hypergeom(Ns, K, n).pmf(k)

We can compute the posterior in the usual way.

posterior_N = prior_N * likelihood posterior_N.normalize()
34.97606148975133

And here's what it looks like.

posterior_N.plot() decorate(xlabel='Population of bears (N)', ylabel='PDF', title='Posterior distribution of N')
Image in a Jupyter notebook

The most likely value is 109.

posterior_N.max_prob()
109

But the distribution is skewed to the right, so the posterior mean is substantially higher.

posterior_N.mean()
173.79880627085643
posterior_N.credible_interval(0.9)
array([ 77., 363.])

Two parameter model

ps = np.linspace(0, 1, 101) prior_p = Pmf(1, ps) prior_p.index.name = 'p'
from utils import make_joint joint_prior = make_joint(prior_N, prior_p)
N_mesh, p_mesh = np.meshgrid(Ns, ps) N_mesh.shape
(101, 451)
from scipy.stats import binom like1 = binom.pmf(K, N_mesh, p_mesh) like1.sum()
229.56130903199954
like2 = binom.pmf(k, K, p_mesh) * binom.pmf(n-k, N_mesh-K, p_mesh) like2.sum()
24.771767481921685
from utils import normalize joint_posterior = joint_prior * like1 * like2 normalize(joint_posterior)
def plot_contour(joint, **options): """Plot a joint distribution. joint: DataFrame representing a joint PMF """ cs = plt.contour(joint.columns, joint.index, joint, **options) decorate(xlabel=joint.columns.name, ylabel=joint.index.name) return cs
plot_contour(joint_posterior) decorate(title='Joint posterior distribution of N and p')
Image in a Jupyter notebook
from utils import marginal marginal_N = marginal(joint_posterior, 0) posterior_N.plot(color='gray') marginal_N.plot() decorate(xlabel='Population of bears (N)', ylabel='PDF', title='Posterior marginal distribution of N')
Image in a Jupyter notebook
marginal_N.mean(), marginal_N.credible_interval(0.9)
(138.7505213647261, array([ 68., 277.]))
marginal_p = marginal(joint_posterior, 1) marginal_p.plot() decorate(xlabel='Probability of observing a bear', ylabel='PDF', title='Posterior marginal distribution of p')
Image in a Jupyter notebook
from seaborn import JointGrid def joint_plot(joint, **options): x = joint.columns.name x = 'x' if x is None else x y = joint.index.name y = 'y' if y is None else y # make a JointGrid with minimal data data = pd.DataFrame({x:[0], y:[0]}) g = JointGrid(x, y, data, **options) # replace the contour plot g.ax_joint.contour(joint.columns, joint.index, joint, cmap='viridis') # replace the marginals marginal_x = marginal(joint, 0) g.ax_marg_x.plot(marginal_x.qs, marginal_x.ps) marginal_y = marginal(joint, 1) g.ax_marg_y.plot(marginal_y.ps, marginal_y.qs)
joint_plot(joint_posterior)
Image in a Jupyter notebook

Two parameters better than one?

mean = (23 + 19) / 2 N1 = 138 p = mean/N1 p
0.15217391304347827
from scipy.stats import binom binom(N1, p).std()
4.219519857292647
binom(N1, p).pmf([23, 19]).prod()
0.007113233315460428
N2 = 173 p = mean/N2 p
0.12138728323699421
binom(N2, p).std()
4.2954472470306415
binom(N2, p).pmf([23, 19]).prod()
0.006918408214247037

The Lincoln index problem

A few years ago my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.

http://www.johndcook.com/blog/2010/07/13/lincoln-index/

Here's his presentation of the problem:

"Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There's no way to know with one tester. But if you have two testers, you can get a good idea, even if you don't know how skilled the testers are."

Suppose the first tester finds 20 bugs, the second finds 15, and they find 3 in common; how can we estimate the number of bugs?

n0 = 20 n1 = 15 k11 = 3
k10 = n0 - k11 k01 = n1 - k11 k10, k01
(17, 12)
Ns = np.arange(32, 350) prior_N = Pmf(1, Ns) prior_N.index.name = 'N'
p0, p1 = 0.2, 0.15 like0 = binom.pmf(n0, Ns, p0) like0.sum()
4.9999998675182855
like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1) like1.sum()
1.6188593076346907
likelihood = like0 * like1 likelihood.shape
(318,)
prior_N.shape
(318,)
posterior_N = prior_N * likelihood posterior_N.normalize()
0.10960645032182464
posterior_N.plot() decorate(xlabel='n', ylabel='PMF', title='Posterior marginal distribution of n with known p1, p2')
Image in a Jupyter notebook
posterior_N.mean()
102.12500000000003

Unknown probabilities

p0 = np.linspace(0, 1, 61) prior_p0 = Pmf(1, p0) prior_p0.index.name = 'p0'
p1 = np.linspace(0, 1, 51) prior_p1 = Pmf(1, p1) prior_p1.index.name = 'p1'
from utils import make_joint joint = make_joint(prior_p0, prior_p1) joint.shape
(51, 61)
joint_pmf = Pmf(joint.transpose().stack()) joint_pmf.head()
p0 p1 0.0 0.00 1 0.02 1 0.04 1 0.06 1 0.08 1 dtype: int64
joint_prior = make_joint(prior_N, joint_pmf) joint_prior.shape
(3111, 318)
joint_prior.head()
likelihood = joint_prior.copy() Ns = joint_prior.columns for (p0, p1) in joint_prior.index: like0 = binom.pmf(n0, Ns, p0) like1 = binom.pmf(k01, Ns-n0, p1) * binom.pmf(k11, n0, p1) likelihood.loc[p0, p1] = like0 * like1 likelihood.to_numpy().sum()
8.930285561728951
from utils import normalize joint_posterior = joint_prior * likelihood normalize(joint_posterior) joint_posterior.shape
(3111, 318)
from utils import marginal posterior_N = marginal(joint_posterior, 0) posterior_N.plot()
Image in a Jupyter notebook
posterior_pmf = marginal(joint_posterior, 1) posterior_pmf.shape
(3111,)
posterior_joint_ps = posterior_pmf.unstack().transpose() posterior_joint_ps.head()
def plot_contour(joint, **options): """Plot a joint distribution. joint: DataFrame representing a joint PMF """ cs = plt.contour(joint.columns, joint.index, joint, **options) decorate(xlabel=joint.columns.name, ylabel=joint.index.name) return cs
plot_contour(posterior_joint_ps) decorate(title='Posterior joint distribution for p1 and p2')
Image in a Jupyter notebook
posterior_p1 = marginal(posterior_joint_ps, 0) posterior_p2 = marginal(posterior_joint_ps, 1) posterior_p1.plot(label='p1') posterior_p2.plot(label='p2') decorate(xlabel='Probability of finding a bug', ylabel='PDF', title='Posterior marginal distributions of p1 and p2')
Image in a Jupyter notebook
posterior_p1.mean(), posterior_p1.credible_interval(0.9)
(0.22970406720316164, array([0.1, 0.4]))
posterior_p2.mean(), posterior_p2.credible_interval(0.9)
(0.17501024717938662, array([0.06, 0.32]))
joint_plot(posterior_joint_ps)
Image in a Jupyter notebook

Chao et al

data = [-1000, 63, 55, 18, 69, 17, 21, 28]
index = pd.MultiIndex.from_product([[0, 1]]*3)
Kijk = pd.Series(data, index) Kijk
0 0 0 -1000 1 63 1 0 55 1 18 1 0 0 69 1 17 1 0 21 1 28 dtype: int64
Kijk.xs(1, level=0).sum()
135
Kijk.xs(1, level=1).sum()
122
Kijk.xs(1, level=2).sum()
126
n = pd.Series(1, range(3)) for level in n.index: n[level] = Kijk.xs(1, level=level).sum() n
0 135 1 122 2 126 dtype: int64
Kixx = Kijk.sum(level=0) Kixx
0 -864 1 135 dtype: int64
Kijx = Kijk.sum(level=[0,1]) Kijx
0 0 -937 1 73 1 0 86 1 49 dtype: int64
def num_observed(K): return np.asarray(K)[1:].sum()
s0 = num_observed(Kixx) s0
135
s1 = num_observed(Kijx) s1
208
s2 = num_observed(Kijk) s2
271
s = [s0, s1, s2] s
[135, 208, 271]
k = pd.concat([Kixx, Kijx, Kijk]) k
0 -864 1 135 (0, 0) -937 (0, 1) 73 (1, 0) 86 (1, 1) 49 (0, 0, 0) -1000 (0, 0, 1) 63 (0, 1, 0) 55 (0, 1, 1) 18 (1, 0, 0) 69 (1, 0, 1) 17 (1, 1, 0) 21 (1, 1, 1) 28 dtype: int64
k[1]
135
k[0,1]
73
k[0,0,1]
63
k[0]
-864
N = 300 k0 = N - s[0] k0
165
k00 = N - s[1] k00
92
k000 = N - s[2] k000
29
Ns = np.arange(s2, 500, 5) prior_N = Pmf(1, Ns) prior_N.index.name = 'N'
ps = np.linspace(0, 1, 101) prior_p = Pmf(1, ps) prior_p.index.name = 'p'
from utils import make_joint joint_prior = make_joint(prior_N, prior_p)
def likelihood_round1(ps, n, k, s, joint_prior): like = joint_prior.copy() for N in joint_prior: like[N] = binom.pmf(n[0], N, ps) return like
like1 = likelihood_round1(ps, n, k, s, joint_prior)
def likelihood_round2(ps, n, k, s, joint_prior): like = joint_prior.copy() for N in joint_prior: k0 = N - s[0] like[N] = (binom.pmf(k[0,1], k0, ps) * binom.pmf(k[1,1], k[1], ps)) return like
like2 = likelihood_round2(ps, n, k, s, joint_prior) like2.to_numpy().sum()
0.40773670523278144
joint_posterior2 = joint_prior * like1 * like2 normalize(joint_posterior2)
plot_contour(joint_posterior2) decorate(title='Joint posterior of N and p')
Image in a Jupyter notebook
marginal_N = marginal(joint_posterior2, 0) marginal_N.plot() decorate(xlabel='N', ylabel='PDF', title='Posterior marginal distribution of N')
Image in a Jupyter notebook
marginal_N.mean(), marginal_N.credible_interval(0.9)
(342.19942566476345, array([296., 396.]))
marginal_p = marginal(joint_posterior2, 1) marginal_p.plot() decorate(xlabel='p', ylabel='PDF', title='Posterior marginal distribution of p')
Image in a Jupyter notebook
def likelihood_round3(ps, n, k, s, joint_prior): like = joint_prior.copy() for N in joint_prior: k00 = N - s[1] like[N] = (binom.pmf(k[0,0,1], k00, ps) * binom.pmf(k[0,1,1], k[0,1], ps) * binom.pmf(k[1,0,1], k[1,0], ps) * binom.pmf(k[1,1,1], k[1,1], ps)) return like
like3 = likelihood_round3(ps, n, k, s, joint_prior) like3.to_numpy().sum()
1.7620042777092345e-07
joint_posterior3 = joint_posterior2 * like3 normalize(joint_posterior3)
plot_contour(joint_posterior3) decorate(title='Joint posterior of N and p')
Image in a Jupyter notebook
marginal_N = marginal(joint_posterior3, 0) marginal_N.plot() decorate(xlabel='N', ylabel='PDF', title='Posterior marginal distribution of N')
Image in a Jupyter notebook
marginal_N.mean(), marginal_N.credible_interval(0.9)
(391.0058839098183, array([356., 431.]))
marginal_p = marginal(joint_posterior3, 1) marginal_p.plot() decorate(xlabel='p', ylabel='PDF', title='Posterior marginal distribution of p')
Image in a Jupyter notebook
joint_plot(joint_posterior3)
Image in a Jupyter notebook

Spina bifida

data_sb = [-100, 60, 49, 4, 247, 112, 142, 12] num_observed(data_sb)
626
def make_stats(data, num_rounds): index = pd.MultiIndex.from_product([[0, 1]]*num_rounds) K = pd.Series(data, index) n = pd.Series(0, range(num_rounds)) for level in n.index: n[level] = K.xs(1, level=level).sum() t = [K.sum(level=list(range(i+1))) for i in range(num_rounds)] s = [num_observed(Kx) for Kx in t] k = pd.concat(t) return n, k, s
n, k, s = make_stats(data_sb, 3)
n
0 513 1 207 2 188 dtype: int64
k
0 13 1 513 (0, 0) -40 (0, 1) 53 (1, 0) 359 (1, 1) 154 (0, 0, 0) -100 (0, 0, 1) 60 (0, 1, 0) 49 (0, 1, 1) 4 (1, 0, 0) 247 (1, 0, 1) 112 (1, 1, 0) 142 (1, 1, 1) 12 dtype: int64
s
[513, 566, 626]
Ns = np.arange(s[2], 1000, 5) prior_N = Pmf(1, Ns) prior_N.index.name = 'N' prior_N.shape
(75,)
probs0 = np.linspace(0.5, 1.0, 51) prior_p0 = Pmf(1, probs0) prior_p0.index.name = 'p0' prior_p0.head()
p0 0.50 1 0.51 1 0.52 1 0.53 1 0.54 1 dtype: int64
probs1 = np.linspace(0.1, 0.5, 41) prior_p1 = Pmf(1, probs1) prior_p1.index.name = 'p1' prior_p1.head()
p1 0.10 1 0.11 1 0.12 1 0.13 1 0.14 1 dtype: int64
probs2 = np.linspace(0.1, 0.4, 31) prior_p2 = Pmf(1, probs2) prior_p2.index.name = 'p2' prior_p2.head()
p2 0.10 1 0.11 1 0.12 1 0.13 1 0.14 1 dtype: int64
def make_joint3(prior0, prior1, prior2): joint2 = make_joint(prior0, prior1) joint2_pmf = Pmf(joint2.transpose().stack()) joint3 = make_joint(prior2, joint2_pmf) return joint3
joint_prior = make_joint3(prior_p0, prior_p1, prior_N) joint_prior.head()
likelihood = joint_prior.copy() Ns = joint_prior.columns for (p0, p1) in joint_prior.index: like0 = binom.pmf(k[1], Ns, p0) k0 = Ns - s[0] like1 = binom.pmf(k[0,1], k0, p1) * binom.pmf(k[1,1], k[1], p1) likelihood.loc[p0, p1] = like0 * like1 likelihood.to_numpy().sum()
0.018731276129625645
from utils import normalize joint_posterior = joint_prior * likelihood normalize(joint_posterior) joint_posterior.shape
(2091, 75)
from utils import marginal posterior_N = marginal(joint_posterior, 0) posterior_N.plot() decorate(xlabel='N', ylabel='PDF', title='Posterior for N after two rounds') posterior_N.mean(), posterior_N.credible_interval(0.9)
(692.2431131826501, array([656., 736.]))
Image in a Jupyter notebook
posterior_pmf = marginal(joint_posterior, 1) posterior_pmf.head()
p0 p1 0.5 0.10 2.947207e-37 0.11 1.861223e-32 0.12 1.953523e-28 0.13 4.500420e-25 0.14 2.829563e-22 dtype: float64
posterior_joint_ps = posterior_pmf.unstack().transpose() posterior_joint_ps.head()
plot_contour(posterior_joint_ps) decorate(title='Posterior joint distribution for p1 and p2')
Image in a Jupyter notebook
posterior_p1 = marginal(posterior_joint_ps, 0) posterior_p2 = marginal(posterior_joint_ps, 1) posterior_p1.plot(label='p1') posterior_p2.plot(label='p2') decorate(xlabel='Probability of finding a bug', ylabel='PDF', title='Posterior marginal distributions of p1 and p2')
Image in a Jupyter notebook
joint3 = make_joint(prior_p2, posterior_pmf) joint3.head()
joint3_pmf = Pmf(joint3.stack()) joint3_pmf.head()
p0 p1 p2 0.5 0.1 0.10 2.947207e-37 0.11 2.947207e-37 0.12 2.947207e-37 0.13 2.947207e-37 0.14 2.947207e-37 dtype: float64
prior4 = make_joint(posterior_N, joint3_pmf) prior4.head()
prior4.shape
(64821, 75)
joint2 = make_joint(posterior_N, prior_p2) joint2.head()
like2 = joint2.copy() Ns = joint2.columns for p2 in joint2.index: k00 = Ns - s[1] like = (binom.pmf(k[0,0,1], k00, p2) * binom.pmf(k[0,1,1], k[0,1], p2) * binom.pmf(k[1,0,1], k[1,0], p2) * binom.pmf(k[1,1,1], k[1,1], p2)) like2.loc[p2] = like like2.to_numpy().sum()
2.874211433485911e-13
like2.head()
like4 = prior4.copy() for (p0, p1, p2) in prior4.index: like4.loc[p0, p1, p2] = like2.loc[p2] like4.to_numpy().sum()
6.009976107419042e-10
posterior4 = prior4 * like4 normalize(posterior4)
marginal_N = marginal(posterior4, 0) marginal_N.plot() decorate(xlabel='N', ylabel='PDF', title='Posterior for N after three rounds') marginal_N.mean(), marginal_N.credible_interval(0.9)
(764.7788307540033, array([731., 801.]))
Image in a Jupyter notebook
posterior_p012 = marginal(posterior4, 1) posterior_p012.unstack().head()
posterior_p2 = marginal(posterior_p012.unstack(), 0) posterior_p2.plot()
Image in a Jupyter notebook
posterior_p01 = marginal(posterior_p012.unstack(), 1) joint_plot(posterior_p01.unstack().transpose())
Image in a Jupyter notebook

Diabetes

data4 = [-10000, 10, 182, 8, 74, 7, 20, 14, 709, 12, 650, 46, 104, 18, 157, 58] num_observed(data4)
2069
n, k, s = make_stats(data4, 4)
n
0 1754 1 452 2 1135 3 173 dtype: int64
k
0 -9685 1 1754 (0, 0) -9800 (0, 1) 115 (1, 0) 1417 (1, 1) 337 (0, 0, 0) -9990 (0, 0, 1) 190 (0, 1, 0) 81 (0, 1, 1) 34 (1, 0, 0) 721 (1, 0, 1) 696 (1, 1, 0) 122 (1, 1, 1) 215 (0, 0, 0, 0) -10000 (0, 0, 0, 1) 10 (0, 0, 1, 0) 182 (0, 0, 1, 1) 8 (0, 1, 0, 0) 74 (0, 1, 0, 1) 7 (0, 1, 1, 0) 20 (0, 1, 1, 1) 14 (1, 0, 0, 0) 709 (1, 0, 0, 1) 12 (1, 0, 1, 0) 650 (1, 0, 1, 1) 46 (1, 1, 0, 0) 104 (1, 1, 0, 1) 18 (1, 1, 1, 0) 157 (1, 1, 1, 1) 58 dtype: int64
s
[1754, 1869, 2059, 2069]