Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/thinkbayes2
Path: blob/master/examples/dirichlet_soln.ipynb
1901 views
Kernel: Python 3

Think Bayes

Copyright 2018 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

# Configure Jupyter so figures appear in the notebook %matplotlib inline # Configure Jupyter to display the assigned value after an assignment %config InteractiveShell.ast_node_interactivity='last_expr_or_assign' import numpy as np import pandas as pd import matplotlib.pyplot as plt # import classes from thinkbayes2 from thinkbayes2 import Pmf, Cdf, Suite, Joint import thinkplot

Lions and Tigers and Bears

Suppose we visit a wild animal preserve where we know that the only animals are lions and tigers and bears, but we don't know how many of each there are.

During the tour, we see 3 lions, 2 tigers and one bear. Assuming that every animal had an equal chance to appear in our sample, estimate the prevalence of each species.

What is the probability that the next animal we see is a bear?

Grid algorithm

I'll start with a grid algorithm, enumerating the space of prevalences, p1, p2, and p3, that add up to 1, and computing the likelihood of the data for each triple of prevalences.

class LionsTigersBears(Suite, Joint): def Likelihood(self, data, hypo): """ data: string 'L' , 'T', 'B' hypo: p1, p2, p3 """ # Fill this in.
# Solution class LionsTigersBears(Suite, Joint): def Likelihood(self, data, hypo): """ data: string 'L' , 'T', 'B' hypo: p1, p2, p3 """ p1, p2, p3 = hypo if data == 'L': return p1 if data == 'T': return p2 if data == 'B': return p3
ps = np.linspace(0, 1, 101);

Here's a simple way to find eligible triplets, but it is inefficient, and it runs into problems with floating-point approximations.

from itertools import product def enumerate_triples(ps): for p1, p2, p3 in product(ps, ps, ps): if p1+p2+p3 == 1: yield p1, p2, p3

As an exercise, write a better version of enumerate_triples.

# Solution from itertools import product def enumerate_triples(ps): for p1, p2 in product(ps, ps): if p1 + p2 > 1: continue p3 = 1 - p1 - p2 yield p1, p2, p3

Now we can initialize the suite.

suite = LionsTigersBears(enumerate_triples(ps));

Here are functions for displaying the distributions

def plot_marginal_pmfs(joint): pmf_lion = joint.Marginal(0) pmf_tiger = joint.Marginal(1) pmf_bear = joint.Marginal(2) thinkplot.Pdf(pmf_lion, label='lions') thinkplot.Pdf(pmf_tiger, label='tigers') thinkplot.Pdf(pmf_bear, label='bears') thinkplot.decorate(xlabel='Prevalence', ylabel='PMF')
def plot_marginal_cdfs(joint): pmf_lion = joint.Marginal(0) pmf_tiger = joint.Marginal(1) pmf_bear = joint.Marginal(2) thinkplot.Cdf(pmf_lion.MakeCdf(), label='lions') thinkplot.Cdf(pmf_tiger.MakeCdf(), label='tigers') thinkplot.Cdf(pmf_bear.MakeCdf(), label='bears') thinkplot.decorate(xlabel='Prevalence', ylabel='CDF')

Here are the prior distributions

plot_marginal_cdfs(suite)
Image in a Jupyter notebook

Now we can do the update.

for data in 'LLLTTB': suite.Update(data)

And here are the posteriors.

plot_marginal_cdfs(suite)
Image in a Jupyter notebook

To get the predictive probability of a bear, we can take the mean of the posterior marginal distribution:

suite.Marginal(2).Mean()
0.22232592246925062

Or we can do a pseudo-update and use the total probability of the data.

suite.Copy().Update('B')
0.22232592246925043

Using the Dirichlet object

The Dirichlet distribution is the conjugate prior for this likelihood function, so we can use the Dirichlet object to do the update.

The following is a monkey patch that gives Dirichlet objects a Marginal method.

from thinkbayes2 import Dirichlet def DirichletMarginal(dirichlet, i): return dirichlet.MarginalBeta(i).MakePmf() Dirichlet.Marginal = DirichletMarginal

Here are the priors:

dirichlet = Dirichlet(3) plot_marginal_cdfs(dirichlet)
Image in a Jupyter notebook

Here's the update.

dirichlet.Update((3, 2, 1))

Here are the posterior PDFs.

plot_marginal_pmfs(dirichlet)
Image in a Jupyter notebook

And the CDFs.

plot_marginal_cdfs(dirichlet)
Image in a Jupyter notebook

And we can confirm that we get the same results as the grid algorithm.

thinkplot.PrePlot(6) plot_marginal_cdfs(dirichlet) plot_marginal_cdfs(suite)
Image in a Jupyter notebook

MCMC

Exercise: Implement this model using MCMC. You might want to start with this example.

import warnings warnings.simplefilter('ignore', FutureWarning) import pymc3 as pm

Here's the data.

observed = [0,0,0,1,1,2] k = len(Pmf(observed)) a = np.ones(k)
array([1., 1., 1.])

Here's the MCMC model:

# Solution model = pm.Model() with model: ps = pm.Dirichlet('ps', a, shape=a.shape) xs = pm.Categorical('xs', ps, observed=observed, shape=1) model
ps∼Dirichlet(a=array)xs∼Categorical(p=f(f(ps.T, f(f(ps)))))\begin{array}{rcl} ps &\sim & \text{Dirichlet}(\mathit{a}=array)\\\text{xs} &\sim & \text{Categorical}(\mathit{p}=f(f(\text{ps.T},~f(f(\text{ps}))))) \end{array}
# Solution with model: start = pm.find_MAP() step = pm.Metropolis() trace = pm.sample(1000, start=start, step=step, tune=1000)
logp = -5.8985, ||grad|| = 1.118: 100%|██████████| 7/7 [00:00<00:00, 1525.99it/s] Multiprocess sampling (4 chains in 4 jobs) Metropolis: [ps] Sampling 4 chains: 100%|██████████| 8000/8000 [00:01<00:00, 6107.20draws/s] The number of effective samples is smaller than 25% for some parameters.

Check the traceplot

pm.traceplot(trace);
Image in a Jupyter notebook

And let's see the results.

def plot_trace_cdfs(trace): rows = trace['ps'].transpose() cdf_lion = Cdf(rows[0]) cdf_tiger = Cdf(rows[1]) cdf_bear = Cdf(rows[2]) thinkplot.Cdf(cdf_lion, label='lions') thinkplot.Cdf(cdf_tiger, label='tigers') thinkplot.Cdf(cdf_bear, label='bears') thinkplot.decorate(xlabel='Prevalence', ylabel='CDF')
plot_trace_cdfs(trace)
Image in a Jupyter notebook

And compare them to what we got with Dirichlet:

thinkplot.PrePlot(6) plot_marginal_cdfs(dirichlet) plot_trace_cdfs(trace)
Image in a Jupyter notebook

Using a Multinomial distribution

Here's another solution that uses a Multinomial distribution instead of a Categorical. In this case, we represent the observed data using just the counts, [3, 2, 1], rather than a specific sequence of observations [0,0,0,1,1,2].

I suspect that this is a better option; because it uses a less specific representation of the data (without losing any information), I would expect the probability space to be easier to search.

This solution is based on this excellent notebook from Will Koehrsen.

animals = ['lions', 'tigers', 'bears'] c = np.array([3, 2, 1]) a = np.array([1, 1, 1])
array([1, 1, 1])
warnings.simplefilter('ignore', UserWarning) with pm.Model() as model: # Probabilities for each species ps = pm.Dirichlet('ps', a=a, shape=3) # Observed data is a multinomial distribution with 6 trials xs = pm.Multinomial('xs', n=6, p=ps, shape=3, observed=c)
model
ps∼Dirichlet(a=array)xs∼Multinomial(n=6,p=f(ps, f(f(ps))))\begin{array}{rcl} ps &\sim & \text{Dirichlet}(\mathit{a}=array)\\xs &\sim & \text{Multinomial}(\mathit{n}=6, \mathit{p}=f(\text{ps},~f(f(\text{ps})))) \end{array}
with model: # Sample from the posterior trace = pm.sample(draws=1000, tune=1000)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [ps] Sampling 4 chains: 100%|██████████| 8000/8000 [00:02<00:00, 3157.96draws/s]
pm.traceplot(trace);
Image in a Jupyter notebook
thinkplot.PrePlot(6) plot_marginal_cdfs(dirichlet) plot_trace_cdfs(trace)
Image in a Jupyter notebook

The results look good. We can use summary to get the posterior means, and other summary stats.

summary = pm.summary(trace) summary.index = animals summary

We can also use plot_posterior to get a better view of the results.

ax = pm.plot_posterior(trace, varnames = ['ps']); for i, a in enumerate(animals): ax[i].set_title(a)
Image in a Jupyter notebook