\documentclass[12pt]{book}
\title{Think Bayes}
\author{Allen B. Downey}
\newcommand{\thetitle}{Think Bayes}
\newcommand{\thesubtitle}{Bayesian Statistics Made Simple}
\newcommand{\theauthor}{Allen B. Downey}
\newcommand{\theversion}{Version 2.1.0}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{setspace}
\usepackage{amsmath}
\usepackage{amsthm}
\newtheoremstyle{exercise}
{12pt}
{12pt}
{}
{}
{\bfseries}
{}
{12pt}
{}
\theoremstyle{exercise}
\newtheorem{exercise}{Exercise}[chapter]
\newif\ifplastex
\plastexfalse
\ifplastex
\makeindex
\usepackage{localdef}
\usepackage{url}
\renewcommand{\href}[2]{\url{#1}}
\makeatletter
\newcount\anchorcnt
\newcommand*{\Anchor}[1]{
\@bsphack
\Hy@GlobalStepCount\anchorcnt
\edef\@currentHref{anchor.\the\anchorcnt}
\Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}
\M@gettitle{}\label{#1}
\@esphack
}
\makeatother
\newcommand{\py}[1]{{\tt #1}}
\newcommand{\textcolor}[1]{\relax}
\else
\usepackage{comment}
\excludecomment{htmlonly}
\includecomment{latexonly}
\input{latexonly.tex}
\fi
\begin{document}
\frontmatter
\ifplastex
\maketitle
\else
\begin{latexonly}
\thispagestyle{empty}
\begin{flushright}
\vspace*{2.0in}
\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}
\vspace{0.25in}
\theversion
\vfill
\end{flushright}
\newpage
\thispagestyle{empty}
\quad
\newpage
\thispagestyle{empty}
\begin{flushright}
\vspace*{2.0in}
\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}
\vspace{0.25in}
\theversion
\vspace{1in}
{\Large \theauthor}
\vspace{0.5in}
{\Large Green Tea Press}
{\small Needham, Massachusetts}
\vfill
\end{flushright}
\newpage
\thispagestyle{empty}
Copyright \copyright ~2020 \theauthor.
\vspace{0.2in}
\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham, MA 02492
\end{flushleft}
Permission is granted to copy, distribute, and/or modify this work under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which is available at \url{https://creativecommons.org/licenses/by-nc-sa/4.0/}.
The \LaTeX\ source for this book is available from
\url{http://greenteapress.com/thinkbayes2}.
\cleardoublepage
\setcounter{tocdepth}{1}
\tableofcontents
\end{latexonly}
\begin{htmlonly}
\vspace{1em}
{\Large \thetitle: \thesubtitle}
{\large \theauthor}
\theversion
\vspace{1em}
Copyright \copyright ~2020 \theauthor.
Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License, which is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.
\setcounter{chapter}{-1}
\end{htmlonly}
\fi
\chapter{Preface}
\label{preface}
\section{My theory, which is mine}
The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.
Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.
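As a rough sketch of what this looks like in practice, the following code approximates the mean of an exponential distribution, which would normally require an integral, with a sum over a grid of values.
The rate \py{lam}, the grid \py{xs}, and the use of NumPy are assumptions chosen for illustration; they are not taken from later chapters.
\begin{code}
import numpy as np

# Assumed example: an exponential distribution with rate 2,
# whose exact mean is 1/2.
lam = 2.0
xs = np.linspace(0, 20, 2001)      # grid of possible values
pdf = lam * np.exp(-lam * xs)      # density evaluated on the grid

# Discretize: normalize so the probabilities add up to 1
ps = pdf / pdf.sum()

# What would be an integral becomes a summation
mean = np.sum(xs * ps)
print(mean)
\end{code}
The result is close to the exact mean, 0.5, but slightly lower because of the discretization; a finer grid would bring it closer.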
I think this presentation is easier to understand, at least for people with
programming skills. It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.
Also, it provides a smooth development path from simple examples to
real-world problems. Chapter~\ref{estimation} is a good example. It
starts with a simple example involving dice, one of the staples of
basic probability. From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's {\it
Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.
\section{Modeling and approximation}
Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}
For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}
In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.
I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).
Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors. But for
real-world problems, numerical errors are almost always
smaller than modeling errors.
Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.
On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.
So I recommend a general process with these steps:
\begin{enumerate}
\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.
\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.
\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider. You can review your code
and look for optimizations; for example, if you cache previously
computed results you might be able to avoid redundant computation.
Or you can look for analytic methods that yield computational
shortcuts.
\end{enumerate}
One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.
Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}
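Here is a minimal sketch of that kind of regression test.
Both functions are hypothetical examples, not code from the repository for this book: the first is a clear, loop-based reference implementation, and the second is a vectorized version that should produce the same result.
The priors and likelihoods are illustrative numbers.
\begin{code}
import numpy as np

def total_prob_reference(priors, likelihoods):
    """Clear, loop-based version: the reference implementation."""
    total = 0
    for prior, likelihood in zip(priors, likelihoods):
        total += prior * likelihood
    return total

def total_prob_fast(priors, likelihoods):
    """Vectorized version that should give the same answer."""
    return np.dot(priors, likelihoods)

# Illustrative inputs
priors = [1/3, 1/3, 1/3]
likelihoods = [1/6, 1/8, 1/12]

# Regression test: the results should agree, at least approximately
expected = total_prob_reference(priors, likelihoods)
actual = total_prob_fast(priors, likelihoods)
assert np.isclose(expected, actual)
\end{code}
Comparing with \py{np.isclose} rather than \py{==} allows for small differences due to floating-point arithmetic.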
\section{Working with the code}
\label{codeinfo}
There are several ways you can work with the code in this book:
\begin{itemize}
\item If you don't have a programming environment where you can run Jupyter notebooks, and you don't want to create one, you can run the notebooks on Colab, which is an online service provided by Google. Colab lets you run Jupyter notebooks in a browser without installing anything.
\item If you have Python and Jupyter installed, you can download the code and run it on your computer.
\end{itemize}
To run the notebooks on Colab, you can follow the links at the end of each chapter, or you can start from \url{}, which has links to all of the notebooks.
If you already have Python and Jupyter, you can download the code from
my Git repository, at \url{https://github.com/AllenDowney/ThinkBayes2}. Git is a version control system that allows you to keep track of the files that make up a project.
A collection of files under Git's control is
called a ``repository''.
GitHub is a hosting service that provides storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}
The GitHub homepage for my repository provides several ways to download the code:
\begin{itemize}
\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}
\item Or you could clone
my repository. You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}
\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page. Or you can download the Zip file from \url{}.
\end{itemize}
If you don't have Python and Jupyter installed already, I recommend you install Anaconda, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it installs files in your home directory, so you don't need administrator privileges. You can download Anaconda from \url{https://www.anaconda.com/products/individual}.
\index{Anaconda}
If you install Anaconda, you will have most of the packages you need to run the code in this book.
To make sure you have everything you need (and the right versions), the best option is to create a Conda environment. And the best way to do that is to use the command line.
If you are not familiar with the command line, you might want to run the notebooks on Colab.
\begin{enumerate}
\item After downloading my repository, you should have a directory named \py{ThinkBayes2}. Use \py{cd} to move into that directory.
\item Use \py{ls} to confirm that you have a file named \py{environment.yml}. It lists the packages you need.
\item Run the following command to create an environment:
\begin{verbatim}
conda env create -f environment.yml
\end{verbatim}
\item Run the following command to activate the environment you just created:
\begin{verbatim}
conda activate ThinkBayes2
\end{verbatim}
\item To test your environment and make sure it has everything we need, run the following command:
\begin{verbatim}
python test_env.py
\end{verbatim}
\end{enumerate}
If you don't want to create an environment just for this book, you can install what you need using Conda.
The following commands should get everything you need:
\begin{verbatim}
conda install python jupyter pandas scipy matplotlib
pip install empiricaldist
\end{verbatim}
If you don't want to use Anaconda, you will need the following
packages:
\begin{itemize}
\item Jupyter to run the notebooks, \url{https://jupyter.org/};
\index{Jupyter}
\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}
\item SciPy for scientific computation, \url{http://www.scipy.org/};
\index{SciPy}
\item Pandas for working with data, \url{https://pandas.pydata.org/};
\index{Pandas}
\item matplotlib for visualization, \url{http://matplotlib.org/};
\index{matplotlib}
\item empiricaldist for representing distributions, \url{}.
\index{empiricaldist}
\end{itemize}
Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}
\section{Code style}
Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}
Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".
I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.
Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.
And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.
\section{Prerequisites}
There are several excellent modules for doing Bayesian statistics in
Python, including \py{pymc} and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.
Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code. Chapter~\ref{compstat} introduces \py{Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces \py{Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.
In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).
\section*{Contributor List}
If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}
If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!
\small
\begin{itemize}
\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods. With his
permission, I use several problems from
his book as examples.
\item This book also benefited from my interactions with Sanjoy
Mahajan, especially in fall 2012, when I audited his class on
Bayesian Inference at Olin College.
\item I wrote parts of this book during project nights with the Boston
Python User Group, so I would like to thank them for their
company and pizza.
\item Olivier Yiptong sent several helpful suggestions.
\item Yuriy Pasichnyk found several errors.
\item Kristopher Overholt sent a long list of corrections and suggestions.
\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.
\item Markus Dobler pointed out that drawing cookies from a bowl
with replacement is an unrealistic scenario.
\item In spring 2013, students in my class, Computational Bayesian
Statistics, made many helpful corrections and suggestions: Kai
Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun
Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,
Brendan Ritter, and Evan Simpson.
\item Greg Marra and Matt Aasted helped me clarify the discussion of
{\it The Price is Right} problem.
\item Marcus Ogren pointed out that the original statement of the
locomotive problem was ambiguous.
\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the
book and found many opportunities for improvement.
\item Linda Pescatore found a typo and made some helpful suggestions.
\item Tomasz Miasko sent many excellent corrections and suggestions.
\end{itemize}
Other people who spotted typos and small errors include
Tom Pollard,
Paul A. Giannaros,
Jonathan Edwards,
George Purkins,
Robert Marcus,
Ram Limbu,
James Lawry,
Ben Kahle,
Jeffrey Law, and
Alvaro Sanchez.
\normalsize
\newpage
\begin{latexonly}
\tableofcontents
\newpage
\end{latexonly}
\mainmatter
\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}
\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}
\chapter{Bayes's Theorem}
\label{intro}
\section{Conditional probability}
The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}
A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction. The value
1 represents certainty that a fact is true, or that a prediction
will come true. The value 0 represents certainty
that the fact is false.
\index{degree of belief}
Intermediate values represent degrees of certainty. The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not.
For example, the probability that a tossed coin lands ``heads'' is close to 50\%.
\index{coin toss}
A conditional probability is a probability based on some relevant information. For example, suppose I toss two coins.
The probability that both coins land heads is 25\%.
But suppose I toss two coins and, without showing you the result, tell you that at least one of the coins is heads.
What is the probability that both are heads?
The answer is 1/3.
Here's how I got that: when I toss the coins, there are four equally likely outcomes: heads-heads, heads-tails, tails-heads, and tails-tails.
When I tell you that at least one coin is heads, that eliminates one outcome, tails-tails.
The remaining outcomes are heads-heads, heads-tails, and tails-heads, and they are still equally likely.
So the probability of heads-heads is 1/3.
That argument is correct, but if you don't find it entirely convincing, we'll come back to this problem and solve it more carefully using Bayes's theorem.
In this example, we computed the conditional probability of two heads, given the information that at least one coin is heads.
The usual notation for conditional probability is $\p{A|B}$, which
is the probability of $A$ given that $B$ is true. In this
example, $A$ is the outcome that both coins are heads, and $B$ is the condition that at least one coin is heads.
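The counting argument for the two-coin example can be checked by enumeration.
The following is a small sketch, not code from the notebooks for this book; the variable names are chosen for illustration.
\begin{code}
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of tossing two coins
outcomes = list(product('HT', repeat=2))

# B: at least one coin is heads
given = [o for o in outcomes if 'H' in o]

# A: both coins are heads, counted among the outcomes where B is true
both = [o for o in given if o == ('H', 'H')]

p_both_given_one = Fraction(len(both), len(given))
print(p_both_given_one)    # 1/3
\end{code}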
\section{Conjoint probability}
{\bf Conjoint probability} is a fancy way to say the probability that
two things are true. I'll use the notation $\p{A \AND B}$ to mean the
probability that $A$ and $B$ are both true.
\index{conjoint probability}
\index{probability!conjoint}
If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.
But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the first outcome does
not change the probability of the second. Or, more formally,
$\p{B|A} = \p{B}$.
\index{independence}
\index{dependence}
Here is a different example where the outcomes are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow. If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.
In general, the probability of a conjunction is
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
for any $A$ and $B$. So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.
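To make the rain example concrete, here is a sketch with one made-up number.
The conditional probability 0.6 is an assumption for illustration only; the example above says only that it is higher than 0.5.
\begin{code}
p_rain_today = 0.5

# Assumed value, for illustration only
p_rain_tomorrow_given_today = 0.6

# If the two days were independent, we would multiply
# the unconditioned probabilities
p_independent = p_rain_today * p_rain_today

# The general rule uses the conditional probability instead
p_both_days = p_rain_today * p_rain_tomorrow_given_today

print(p_independent, p_both_days)
\end{code}
With these numbers the probability of rain on both days is 0.3, a bit higher than the 0.25 we would get by assuming independence.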
\section{The cookie problem}
\label{cookie}
\index{Bayes's theorem}
\index{cookie problem}
We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.}
\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.
The cookie is vanilla.
What is the probability that it came from Bowl 1?
\end{quote}
This is a conditional probability; we want $\p{\T{Bowl 1} |
\T{vanilla}}$, but it is not obvious how to compute it. If I asked a
different question---the probability of a vanilla cookie given Bowl
1---it would be easy:
\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]
Sadly, $\p{A|B}$ is {\em not} the same as $\p{B|A}$, but there
is a way to get from one to the other: Bayes's theorem.
\section{Bayes's theorem}
\index{Bayes's theorem!derivation}
\index{conjunction}
Here's how we derive Bayes's theorem.
We'll start with the probability of a conjunction:
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
Since we have not said anything about what $A$ and $B$ mean, they
are interchangeable.
Interchanging them yields
\[ \p{B \AND A} = \p{B}~\p{A|B} \]
Also, conjunction is commutative; that is
\[ \p{A \AND B} = \p{B \AND A} \]
That's all we need. Pulling those pieces together, we get
\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]
This means there are two ways to compute the conjunction.
If you have $\p{A}$, you multiply by the conditional
probability $\p{B|A}$.
Or you can do it the other way around; if you
know $\p{B}$, you multiply by $\p{A|B}$.
Finally we divide through by $\p{B}$:
\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]
And that's Bayes's theorem! It might not look like much, but
it turns out to be surprisingly powerful.
For example, we can use it to solve the cookie problem. I'll write
$B_1$ for the hypothesis that the cookie came from Bowl 1
and $V$ for the vanilla cookie. Plugging these into Bayes's theorem,
we get
\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]
The term on the left is what we want: the probability of Bowl 1, given
that we chose a vanilla cookie. The terms on the right are:
\begin{itemize}
\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned by what kind of cookie we got. Since the problem says we chose a bowl at random, we can assume $\p{B_1} = 1/2$.
\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.
\item $\p{V}$: This is the probability of drawing a vanilla cookie from
either bowl. Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie. Between the two bowls there are
50 vanilla and 30 chocolate cookies, so $\p{V} = 5/8$.
\end{itemize}
Putting it together, we have
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
which reduces to 3/5. So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
\index{evidence}
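The same arithmetic can be checked with exact fractions.
This is a minimal sketch using \py{Fraction} from the Python standard library; the variable names are chosen for illustration.
\begin{code}
from fractions import Fraction

p_b1 = Fraction(1, 2)             # prior: probability of choosing Bowl 1
p_v_given_b1 = Fraction(30, 40)   # likelihood: vanilla given Bowl 1
p_v = Fraction(50, 80)            # vanilla cookies among all 80 cookies

# Bayes's theorem
p_b1_given_v = p_b1 * p_v_given_b1 / p_v
print(p_b1_given_v)    # 3/5
\end{code}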
This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from \p{B|A} to \p{A|B}. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.
\section{The diachronic interpretation}
There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.
\index{diachronic interpretation}
This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}. ``Diachronic'' means that something
is happening over time; in this case, the probability of the hypothesis changes over time as we see new data.
Rewriting Bayes's theorem with $H$ and $D$ yields:
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
In this interpretation, each term has a name:
\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}
\begin{itemize}
\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.
\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.
\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.
\item \p{D} is the {\bf total probability of the data}, under any hypothesis.
\end{itemize}
Sometimes we can compute the prior based on background information. For example, the cookie problem specifies that we choose a bowl at random with equal probability.
In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.
\index{subjective prior}
The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.
Computing the total probability of the data can be tricky. It is supposed to be the probability of seeing the data under any hypothesis at all, but in the most general case it is hard to nail down what that means.
Most often we simplify things by specifying a set of hypotheses
that are:
\index{mutually exclusive}
\index{collectively exhaustive}
\begin{description}
\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and
\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.
\end{description}
In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.
\index{total probability}
In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
Plugging in the values from the cookie problem, we have
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
which is what we computed earlier by mentally combining the two
bowls.
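The law of total probability translates directly into a sum over the hypotheses.
The following sketch uses dictionaries keyed by hypothesis, which is an assumption made here for illustration, and exact fractions so the result can be compared with the value above.
\begin{code}
from fractions import Fraction

priors = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
likelihoods = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}

# The priors of mutually exclusive, collectively exhaustive
# hypotheses add up to 1
assert sum(priors.values()) == 1

# Add up prior * likelihood over the hypotheses
p_data = sum(priors[h] * likelihoods[h] for h in priors)
print(p_data)    # 5/8
\end{code}
The same sum works for any number of hypotheses, as long as they are mutually exclusive and collectively exhaustive.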
\section{Bayes tables}
In the cookie problem we can compute the probability of the data directly, but that's not always the case. In fact, computing the total probability of the data is often the hardest part of the problem.
Fortunately, there is another way to solve problems like this that makes it easier: the Bayes table.
You can write a Bayes table on paper or use a spreadsheet, but for this example I'll use a Pandas DataFrame.
First I'll make an empty DataFrame with one row for each hypothesis:
\begin{code}
import pandas as pd
table = pd.DataFrame(index=['Bowl 1', 'Bowl 2'])
\end{code}
Then I'll add columns for the prior probabilities and likelihoods.
\begin{code}
table['prior'] = 1/2, 1/2
table['likelihood'] = 3/4, 1/2
\end{code}
This table shows the results so far:
\input{tables/table01-01}
If we multiply the priors by the likelihoods, the results are {\bf unnormalized posteriors}; they are proportional to the posterior probabilities, but they don't add up to 1.
We can normalize them by computing the total probability of the data and dividing through.
\begin{code}
table['unnorm'] = table['prior'] * table['likelihood']
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
\end{code}
The following table shows the result:
\input{tables/table01-02}
The posterior probability for Bowl 1 is 0.6, which is what we got using Bayes's theorem. As a bonus, we also get the posterior probability for Bowl 2, which is 0.4.
\section{The dice problem}
\label{dice}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?
In this example, there are three hypotheses with equal prior probabilities.
The data is my report that the outcome is a 1.
Under the hypothesis that I chose the 6-sided die, the probability of the data is 1/6.
If I chose the 8-sided die, the probability is 1/8, and if I chose the 12-sided die, it's 1/12.
Plugging the priors and likelihoods into a Bayes table, I get these results:
\input{tables/table01-03}
The posterior probability that I chose the 6-sided die is $4/9$.
As this example demonstrates, the table method works with more than two hypotheses.
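Here is one way the Bayes table for the dice problem might be built, following the same steps as the cookie problem.
Labeling the rows with the number of sides is a choice made for this sketch.
\begin{code}
import pandas as pd

# One row for each hypothesis, labeled by the number of sides
table = pd.DataFrame(index=[6, 8, 12])
table['prior'] = [1/3, 1/3, 1/3]
table['likelihood'] = [1/6, 1/8, 1/12]

# Multiply priors by likelihoods, then normalize
table['unnorm'] = table['prior'] * table['likelihood']
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data

print(table)
\end{code}
The posterior probabilities are $4/9$, $3/9$, and $2/9$, and the total probability of the data is $1/8$.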
\section{The Monty Hall problem}
\index{Monty Hall problem}
Monty Hall was the original host of the game show {\em Let's Make a
Deal}.
The Monty Hall problem is based on one of the regular
games on the show.
If you are a contestant, here's how the game works:
\begin{itemize}
\item Monty shows you three closed doors numbered 1, 2, and 3.
He tells you that there is a prize behind each door.
\item One prize is valuable (traditionally a car); the other two are less valuable (traditionally goats).
\item The object of the game is to guess which door has the car.
If you guess right, you get to keep the car.
\end{itemize}
Suppose you pick Door 1.
Before opening the door you chose, Monty opens Door 3 and reveals a
goat.
Then Monty offers you the option to stick with your original
choice or switch to the remaining unopened door.
To maximize your chance of winning the car, should you stick with Door 1 or switch to Door 2?
To answer this question, we have to make some assumptions about the behavior of the host:
\begin{enumerate}
\item Monty always opens a door and offers you the option to switch.
\item He never opens the door you picked or the door with the car.
\item If you choose the door with the car, he chooses one of the other doors at random.
\end{enumerate}
Under these assumptions, you are better off switching.
If you stick, you win $1/3$ of the time.
If you switch, you win $2/3$ of the time.
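One way to check these numbers is to simulate the game many times.
The following is a rough sketch under the three assumptions above; the function \py{play}, the seed, and the number of games are choices made for this example.
It averages over whichever door Monty opens, which by symmetry gives the same win rates as the case where he opens Door 3.
\begin{code}
import random

def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = 1    # the contestant picks Door 1
    # Monty opens a door that is neither the pick nor the car;
    # if there are two such doors, he chooses one at random.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one door that is neither picked nor opened
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(17)    # arbitrary seed so runs are repeatable
n = 10_000
stick_wins = sum(play(False, rng) for _ in range(n)) / n
switch_wins = sum(play(True, rng) for _ in range(n)) / n
print(stick_wins, switch_wins)
\end{code}
The fraction of wins should be near $1/3$ for sticking and near $2/3$ for switching, with some variation from run to run.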
If you have not encountered this problem before, you might find the answer surprising.
You would not be alone; many people have the strong intuition that it doesn't matter if you stick or switch.
There are two doors left, they reason, so the chance that the car
is behind either one is 50\%.
But that is wrong.
To see why, it might help to use a Bayes table.
We start with three hypotheses: the car might be behind Door 1, 2, or 3.
According to the statement of the problem, the prior probability for each door is 1/3.
The data is that Monty opened Door 3 and revealed a goat.
So let's consider the probability of the data under each hypothesis:
\begin{itemize}
\item If the car were behind Door 3, Monty would not have opened it, so the probability of the data under this hypothesis is 0.
\item If the car were behind Door 2, Monty would have to open Door 3, so the probability of the data under this hypothesis is 1.
\item If the car were behind Door 1, Monty would choose Door 2 or 3 at random; the probability he would open Door 3 is $1/2$.
\end{itemize}
Once we figure out prior probabilities and likelihoods, the Bayes table does the rest. Here is the result:
\input{tables/table01-04}
After Monty opens Door 3, the posterior probability of Door 1 is $1/3$; the posterior probability of Door 2 is $2/3$.
\index{divide-and-conquer}
As this example shows, our intuition for probability is not always reliable.
Bayes's Theorem provides a divide-and-conquer strategy that can help:
\begin{enumerate}
\item First, write down the hypotheses and the data.
\item Next, figure out the prior probabilities.
\item Finally, compute the likelihood of the data under each hypothesis.
\end{enumerate}
The Bayes table does the rest.
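The three steps above can be sketched in plain Python for the Monty Hall problem.
This is a minimal illustration, not the code used to generate the table: it assumes you picked Door 1 and Monty opened Door 3, and the names \py{hypos}, \py{prior}, \py{likelihood}, and \py{posterior} are chosen here for convenience.
It uses \py{Fraction} from the standard library so the results are exact.

```python
from fractions import Fraction

# Step 1: the hypotheses are the doors that might hide the car.
hypos = ['Door 1', 'Door 2', 'Door 3']

# Step 2: the car is equally likely to be behind each door.
prior = {h: Fraction(1, 3) for h in hypos}

# Step 3: likelihood of the data (Monty opens Door 3) under each
# hypothesis, given that you picked Door 1.
likelihood = {
    'Door 1': Fraction(1, 2),  # Monty chooses Door 2 or 3 at random
    'Door 2': Fraction(1),     # Monty has to open Door 3
    'Door 3': Fraction(0),     # Monty never reveals the car
}

# The rest of the Bayes table: multiply, add up, and divide.
unnorm = {h: prior[h] * likelihood[h] for h in hypos}
prob_data = sum(unnorm.values())
posterior = {h: unnorm[h] / prob_data for h in hypos}

for h in hypos:
    print(h, posterior[h])
```

The posterior for Door 2 is twice the posterior for Door 1, which is why switching is the better strategy under these assumptions.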
\section{Summary}
In this chapter...
In the next chapter
But first you might want to work on these exercises.
\section{Exercises}
The code for this chapter is in \py{chap01.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap01.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Suppose you have two coins in a box.
One is a normal coin with heads on one side and tails on the other, and one is a trick coin with heads on both sides.
You choose a coin at random and see that one of the sides is heads.
What is the probability that you chose the trick coin?
\end{exercise}
\begin{exercise}
Suppose you meet someone and learn that they have two children.
You ask if either child is a girl and they say yes.
What is the probability that both children are girls?
Hint: Start with four equally likely hypotheses.
\end{exercise}
\begin{exercise}
There are many variations of the Monty Hall problem (see \url{https://en.wikipedia.org/wiki/Monty_Hall_problem}).
For example, suppose that Monty always chooses Door 2 if he can and
only chooses Door 3 if he has to (because the car is behind Door 2).
If you choose Door 1 and Monty opens Door 2, what is the probability the car is behind Door 3?
If you choose Door 1 and Monty opens Door 3, what is the probability the car is behind Door 2?
\end{exercise}
\newcommand{\MM}{M\&M}
\begin{exercise}
\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}
In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.
Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag. One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?
\end{exercise}
\chapter{Computational Statistics}
\label{compstat}
\section{Distributions}
\label{distributions}
In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}
For example, if you toss a coin, there are two possible outcomes with approximately equal probabilities.
If you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}
To represent distributions, we'll use a library called \py{empiricaldist}.
An ``empirical'' distribution is based on data, as opposed to a theoretical distribution.
This library provides a class called \py{Pmf}, which represents
a {\bf probability mass function}.
\index{probability mass function}
\index{Pmf class}
\py{empiricaldist} is available from the Python Package Index (PyPI).
You can download it from \url{https://pypi.org/project/empiricaldist/} or install it with \py{pip}.
For more details, see Section~\ref{codeinfo}.
To use \py{Pmf} you can import it like this:
\begin{code}
from empiricaldist import Pmf
\end{code}
The following example makes a \py{Pmf} that represents the outcome of a coin toss.
\begin{code}
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
\end{code}
The two outcomes have the same probability, $1/2$.
This example makes a \py{Pmf} that represents the distribution
of outcomes of a six-sided die:
\begin{code}
die = Pmf()
for x in [1,2,3,4,5,6]:
    die[x] = 1
\end{code}
\py{Pmf()} creates an empty \py{Pmf} with no values.
The \py{for} loop adds the values $1$ through $6$, each with ``probability'' $1$.
In this \py{Pmf}, the probabilities don't add up to 1, so they are not really probabilities.
We can use \py{normalize} to make them add up to 1.
\begin{code}
die.normalize()
\end{code}
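To make the idea of normalizing concrete, here is a rough sketch of the same operation on a plain Python dictionary.
It is an analogy, not the implementation in \py{empiricaldist}; the function name \py{normalize} and the choice to return the total are assumptions made for this example.

```python
def normalize(d):
    """Divide each value in d by the total, in place; return the total."""
    total = sum(d.values())
    for key in d:
        d[key] /= total
    return total

# Six outcomes, each with unnormalized "probability" 1.
die = {x: 1 for x in range(1, 7)}
total = normalize(die)

print(total)
print(die)
```

After the call, each of the six values is $1/6$ and the values add up to 1.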
Another way to make a \py{Pmf} is to provide a sequence of values.
\begin{code}
die = Pmf.from_seq([1,2,3,4,5,6])
\end{code}
In this example, every value appears once, so they all have the same probability.
More generally, values can appear more than once, as in this example:
\begin{code}
letters = Pmf.from_seq(list('Mississippi'))
\end{code}
The following table shows the results.
\input{tables/table02-01}
The \py{qs} are the values or ``quantities'' in the distribution; the \py{ps} are the corresponding probabilities. In the word ``Mississippi'', about 36\% of the letters are ``s''.
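If you want to check these probabilities without \py{empiricaldist}, the following sketch counts the letters with \py{Counter} from the standard library and divides by the total.
The names \py{counts} and \py{freqs} are introduced here for illustration.

```python
from collections import Counter

word = 'Mississippi'
counts = Counter(word)          # how many times each letter appears
total = sum(counts.values())    # number of letters in the word

# Relative frequency of each letter, sorted by letter.
freqs = {letter: counts[letter] / total for letter in sorted(counts)}

for letter, p in freqs.items():
    print(letter, round(p, 3))
```

The letter ``s'' appears 4 times out of 11, which is where the figure of about 36\% comes from.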
The \py{Pmf} class inherits from a Pandas \py{Series}, so anything you can do with a \py{Series}, you can also do with a \py{Pmf}.
For example, you can use the bracket operator to look up a value and return the corresponding probability.
\begin{code}
letters['s']
\end{code}
However, if you ask for the probability of a value that's not in the distribution, you get a \py{KeyError}.
You can also call a \py{Pmf} as if it were a function, with a value in parentheses.
\begin{code}
letters('s')
\end{code}
If the value is in the distribution, the results are the same.
But if the value is not in the distribution, the result is $0$, not an error.
As these examples show, the values in a \py{Pmf} can be integers or strings.
In general, they can be any type that can be stored in the index of a Pandas Series.
If you are familiar with Pandas, that will help you work with \py{Pmf} objects.
But I will explain what you need to know as we go along.
\section{The Cookie Problem}
In this section I'll use a \py{Pmf} to solve the cookie problem from Section~\ref{cookie}.
Here's the statement of the problem again:
\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla. What is
the probability that it came from Bowl 1?
\end{quote}
Here's a \py{Pmf} that represents the two hypotheses and their prior probabilities:
\index{cookie problem}
\begin{code}
prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
\end{code}
This distribution, which contains the prior probability for each hypothesis,
is called (wait for it) the {\bf prior distribution}.
\index{prior distribution}
To update the distribution based on new data (the vanilla cookie),
we multiply the priors by the likelihoods. The likelihood
of drawing a vanilla cookie from Bowl 1 is 3/4. The likelihood
for Bowl 2 is 1/2.
\begin{code}
likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
\end{code}
The result is the unnormalized posteriors.
We can use \py{normalize} to compute the posterior probabilities:
\begin{code}
posterior.normalize()
\end{code}
The return value from \py{normalize} is the total probability of the data, which is $5/8$.
Finally, we can get the posterior probability for Bowl 1:
\begin{code}
posterior('Bowl 1')
\end{code}
And the answer is 0.6.
This distribution, which contains the posterior probability for each hypothesis, is called (wait now) the {\bf posterior distribution}.
\index{posterior distribution}
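As a check on the arithmetic, here is a sketch of the same update with exact fractions and plain dictionaries.
It does not use \py{Pmf}; the names \py{unnorm} and \py{prob\_data} are chosen for this example.

```python
from fractions import Fraction

prior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
likelihood_vanilla = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}

# Multiply priors by likelihoods to get the unnormalized posteriors.
unnorm = {bowl: prior[bowl] * likelihood_vanilla[bowl] for bowl in prior}

# The sum is the total probability of the data.
prob_data = sum(unnorm.values())

# Divide through to get the posterior probabilities.
posterior = {bowl: p / prob_data for bowl, p in unnorm.items()}

print(prob_data)
print(posterior['Bowl 1'])
```

The total probability of the data comes out to $5/8$ and the posterior probability of Bowl 1 to $3/5$, in agreement with the results above.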
One benefit of using \py{Pmf} objects is that it is easy to do successive updates with more data.
For example, suppose you put the first cookie back (so the contents of the bowls don't change) and draw again from the same bowl.
If the second cookie is also vanilla, we can do a second update like this:
\begin{code}
posterior *= likelihood_vanilla
posterior.normalize()
\end{code}
Now the posterior probability for Bowl 1 is almost 70\%.
But suppose we do the same thing again and get a chocolate cookie.
Here's the update.
\begin{code}
likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
\end{code}
Now the posterior probability for Bowl 1 is about 53\%.
After two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.
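The sequence of updates can also be traced with exact fractions.
The following is a sketch under the same assumptions (the cookie is replaced after each draw); the helper \py{update} and the list \py{history} are hypothetical names used only here.

```python
from fractions import Fraction

def update(dist, likelihood):
    """Return a new normalized distribution after one Bayesian update."""
    unnorm = {h: p * likelihood[h] for h, p in dist.items()}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

likelihood = {
    'vanilla':   {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)},
    'chocolate': {'Bowl 1': Fraction(1, 4), 'Bowl 2': Fraction(1, 2)},
}

dist = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
history = []
for cookie in ['vanilla', 'vanilla', 'chocolate']:
    dist = update(dist, likelihood[cookie])
    history.append(dist['Bowl 1'])

for p in history:
    print(p, float(p))
```

The posterior probability of Bowl 1 goes from $3/5$ to $9/13$ (almost 70\%) and then to $9/17$ (about 53\%).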
\section{More Bowls}
\label{morebowls}
Next let's solve a cookie problem with 101 bowls:
\begin{itemize}
\item Bowl 0 contains no vanilla cookies,
\item Bowl 1 contains 1\% vanilla cookies,
\item Bowl 2 contains 2\% vanilla cookies,
\end{itemize}
and so on, up to
\begin{itemize}
\item Bowl 99 contains 99\% vanilla cookies, and
\item Bowl 100 contains all vanilla cookies.
\end{itemize}
As in the previous version, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99\% chocolate, and so on.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig02-01.pdf}}
\caption{Prior and posterior distributions for the 101 Bowls problem.}
\label{fig02-01}
\end{figure}
Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl \py{x}, for each value of \py{x}?
To solve this problem, I'll use \py{np.arange} to represent 101 hypotheses, numbered from 0 to 100.
\begin{code}
import numpy as np
hypos = np.arange(101)
\end{code}
The result is a NumPy array, which we can use to make the prior distribution:
\begin{code}
prior = Pmf(1, hypos)
prior.normalize()
\end{code}
As this example shows, we can initialize a \py{Pmf} with two parameters.
The first parameter is the prior probability; the second parameter is a sequence of values.
Because the probabilities are all the same, we only have to provide one of them.
It gets ``broadcast'' across the hypotheses.
Since all hypotheses have the same prior probability, this distribution is {\bf uniform}.
The likelihood of the data is the fraction of vanilla cookies in each bowl, which we can calculate using \py{hypos}:
\begin{code}
likelihood_vanilla = hypos/100
\end{code}
Now we can compute the posterior distribution in the usual way:
\begin{code}
posterior1 = prior * likelihood_vanilla
posterior1.normalize()
\end{code}
Figure~\ref{fig02-01} (top) shows the prior distribution and the posterior distribution after one vanilla cookie.
Bowl 0 has been eliminated, because it contains no vanilla cookies, and Bowl 100 is the most likely.
The posterior distribution is a line because the likelihoods are proportional to the bowl numbers.
Now suppose we put the cookie back, draw again from the same bowl, and get another vanilla cookie.
Here's the update after the second cookie:
\begin{code}
posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()
\end{code}
Figure~\ref{fig02-01} (middle) shows the result.
Because the likelihood function is a line, the posterior after two cookies is a parabola.
At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.
But suppose we draw again and get a chocolate cookie.
Here's the update:
\begin{code}
likelihood_chocolate = 1 - hypos/100
posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()
\end{code}
Figure~\ref{fig02-01} (bottom) shows the result.
Now Bowl 100 has been eliminated because it contains no chocolate cookies.
But the high-numbered bowls are still more likely than the low-numbered bowls, because we have seen more vanilla cookies than chocolate.
In fact, the peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.
The quantity with the highest posterior probability is called the {\bf MAP}, which stands for ``maximum a posteriori probability'', where ``a posteriori'' is unnecessary Latin for ``posterior''.
To compute the MAP, we can use the \py{Series} method \py{idxmax}:
\begin{code}
posterior3.idxmax()
\end{code}
Or \py{Pmf} provides a more memorable name for the same thing:
\begin{code}
posterior3.max_prob()
\end{code}
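To see why the peak lands at Bowl 67, here is a sketch of the three updates in plain Python.
It assumes integer weights in place of the likelihoods: a vanilla cookie contributes a factor of \py{x} and a chocolate cookie a factor of \py{100 - x}, and the constant factors of $1/100$ are dropped because they cancel when we normalize.
The name \py{map\_bowl} is introduced here for illustration.

```python
# Hypotheses: bowl x contains x percent vanilla cookies, for x = 0..100.
hypos = range(101)

# Uniform prior, written as integer weights to keep the arithmetic exact.
weights = [1 for x in hypos]

# Two vanilla cookies, then one chocolate cookie.
for x in hypos:
    weights[x] *= x * x * (100 - x)

total = sum(weights)
posterior = [w / total for w in weights]

# MAP: the bowl with the highest posterior probability.
map_bowl = max(hypos, key=lambda x: posterior[x])
print(map_bowl)
```

The posterior is proportional to $x^2 (100 - x)$, which is zero at both ends and largest near $x = 200/3$; among the whole-number bowls, the maximum is at Bowl 67.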
As you might suspect, this example isn't really about bowls; it's about estimating proportions.
Imagine that you have one bowl of cookies.
You don't know what fraction of cookies are vanilla, but you think it is equally likely to be any fraction from 0 to 1.
If you draw three cookies and two are vanilla, what proportion of cookies in the bowl do you think are vanilla?
The posterior distribution we just computed is the answer to that question.
We'll come back to estimating proportions in the next chapter.
But first let's use a \py{Pmf} to solve the dice problem.
\section{The Dice Problem}
In Section~\ref{dice} we solved the dice problem using a Bayes table.
Here's the statement of the problem again:
\begin{quote}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?
\end{quote}
Let's solve it again using a \py{Pmf}.
I'll use integers to represent the hypotheses:
\begin{code}
hypos = [6, 8, 12]
\end{code}
And I can make the prior distribution like this:
\begin{code}
prior = Pmf(1/3, hypos)
\end{code}
As in the previous example, the prior probability gets broadcast across the hypotheses.
Now we can compute the likelihood of the data:
\begin{code}
likelihood1 = 1/6, 1/8, 1/12
\end{code}
And use it to compute the posterior distribution.
\begin{code}
posterior = prior * likelihood1
posterior.normalize()
\end{code}
Here's the result:
\input{tables/table02-02}
The posterior probability for the 6-sided die is $4/9$.
Now suppose I roll the same die again and get a $7$.
We can do a second update like this:
\begin{code}
likelihood2 = 0, 1/8, 1/12
posterior *= likelihood2
posterior.normalize()
\end{code}
The likelihood for the 6-sided die is $0$ because it is not possible to get a 7 on a 6-sided die.
The other two likelihoods are the same as in the previous update.
And here's the result:
\input{tables/table02-03}
After rolling a 1 and a 7, the posterior probability of the 8-sided die is about 69\%.
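Both updates can be checked by hand.
The following worked calculation is a sketch that uses only the priors and likelihoods given above; the notation $P(6 \mid 1)$, meaning the probability of the 6-sided die given a roll of 1, is introduced here for convenience.
The equal priors of $1/3$ cancel, so after the first roll
\[ P(6 \mid 1) = \frac{1/6}{1/6 + 1/8 + 1/12} = \frac{4/24}{9/24} = \frac{4}{9} \]
After the second roll the 6-sided die is ruled out, and each remaining die contributes the product of its two likelihoods:
\[ P(8 \mid 1, 7) = \frac{(1/8)^2}{(1/8)^2 + (1/12)^2} = \frac{1/64}{13/576} = \frac{9}{13} \approx 0.69 \]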
\section{Updating Dice}
\label{dice2}
The following function is a more general version of the update in the previous section:
\begin{code}
def update_dice(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
\end{code}
The first parameter is a \py{Pmf} that represents the possible dice and their probabilities.
The second parameter is the outcome of rolling a die.
The first line selects \py{qs} from the \py{Pmf}, which is the index of the \py{Series}; in this example, it represents the hypotheses.
Since the hypotheses are integers, we can use them to compute the likelihoods.
In general, if there are \py{n} sides on the die, the probability of any possible outcome is \py{1/n}.
However, we have to check for impossible outcomes!
If the outcome exceeds the hypothetical number of sides on the die, the probability of that outcome is $0$.
\py{impossible} is a Boolean array that is \py{True} for each die that cannot produce the observed outcome.
I use it as an index into \py{likelihood} to set the corresponding probabilities to $0$.
Finally, I multiply \py{pmf} by the likelihoods and normalize.
Here's how we can use this function to compute the updates in the previous section:
\begin{code}
pmf = prior.copy()
update_dice(pmf, 1)
update_dice(pmf, 7)
\end{code}
I start with a fresh copy of the prior distribution and use \py{update_dice} to do the updates.
The result is the same.
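For comparison, here is a sketch of the same logic without NumPy or \py{Pmf}, using a dictionary that maps the number of sides to a probability.
The function name \py{update\_dice\_dict} is hypothetical; it is meant to show the steps one hypothesis at a time, not to replace the vectorized version above.

```python
from fractions import Fraction

def update_dice_dict(dist, data):
    """Update a dict that maps number of sides to probability, in place."""
    for sides in dist:
        # An outcome larger than the number of sides is impossible.
        likelihood = Fraction(1, sides) if data <= sides else 0
        dist[sides] *= likelihood
    total = sum(dist.values())
    for sides in dist:
        dist[sides] /= total

dist = {sides: Fraction(1, 3) for sides in [6, 8, 12]}
update_dice_dict(dist, 1)
after_one = dict(dist)      # keep a copy of the first posterior
update_dice_dict(dist, 7)

print(after_one)
print(dist)
```

With exact fractions, the first posterior for the 6-sided die is $4/9$ and the final posterior for the 8-sided die is $9/13$, which matches the results in the previous section.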
\section{Summary}
This chapter introduces the \py{empiricaldist} module, which provides \py{Pmf}, which we use to represent a set of hypotheses and their probabilities.
We use a \py{Pmf} to solve the cookie problem and the dice problem, which we saw in the previous chapter.
With a \py{Pmf} it is easy to perform sequential updates as we see multiple pieces of data.
We also solved a more general version of the cookie problem, with 101 bowls rather than two.
Then we computed the MAP, which is the quantity with the highest posterior probability.
In the next chapter ...
But first you might want to work on the exercises.
\section{Exercises}
\label{elvis}
The code for this chapter is in \py{chap02.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap02.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
\end{exercise}
\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7.
What is the probability that I chose the 8-sided die?
\end{exercise}
\begin{exercise}
In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die.
But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided.
I choose a die, roll it, and get a 7. What is the probability that I chose an 8-sided die?
\end{exercise}
\begin{exercise}
Suppose I have two sock drawers.
One contains equal numbers of black and white socks.
The other contains equal numbers of red, green, and blue socks.
Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair.
What is the probability that the socks are white?
For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.
\end{exercise}
\begin{exercise}
Here's a problem from {\it Bayesian Data Analysis}, which is available from \url{http://www.stat.columbia.edu/~gelman/book}:
\begin{quote}
Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?
\end{quote}
Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
\end{exercise}
\chapter{Estimation}
\label{more}
\section{The Euro problem}
\label{euro}
\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay poses this problem:
\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:
\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}
But do these data give evidence that the coin is biased rather than fair?
\end{quote}
To answer that question, we'll proceed in two steps.
First we'll use the binomial distribution to see where that 7\% came from; then we'll use Bayes's Theorem to estimate the probability that this coin comes up heads.
\section{The Binomial Distribution}
\label{binomial}
Suppose we have a coin that we know is fair; if we spin it once, the possible outcomes are heads and tails with equal probability.
I'll denote these outcomes \py{H} and \py{T}.
If you spin it twice, there are four outcomes with equal probability: \py{HH}, \py{HT}, \py{TH}, and \py{TT}.
If we add up the total number of heads, there are three possible outcomes: 0, 1, or 2. The probability of 0 and 2 is 25\%, and the probability of 1 is 50\%.
More generally, suppose the probability of heads is \py{p} and we spin the coin \py{n} times. What is the probability that we get a total of \py{k} heads?
The answer is given by the binomial distribution:
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
where $\binom{n}{k}$ is the {\bf binomial coefficient}, usually pronounced ``n choose k'' (see \url{https://en.wikipedia.org/wiki/Binomial_coefficient}).
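As a minimal sketch, the formula translates directly into Python using \py{math.comb} from the standard library (available in Python 3.8 and later).
The function name \py{binomial\_pmf} is chosen for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n spins when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 2
p = 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(probs)
```

For two spins of a fair coin, the probabilities of 0, 1, and 2 heads are 25\%, 50\%, and 25\%, as computed above by listing the outcomes.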
We can compute this expression ourselves, but we can also use the SciPy function \py{binom.pmf}:
\begin{code}
import numpy as np
from scipy.stats import binom
n = 2
p = 0.5
ks = np.arange(n+1)
a = binom.pmf(ks, n, p)
\end{code}
The return value is a NumPy array.
If we put it in a \py{Pmf}, the result is the distribution of \py{k} for the given values of \py{n} and \py{p}.
\begin{code}
pmf_k = Pmf(a, ks)
\end{code}
Here's what it looks like:
\input{tables/table02-01}
We can do the same calculation with \py{n=250}; Figure~\ref{fig03-01} shows the result.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig03-01.pdf}}
\caption{Binomial distribution with \py{n=250} and \py{p=0.5}}
\label{fig03-01}
\end{figure}
The most likely outcome is 125, which is \py{n*p}.
But the probability of getting exactly 125 heads is only about 5\%.
The probability of getting 140 heads, as in the Euro problem, is lower, around 0.8\%, but it is still possible even if the coin is fair.
In the article MacKay quotes, the statistician says, ``If the coin were unbiased the chance of getting a result as extreme as that would be less than 7\%''.
We can use the binomial distribution to check his math. The following function takes a PMF and computes the total probability of values greater than or equal to \py{threshold}.
\begin{code}
def ge_dist(pmf, threshold):
    ge = (pmf.index >= threshold)
    total = pmf[ge].sum()
    return total
\end{code}
We can call it like this:
\begin{code}
ge_dist(pmf_k, 140)
\end{code}
Or \py{Pmf} provides a function that computes the same thing:
\begin{code}
pmf_k.ge_dist(140)
\end{code}
Either way, the probability is about 3.3\% that we get 140 heads or more.
But that's less than 7\%.
The reason is that the statistician includes all values ``as extreme as'' 140, which includes values less than or equal to 110, because 140 exceeds the expected value by 15 and 110 falls short by 15.
The probability of values less than or equal to 110 is also 3.3\%,
so the total probability of values ``as extreme'' as 140 is 6.6\%.
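The two tail probabilities can be checked with exact integer arithmetic.
This sketch assumes a fair coin, so every sequence of 250 spins has probability $1/2^{250}$ and the probability of \py{k} heads is the binomial coefficient divided by $2^{250}$.
The helper \py{prob\_range} is a hypothetical name used only here.

```python
from math import comb

n = 250

def prob_range(lo, hi):
    """Probability that the number of heads is between lo and hi, inclusive,
    for n spins of a fair coin."""
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2**n

p_high = prob_range(140, n)   # 140 heads or more
p_low = prob_range(0, 110)    # 110 heads or fewer
p_extreme = p_high + p_low

print(p_high, p_low, p_extreme)
```

Because the binomial distribution with \py{p=0.5} is symmetric, the two tails are equal, and their sum falls just under the 7\% quoted in the article.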
The point of this calculation is that these extreme values are unlikely if the coin is fair.
And that's why the statistician concludes that the results are ``very suspicious''.
That's interesting, but it doesn't answer MacKay's question. So let's move on to the next step, estimating the proportion of heads.
\section{Estimating Proportions}
\label{estprop}
Any given coin has some probability of landing heads up when spun
on edge; I'll call this probability \py{x}.
It seems reasonable to believe that \py{x} depends
on physical characteristics of the coin, like the distribution
of weight.
If a coin is perfectly balanced, we expect \py{x} to be close to 50\%, but
for a lopsided coin, \py{x} might be substantially different. We can use
Bayes's theorem and the observed data to estimate \py{x}.
For simplicity, I'll start with a uniform prior, which assumes that all values of \py{x} are equally likely.
That might not be a reasonable assumption, so we'll come back and consider other priors later.
Here's the uniform prior:
\begin{code}
hypos = np.arange(0, 101)
prior = Pmf(1, hypos)
\end{code}
And here are the likelihoods:
\begin{code}
likelihood = {
    'H': hypos/100,
    'T': 1 - hypos/100
}
\end{code}
I put the likelihoods for heads and tails in a dictionary to make it easier to do the update.
To represent the data, I'll use a string where each character is \py{H} or \py{T}:
\begin{code}
dataset = 'H' * 140 + 'T' * 110
\end{code}
The following function does the update.
\begin{code}
def update_euro(pmf, dataset):
    for data in dataset:
        pmf *= likelihood[data]
    pmf.normalize()
\end{code}
The first argument is a \py{Pmf} that represents the prior.
The second argument is a sequence of outcomes; in this example, it is a string of \py{H} and \py{T} characters.
Each time through the loop, we multiply \py{pmf} by the likelihood of one outcome, heads or tails.
Notice that \py{normalize} is outside the loop, so the posterior distribution only gets normalized once, at the end.
That's more efficient than normalizing it after each spin (although we'll see later that it can also cause problems with floating-point arithmetic).
Here's how we do the update:
\begin{code}
posterior = prior.copy()
update_euro(posterior, dataset)
\end{code}
Figure~\ref{fig03-02} shows the posterior distribution of \py{x}.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig03-02.pdf}}
\caption{Posterior distribution of \py{x} after 140 heads in 250 spins.}
\label{fig03-02}
\end{figure}
Now, it's easy to get this distribution mixed up with the previous one, but remember:
\begin{itemize}
\item Figure~\ref{fig03-01} shows the distribution of \py{k}, which is the number of heads we get with \py{n=250} and \py{p=0.5}.
\item Figure~\ref{fig03-02} shows the posterior distribution of \py{x}, which is the proportion of heads for the coin we observed.
\end{itemize}
The posterior distribution represents our beliefs about \py{x} after seeing the data.
It indicates that values less than 40 and greater than 80 are unlikely; values between 50 and 60 are the most likely.
In fact, the most likely value for \py{x} is 56\%, which is the proportion of heads in the dataset, \py{140/250}.
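The location of the peak can be confirmed with a short sketch in plain Python.
With a uniform prior, the posterior is proportional to the likelihood of the whole dataset, $(x/100)^{140} (1 - x/100)^{110}$.
The sketch drops the constant factor $100^{250}$ and uses exact integer weights, which sidesteps floating-point underflow; the names \py{weights}, \py{map\_x}, and \py{mean\_x} are introduced here for illustration.

```python
heads, tails = 140, 110

# Hypotheses: x percent chance of heads, for x = 0..100.
hypos = range(101)

# Unnormalized posterior weights as exact (very large) integers.
weights = [x**heads * (100 - x)**tails for x in hypos]

total = sum(weights)
posterior = [w / total for w in weights]

# The hypothesis with the highest posterior probability, and the mean.
map_x = max(hypos, key=lambda x: posterior[x])
mean_x = sum(x * p for x, p in zip(hypos, posterior))

print(map_x, mean_x)
```

The maximum is at 56, and the posterior mean is close to it.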
\section{Triangle Prior}
\label{triangle}
So far we've been using a uniform prior, but that might not be a reasonable choice based on what we know about coins.
I can believe that if a coin is lopsided, \py{x} might deviate substantially from 50\%, but it seems unlikely that the Belgian Euro coin is so imbalanced that \py{x} is 10\% or 90\%.
It might be more reasonable to choose a prior that gives
higher probability to values of \py{x} near 50\% and lower probability
to extreme values.
\index{triangle distribution}
As an example, let's try a triangle-shaped prior.
Here's the code that constructs it:
\begin{code}
ramp_up = np.arange(50)
ramp_down = np.arange(50, -1, -1)
a = np.append(ramp_up, ramp_down)
triangle = Pmf(a, hypos, name='triangle')
triangle.normalize()
\end{code}
\py{arange} returns a NumPy array, so we can use \py{np.append} to append \py{ramp_down} to the end of \py{ramp_up}.
Then we use \py{a} and \py{hypos} to make a \py{Pmf}.
Figure~\ref{fig03-03} shows the result, along with the uniform distribution.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig03-03.pdf}}
\caption{Uniform and triangle-shaped prior distributions.}
\label{fig03-03}
\end{figure}
Now we can update both priors with the same data:
\begin{code}
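# uniform prior over the same hypotheses as the triangle prior
uniform = Pmf(1, hypos, name='uniform')
update_euro(uniform, dataset)
update_euro(triangle, dataset)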
\end{code}
Figure~\ref{fig03-04} shows the posterior distributions.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig03-04.pdf}}
\caption{Posterior distributions based on uniform and triangle priors.}
\label{fig03-04}
\end{figure}
The differences between the posterior distributions are barely visible, and so small they would hardly matter in practice.
And that's good news.
To see why, imagine two people who disagree angrily about which prior is better, uniform or triangle.
Each of them has reasons for their preference, but neither of them can persuade the other to change their mind.
But suppose they agree to use the data to update their beliefs.
When they compare their posterior distributions, they find that there is almost nothing left to argue about.
This is an example of {\bf swamping the priors}: with enough
data, people who start with different priors will tend to
converge on the same posterior distribution.
\index{swamping the priors}
\index{convergence}
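To put a number on ``barely visible'', here is a sketch that computes both posteriors in plain Python and compares them.
It assumes the same 101 hypotheses and the same data (140 heads, 110 tails), and it builds the triangle weights with \py{min(x, 100 - x)}, which gives the same values as the \py{ramp\_up} and \py{ramp\_down} arrays above.
The helper names \py{posterior} and \py{mean} are chosen for this example.

```python
heads, tails = 140, 110
hypos = range(101)

# Unnormalized prior weights: uniform, and a triangle that rises from 0
# at x=0 to 50 at x=50 and falls back to 0 at x=100.
uniform = [1 for x in hypos]
triangle = [min(x, 100 - x) for x in hypos]

def posterior(prior):
    """Posterior over hypos after the Euro data, using exact integer weights."""
    weights = [w * x**heads * (100 - x)**tails for x, w in zip(hypos, prior)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(dist):
    """Mean of a distribution over hypos."""
    return sum(x * p for x, p in zip(hypos, dist))

post_uniform = posterior(uniform)
post_triangle = posterior(triangle)

# Largest pointwise difference, and the difference between the means.
max_diff = max(abs(a - b) for a, b in zip(post_uniform, post_triangle))
mean_diff = mean(post_uniform) - mean(post_triangle)

print(max_diff, mean_diff)
```

The triangle prior pulls the posterior slightly toward 50, but the shift in the mean is a fraction of one percentage point.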
\section{Binomial Likelihood}
\label{binomlike}
So far we've been computing the updates one spin at a time, so for the Euro problem we have to do 250 updates.
A more efficient alternative is to compute the likelihood of the entire dataset at once.
For each hypothetical value of \py{x}, we have to compute the probability of getting 140 heads out of 250 spins.
Well, we know how to do that; this is the question the binomial distribution answers.
If the probability of heads is $p$, the probability of $k$ heads in $n$ spins is:
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
And we can use SciPy to compute it.
The following function takes a \py{Pmf} that represents a prior distribution and a tuple of integers, \py{k} and \py{n}:
\begin{code}
from scipy.stats import binom

def update_binomial(pmf, data):
    k, n = data
    # the hypotheses are percentages; binom.pmf expects probabilities
    xs = pmf.qs / 100
    likelihood = binom.pmf(k, n, xs)
    pmf *= likelihood
    pmf.normalize()
\end{code}
It extracts the hypothetical values of \py{x} from the \py{Pmf}, divides them by 100 to convert percentages to probabilities, and passes them to \py{binom.pmf}, which computes the binomial PMF for the given values of \py{k} and \py{n}, and all values of \py{x}.
Here's how we use it:
\begin{code}
uniform2 = Pmf(1, hypos)
data = 140, 250
update_binomial(uniform2, data)
\end{code}
The result is the same as in Section~\ref{estprop} except for a small floating-point round-off.
But it's much more efficient.
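To see why the two approaches agree, write $x$ for the probability of heads as a fraction between 0 and 1, and $P(x)$ for its prior probability (notation introduced here for this sketch).
Updating one spin at a time multiplies the prior by $x$ for each of the $k$ heads and by $1-x$ for each of the $n-k$ tails, so the unnormalized posterior is $P(x) \, x^k (1-x)^{n-k}$.
The binomial likelihood has the same form with one extra factor, $\binom{n}{k}$, which does not depend on $x$ and therefore cancels when we normalize:
\[ \frac{P(x) \binom{n}{k} x^k (1-x)^{n-k}}{\sum_{x'} P(x') \binom{n}{k} x'^k (1-x')^{n-k}} = \frac{P(x) \, x^k (1-x)^{n-k}}{\sum_{x'} P(x') \, x'^k (1-x')^{n-k}} \]
So the posterior distributions are the same; only the total probability of the data, which is the value returned by \py{normalize}, differs between the two methods.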
\section{Bayesian Statistics}
You might have noticed similarities between the Euro problem and the 101 bowls problem in Section~\ref{morebowls}.
The prior distributions are the same, the likelihoods are the same, and with the same data the results would be the same.
But there are two differences.
The first is the choice of the prior.
In the 101 bowls problem, the uniform prior is implied by the statement of the problem, which says that we choose one of the bowls at random with equal probability.
In the Euro problem, the choice of the prior is subjective; that is, reasonable people could disagree, maybe because they have different information about coins or because they interpret the same information differently.
Because the priors are subjective, the posteriors are subjective, too.
And some people find that problematic.
The other difference is the nature of what we are estimating.
In the 101 bowls problem, we choose the bowl randomly, so it is uncontroversial to compute the probability of choosing each bowl.
In the Euro problem, the proportion of heads is a physical property of a given coin.
Under some interpretations of probability, that's a problem because physical properties are not considered random.
As an example, consider the age of the universe.
Currently, our best estimate is 13.80 billion years, but it might be off by 0.02 billion years in either direction (see \url{https://en.wikipedia.org/wiki/Age_of_the_universe}).
Now suppose we would like to know the probability that the age of the universe is actually greater than 13.81 billion years.
Under some interpretations of probability, we would not be able to answer that question.
We would be required to say something like, ``The age of the universe is not a random quantity, so it has no probability of exceeding a particular value.''
Under the Bayesian interpretation of probability, it is meaningful and useful to treat physical quantities as if they were random and compute probabilities about them.
In the Euro problem, the prior distribution represents what we believe about coins in general and the posterior distribution represents what we believe about a particular coin after seeing the data.
So we can use the posterior distribution to compute probabilities about the coin and its proportion of heads.
The subjectivity of the prior and the interpretation of the posterior are key differences between Bayes's Theorem and Bayesian statistics.
Bayes's Theorem is a mathematical law of probability; no reasonable person objects to it.
But Bayesian statistics is surprisingly controversial.
Historically, many people have been bothered by its subjectivity and its use of probability for things that are not random.
If you are interested in this history, I recommend Sharon Bertsch McGrayne's book, {\it The Theory That Would Not Die} (\url{https://yalebooks.yale.edu/book/9780300188226/theory-would-not-die}).
\index{McGrayne, Sharon Bertsch}
\index{The Theory That Would Not Die}
\section{Summary}
In this chapter I posed David MacKay's Euro problem and we started to solve it.
Given the data, we computed the posterior distribution for \py{x}, the probability a Euro coin comes up heads.
We tried two different priors, updated them with the same data, and found that the posteriors were nearly the same.
This is good news, because it suggests that if two people start with different beliefs and see the same data, their beliefs tend to converge.
We also introduced the binomial distribution, which we used to compute the posterior distribution more efficiently.
And I discussed the difference between applying Bayes's Theorem, as in the 101 bowls problem, and computing Bayesian statistics, as in the Euro problem.
\index{convergence}
However, we still haven't answered MacKay's question: ``Do these data give evidence that the coin is biased rather than fair?''
I'm going to leave this question hanging a little longer; we'll come back to it in Chapter~\ref{hypotest}.
In the next chapter, I want to get back to the dice problem.
\section{Exercises}
The code for this chapter is in \py{chap03.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap03.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
In Major League Baseball, most players have a batting average between 200 and 330, which means that the probability of getting a hit is between 0.2 and 0.33.
Suppose a new player appearing in his first game gets 3 hits out of 3 attempts. What is the posterior distribution for his probability of getting a hit?
\end{exercise}
\begin{exercise}
Whenever you survey people about sensitive issues, you have to deal with ``social desirability bias'', which is the tendency of people to shade their answers to show themselves in the most positive light (see \url{https://en.wikipedia.org/wiki/Social_desirability_bias}).
One of the ways to improve the accuracy of the results is ``randomized response'' (see \url{https://en.wikipedia.org/wiki/Randomized_response}).
As an example, suppose you ask 100 people to flip a coin and:
\begin{itemize}
\item If they get heads, they report YES.
\item If they get tails, they honestly answer the question ``Do you cheat on your taxes?''
\end{itemize}
And suppose you get 80 YESes and 20 NOs. Based on this data, what is the posterior distribution for the fraction of people who cheat on their taxes? What is the most likely value in the posterior distribution?
\end{exercise}
\begin{exercise}
Suppose that instead of observing coin spins directly, you measure the outcome using an instrument that is not always correct. Specifically, suppose the probability is \py{y=0.2} that an actual heads is reported
as tails, or actual tails reported as heads.
If we spin a coin 250 times and the instrument reports 140 heads, what is the posterior distribution of \py{x}?
What happens as you vary the value of \py{y}?
\end{exercise}
\begin{exercise}
In preparation for an alien invasion, the Earth Defense League (EDL) has been working on new missiles to shoot down space invaders. Of course, some missile designs are better than others; let's assume that each design has some probability of hitting an alien ship, \py{x}.
Based on previous tests, the distribution of \py{x} in the population of designs is approximately uniform between 0.1 and 0.4.
Now suppose the new ultra-secret Alien Blaster 9000 is being tested. In a press conference, an EDL general reports that the new design has been tested twice, taking two shots during each test. The results of the test are confidential, so the general won't say how many targets were hit, but they report: ``The same number of targets were hit in the two tests, so we have reason to think this new design is consistent.''
Is this data good or bad; that is, does it increase or decrease your estimate of \py{x} for the Alien Blaster 9000?
Hint: If the probability of hitting each target is $x$, the probability of hitting one target in both tests is $[2x(1-x)]^2$.
\end{exercise}
\chapter{More Estimation}
\label{estimation}
\section{The train problem}
\index{train problem}
\index{Mosteller, Frederick}
\index{German tank problem}
I found the train problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
Probability with Solutions} (\url{https://store.doverpublications.com/0486653552.html}):
\begin{quote}
``A railroad numbers its locomotives in order $1..N$. One day you see a
locomotive with the number 60. Estimate how many locomotives the
railroad has.''
\end{quote}
Based on this observation, we know the railroad has 60 or more
locomotives. But how many more? To apply Bayesian reasoning, we
can break this problem into two steps:
\begin{enumerate}
\item What did we know about $N$ before we saw the data?
\item For any given value of $N$, what is the likelihood of
seeing the data (a locomotive with number 60)?
\end{enumerate}
The answer to the first question is the prior. The answer to the
second is the likelihood.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
\caption{Posterior distribution for the locomotive problem, based
on a uniform prior.}
\label{fig.train1}
\end{figure}
We don't have much basis to choose a prior, so we'll start with
something simple and then consider alternatives.
Let's assume that $N$ is equally likely to be any value from 1 to 1000.
\begin{code}
hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)
\end{code}
Now let's figure out the likelihood of the data.
In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60?
If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.
Here's the function that does the update:
\begin{code}
def update_train(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
\end{code}
The first parameter is a \py{Pmf} that represents the possible values of $N$ and their probabilities.
The second parameter is the number of the train we observed.
This function might look familiar; it is the same as the update function for the dice problem in Section~\ref{dice2}.
\index{dice problem}
Here's the update:
\begin{code}
data = 60
posterior = prior.copy()
update_train(posterior, data)
\end{code}
Figure~\ref{fig04-01} shows the results.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig04-01.pdf}}
\caption{Posterior distribution of the number of trains, $N$, after seeing train number 60.}
\label{fig04-01}
\end{figure}
Not surprisingly, all values of $N$ below 60 have been eliminated.
The most likely value, if you had to guess, is 60.
That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number?
Nevertheless, if you want to maximize the chance of getting
the answer exactly right, you should guess 60.
But maybe that's not the right goal.
An alternative is to compute the mean of the posterior distribution.
Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:
\[ \mathrm{mean} = \sum_i p_i q_i \]
We can compute it like this:
\begin{code}
np.sum(posterior.ps * posterior.qs)
\end{code}
Or we can use the method provided by \py{Pmf}:
\begin{code}
posterior.mean()
\end{code}
The mean of the posterior is 333, so that might be a good guess if you want to minimize error.
If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the mean squared error over the long run (see \url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
\index{mean squared error}
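To make the comparison between the two guesses concrete, here is a minimal sketch that repeats the update in plain Python, without \py{Pmf}, and checks that the posterior mean has a smaller expected squared error than the most likely value.
The dictionary representation and the helper function are illustrative choices, not part of the code for this book.

```python
# Plain-Python sketch of the train update (no Pmf class).
hypos = range(1, 1001)   # candidate fleet sizes N, uniform prior
data = 60                # observed locomotive number

# Likelihood of seeing `data` is 1/N if N >= data, otherwise 0.
unnorm = [1 / n if n >= data else 0 for n in hypos]
total = sum(unnorm)
posterior = {n: u / total for n, u in zip(hypos, unnorm)}

def expected_sq_error(guess):
    """Posterior expected squared error of a single guess."""
    return sum(p * (n - guess) ** 2 for n, p in posterior.items())

mean = sum(n * p for n, p in posterior.items())
map_guess = max(posterior, key=posterior.get)

print(map_guess, round(mean))
print(expected_sq_error(mean) < expected_sq_error(map_guess))
```

The first line of output should show 60 and 333, and the second should be \py{True}.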
\section{What about that prior?}
The prior I chose in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.
\index{prior distribution}
We might wonder whether the posterior distribution is sensitive to the prior.
With so little data---only one observation---it is:
\begin{itemize}
\item With a uniform prior from 1 to 500, the posterior mean is 207.
\item With an upper bound of 1000, it's 333.
\item With an upper bound of 2000, it's 552.
\end{itemize}
So that's bad.
When the posterior is sensitive to the prior, there are two ways to proceed:
\begin{itemize}
\item Get more data.
\item Get more background information and choose a better prior.
\end{itemize}
With more data, posterior distributions based on different
priors tend to converge.
For example, suppose that in addition
to train 60 we also see trains 30 and 90.
We can update the distribution like this:
\begin{code}
for data in [30, 60, 90]:
    update_train(pmf, data)
\end{code}
With these data, the means of the posteriors are
\begin{tabular}{r r}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
500 & 152 \\
1000 & 164\\
2000 & 171\\
\bottomrule
\end{tabular}
The differences are smaller, but apparently three trains is not enough for the posteriors to converge.
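One way to check this sensitivity is to wrap the whole computation in a function of the upper bound.
The following is a sketch in plain Python, assuming a uniform prior from 1 to the bound and the same $1/N$ likelihood as before; the function name is hypothetical.

```python
def posterior_mean(high, dataset):
    """Posterior mean of N under a uniform prior on 1..high."""
    weights = {}
    for n in range(1, high + 1):
        w = 1.0                      # uniform prior (unnormalized)
        for data in dataset:
            # each sighting has likelihood 1/N if possible, else 0
            w *= 1 / n if n >= data else 0
        weights[n] = w
    total = sum(weights.values())
    return sum(n * w for n, w in weights.items()) / total

for high in [500, 1000, 2000]:
    print(high,
          round(posterior_mean(high, [60])),
          round(posterior_mean(high, [30, 60, 90])))
```

The two columns of output should agree with the posterior means reported in this section for one train and for three trains.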
\section{Another prior}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
\caption{Posterior distribution based on a power law prior,
compared to a uniform prior.}
\label{fig.train4}
\end{figure}
If more data are not available, another option is to improve the
priors by gathering more background information.
It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.
With some effort, we could probably find a list of companies that
operate locomotives in the area of observation.
Or we could interview an expert in rail shipping to gather information about the typical size of companies.
But even without getting into the specifics of railroad economics, we
can make some educated guesses.
In most fields, there are many small
companies, fewer medium-sized companies, and only one or two very
large companies.
In fact, the distribution of company sizes tends to
follow a power law, as Robert Axtell reports in {\it Science} (see
\url{https://sci-hub.tw/10.1126/science.1062081}).
\index{power law}
\index{Axtell, Robert}
This law suggests that if there are 1000 companies with fewer than
10 locomotives, there might be 100 companies with 100 locomotives,
10 companies with 1000, and possibly one company with 10,000 locomotives.
Mathematically, a power law means that the number of companies
with a given size is proportional to an inverse power of size, or
\[ \PMF(N) \sim \left( \frac{1}{N} \right)^{\alpha} \]
where $\PMF(N)$ is the probability mass function of $N$ and $\alpha$ is
a parameter that is often near 1.
We can construct a power law prior like this:
\begin{code}
alpha = 1.0
hypos = np.arange(1, 1001)
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()
\end{code}
Again, the upper bound is arbitrary, but with a power law prior, the posterior is less sensitive to this choice.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig04-02.pdf}}
\caption{Posterior distributions for the uniform and power law priors
after seeing train 60.}
\label{fig04-02}
\end{figure}
Figure~\ref{fig04-02} shows the new posterior based on the power law prior, compared to the posterior based on the uniform prior, both after seeing train number 60.
With the power law prior, the posterior is less sensitive to the choice of the upper bound.
If we observe trains 30, 60, and 90, the means of the posteriors are
\begin{tabular}{rr}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
500 & 131 \\
1000 & 133 \\
2000 & 134 \\
\bottomrule
\end{tabular}
Now the differences are much smaller. In fact,
with an arbitrarily large upper bound, the mean converges on 134.
So the power law prior is more realistic, because it is based on
general information about the size of companies, and it behaves better in practice.
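To see how the posterior mean behaves as the upper bound grows, we can compute it directly from the unnormalized weights.
This is a sketch in plain Python, assuming the power law prior with $\alpha = 1$ and the three observed trains; the function name and the largest bound tried are illustrative.

```python
def power_law_posterior_mean(high, dataset, alpha=1.0):
    """Posterior mean of N with prior proportional to N**(-alpha) on 1..high."""
    num = 0.0   # running sum of N * weight
    den = 0.0   # running sum of weight
    biggest = max(dataset)
    for n in range(biggest, high + 1):     # smaller N has zero likelihood
        # prior N**(-alpha) times a likelihood of 1/N per observation
        w = n ** (-alpha) * n ** (-len(dataset))
        num += n * w
        den += w
    return num / den

for high in [500, 1000, 2000, 100_000]:
    print(high, round(power_law_posterior_mean(high, [30, 60, 90]), 1))
```

The results for the first three bounds should round to the values in the table, and the result for the largest bound should differ from the one for 2000 by only a fraction of a train.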
\section{Credible intervals}
\label{credible}
So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean.
These are both {\bf point estimates}, that is, single values that estimate the quantity we are interested in.
Another way to summarize a posterior distribution is with percentiles.
If you have taken a standardized test, you might be familiar with percentiles.
For example, if your score is the 90th percentile, that means you did as well as or better than 90\% of the people who took the test.
If we are given a value, \py{x}, we can compute its {\bf percentile rank} by finding all values less than or equal to \py{x} and adding up their probabilities.
\py{Pmf} provides a method that does this computation.
So, for example, we can compute the probability that the company has less than or equal to 100 trains:
\begin{code}
power.le_dist(100)
\end{code}
With a power law prior and a dataset of three trains, the result is about 29\%.
So 100 trains is the 29th percentile.
Going the other way, suppose we want to compute a particular percentile; for example, the median of a distribution is the 50th percentile.
We can compute it by adding up probabilities until the total reaches or exceeds 0.5.
Here's a function that does it:
\begin{code}
def quantile(pmf, prob):
    total = 0
    for q, p in pmf.items():
        total += p
        if total >= prob:
            return q
    return np.nan
\end{code}
\py{pmf} represents a normalized distribution.
\py{prob} is the probability of the percentile we want to compute.
The loop uses \py{items}, which iterates over the quantities and probabilities in the distribution.
Inside the loop we add up the probabilities of the quantities in order.
When the total equals or exceeds \py{prob}, we return the corresponding quantity.
This function is called \py{quantile} because it computes a quantile rather than a percentile.
The difference is the way we specify \py{prob}.
If \py{prob} is a percentage between 0 and 100, we call the corresponding quantity a percentile.
If \py{prob} is a probability between 0 and 1, we call the corresponding quantity a {\bf quantile}.
Here's how we can use this function to compute the median of the posterior distribution:
\begin{code}
quantile(power, 0.5)
\end{code}
The result, 113 trains, is the median of the posterior distribution.
\py{Pmf} provides a method called \py{quantile} that does the same thing.
We can call it like this to compute the 5th and 95th percentiles:
\begin{code}
power.quantile([0.05, 0.95])
\end{code}
The result is the interval from 91 to 242 trains, which implies:
\begin{itemize}
\item The probability is 5\% that the number of trains is less than or equal to 91.
\item The probability is 5\% that the number of trains is greater than 242.
\end{itemize}
Therefore the probability is 90\% that the number of trains falls between 91 and 242 (excluding 91 and including 242).
For this reason, this interval is called a 90\% {\bf credible interval}.
\py{Pmf} also provides \py{credible_interval}, which computes an interval that contains the given probability.
\begin{code}
power.credible_interval(0.9)
\end{code}
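An alternative to looping through the probabilities for every query is to compute the cumulative probabilities once and then search them.
The following sketch does that in plain Python for the posterior used in this section (power law prior with $\alpha = 1$ on 1 to 1000, updated with trains 30, 60, and 90).
It is not how \py{Pmf} implements these methods; the names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

# Posterior weights: prior N**-1 times likelihood N**-3, for N >= 90.
qs = list(range(90, 1001))
weights = [n ** -4 for n in qs]
total = sum(weights)
cdf = list(accumulate(w / total for w in weights))   # cumulative probabilities

def quantile(prob):
    """Smallest quantity whose cumulative probability is >= prob."""
    i = bisect_left(cdf, prob)
    return qs[min(i, len(qs) - 1)]   # guard against rounding when prob is near 1

median = quantile(0.5)
interval = (quantile(0.05), quantile(0.95))
print(median, interval)
```

This should reproduce the median of 113 trains and the 90\% credible interval from 91 to 242.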
\section{The German tank problem}
During World War II, the Economic Warfare Division of the American
Embassy in London used statistical analysis to estimate German
production of tanks and other equipment.\footnote{Ruggles and Brodie,
``An Empirical Approach to Economic Intelligence in World War II,''
{\em Journal of the American Statistical Association}, Vol. 42,
No. 237 (March 1947).}
The Western Allies had captured log books, inventories, and repair
records that included chassis and engine serial numbers for individual
tanks.
Analysis of these records indicated that serial numbers were allocated
by manufacturer and tank type in blocks of 100 numbers, that numbers
in each block were used sequentially, and that not all numbers in each
block were used. So the problem of estimating German tank production
could be reduced, within each block of 100 numbers, to a form of the
locomotive problem.
Based on this insight, American and British analysts produced
estimates substantially lower than estimates from other forms
of intelligence. And after the war, records indicated that they were
substantially more accurate.
They performed similar analyses for tires, trucks, rockets, and other
equipment, yielding accurate and actionable economic intelligence.
The German tank problem is historically interesting; it is also a nice
example of real-world application of statistical estimation. So far
many of the examples in this book have been toy problems, but it will
not be long before we start solving real problems. I think it is an
advantage of Bayesian analysis, especially with the computational
approach we are taking, that it provides such a short path from a
basic introduction to the research frontier.
\section{Informative priors}
Among Bayesians, there are two approaches to choosing prior
distributions. Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be {\bf informative}. The problem with using an informative
prior is that people might use different background information (or
interpret it differently). So informative priors often seem subjective.
\index{informative prior}
The alternative is a so-called {\bf uninformative prior}, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves. In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.
\index{uninformative prior}
Uninformative priors are appealing because they seem more
objective. But I am generally in favor of using informative priors.
Why? First, Bayesian analysis is always based on
modeling decisions. Choosing the prior is one of those decisions, but
it is not the only one, and it might not even be the most subjective.
So even if an uninformative prior is more objective, the entire analysis
is still subjective.
\index{modeling}
\index{subjectivity}
\index{objectivity}
Also, for most practical problems, you are likely to be in one of two
regimes: either you have a lot of data or not very much. If you have
a lot of data, the choice of the prior doesn't matter very much;
informative and uninformative priors yield almost the same results.
We'll see an example like this in the next chapter.
But if, as in the locomotive problem, you don't have much data,
using relevant background information (like the power law distribution)
makes a big difference.
\index{locomotive problem}
And if, as in the German tank problem, you have to make life-and-death
decisions based on your results, you should probably use all of the
information at your disposal, rather than maintaining the illusion of
objectivity by pretending to know less than you do.
\index{German tank problem}
\section{Exercises}
The code for this chapter is in \py{chap04.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap04.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Suppose you are giving a talk in a large lecture hall and you want to estimate the number of people in the audience. There are too many to count, so you ask how many people were born on May 11 and two people raise their hands. You ask how many were born on May 23 and one person raises their hand. Finally, you ask how many were born on August 1, and no one raises their hand.
How many people are in the audience? What is the 90\% credible interval for your estimate? Hint: Remember the binomial distribution.
\end{exercise}
\begin{exercise}
I often see rabbits in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.
Suppose I deploy a motion-sensing camera trap that takes a picture of the first rabbit it sees each day. After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.
How many rabbits visit my garden?
To answer this question, we have to think about the prior distribution and the likelihood of the data:
\begin{itemize}
\item I have sometimes seen four rabbits at the same time, so I know there are at least that many. I would be surprised if there were more than 10. So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.
\item To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day. Let's also assume it is guaranteed that the camera trap gets a picture every day.
\end{itemize}
\end{exercise}
\begin{exercise}
Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?
\end{exercise}
\begin{exercise}
If I chose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child?
This article from the Pew Research Center provides some relevant data: \url{https://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers}. You will have to make some simplifying assumptions.
\end{exercise}
\begin{exercise}
The Doomsday argument is ``a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far.'' See \url{https://en.wikipedia.org/wiki/Doomsday_argument}.
Suppose there are only two kinds of civilizations that can happen in the universe. The ``short-lived'' kind go extinct after only 200 billion individuals are born. The ``long-lived'' kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely. Which kind of civilization do you think we live in?
The Doomsday argument says we can use the total number of humans born so far as evidence.
According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.
Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion.
If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between.
So what is the probability that you would be number 108 billion?
Given this data and dubious prior, what is the probability that our civilization will be short-lived?
\end{exercise}
\chapter{Odds and Addends}
This chapter presents a new way to represent a degree of certainty, called ``odds'', and a new form of Bayes's Theorem, called Bayes's Rule.
Bayes's Rule is convenient if you want to do a Bayesian update on paper or in your head.
It also sheds light on the important idea of ``evidence'' and how we can quantify the strength of evidence.
The second part of the chapter is about ``addends'', that is, quantities being added, and how we can compute their distributions.
We'll define functions that compute the distribution of a sum, difference, or result of another operation.
And then we'll use those distributions as part of a Bayesian update.
As an exercise, you'll have a chance to solve the Congress problem:
\begin{quote}
There are 538 members of the United States Congress.
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance. How many members of Congress are honest?
\end{quote}
\section{Odds}
One way to represent a degree of certainty is a probability in the form of a number between 0 and 1, but that's not the only way.
If you have ever bet on a football game or a horse race, you might have encountered another representation of certainty, called {\bf odds}.
\index{odds}
You might have heard expressions like ``the odds are
three to one,'' but you might not know what that means.
The {\bf odds in favor} of an event are the ratio of the probability
it will occur to the probability that it will not.
So if I think my team has a 75\% chance of winning, I would
say that the odds in their favor are three to one, because
the chance of winning is three times the chance of losing.
You can write odds in decimal form, but it is also common to
write them as a ratio of integers. So ``three to one'' is
written $3:1$.
When probabilities are low, it is more common to report the
{\bf odds against} rather than the odds in favor. For
example, if I think my horse has a 10\% chance of winning,
I would say that the odds against are $9:1$.
Probabilities and odds are different representations of the
same information. Given a probability, you can compute the
odds like this:
\begin{code}
def odds(p):
    return p / (1-p)
\end{code}
Given the odds in favor, in decimal form, you can convert to probability like this:
\begin{code}
def prob(o):
    return o / (o+1)
\end{code}
If you represent odds with a numerator and denominator, you
can convert to probability like this:
\begin{code}
def prob2(yes, no):
    return yes / (yes + no)
\end{code}
When I work with odds in my head, I find it helpful to picture
people at the track. If 20\% of them think my horse will win,
then 80\% of them don't, so the odds in favor are $20:80$ or
$1:4$.
If the odds are $5:1$ against my horse, then five out of six
people think she will lose, so the probability of winning
is $1/6$.
\index{horse racing}
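As a quick check on the examples in this section, here is a small sketch that repeats the conversions with exact fractions.
The functions are the same as the ones above; using \py{Fraction} from the standard library is a choice made here so the results print as exact ratios.

```python
from fractions import Fraction

def odds(p):
    """Odds in favor, given a probability."""
    return p / (1 - p)

def prob(o):
    """Probability, given odds in favor."""
    return o / (o + 1)

# A 75% chance of winning is odds of 3:1 in favor.
print(odds(Fraction(3, 4)))        # 3

# A 10% chance is odds of 1:9 in favor, that is, 9:1 against.
print(1 / odds(Fraction(1, 10)))   # 9

# Odds of 5:1 against are odds of 1:5 in favor.
print(prob(Fraction(1, 5)))        # 1/6
```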
\section{Bayes's Rule}
\index{Bayes's Rule}
In Chapter~\ref{intro} I wrote Bayes's Theorem in the {\bf probability
form}:
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
If we have two hypotheses, $A$ and $B$,
we can write the ratio of posterior probabilities like this:
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}
{\p{B}~\p{D|B}} \]
Notice that the total probability of the data, \p{D}, drops out of
this equation.
Writing \odds{A} for odds in favor of $A$, we use the definition of odds to write:
\[ \odds{A} = \frac{\p{A}}{1-\p{A}} \]
If $A$ and $B$ are mutually exclusive and collectively exhaustive,
that means $\p{B} = 1 - \p{A}$, so we can write
\[ \odds{A} = \frac{\p{A}}{\p{B}} \]
By the same process, we can write the posterior odds like this:
\[ \odds{A|D} = \frac{\p{A|D}}{\p{B|D}} \]
Putting it all together, we have:
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
This is Bayes's Rule, which says that the posterior odds are the prior odds times the likelihood ratio.
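In more detail, we start from the definition of the posterior odds, substitute the ratio of posterior probabilities, and group the prior terms, still assuming that $A$ and $B$ are mutually exclusive and collectively exhaustive:
\[ \odds{A|D} = \frac{\p{A|D}}{\p{B|D}}
= \frac{\p{A}~\p{D|A}}{\p{B}~\p{D|B}}
= \frac{\p{A}}{\p{B}} \cdot \frac{\p{D|A}}{\p{D|B}}
= \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]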
This form of Bayes's Theorem is convenient for computing a Bayesian update on paper or in your head.
For example, let's go back to the cookie problem:
\index{cookie problem}
\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.
The cookie is vanilla.
What is the probability that it came from Bowl 1?
\end{quote}
The prior probability is 50\%, so the prior odds are $1$.
The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
So the posterior odds are $3/2$, which corresponds to probability
$3/5$.
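The same arithmetic can be written out as a short sketch, again using exact fractions from the standard library; the variable names are illustrative.

```python
from fractions import Fraction

# Both bowls are equally likely before we see the cookie.
prior_odds = Fraction(1, 2) / Fraction(1, 2)

# P(vanilla | Bowl 1) / P(vanilla | Bowl 2)
likelihood_ratio = Fraction(30, 40) / Fraction(20, 40)

# Bayes's Rule: posterior odds = prior odds * likelihood ratio
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (posterior_odds + 1)

print(posterior_odds, posterior_prob)   # 3/2 3/5
```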
\section{Oliver's blood}
\label{oliver}
\index{Oliver's blood problem}
\index{MacKay, David}
I'll use Bayes's Rule to solve another problem from MacKay's {\it Information Theory, Inference, and Learning Algorithms}:
\begin{quote}
Two people have left traces of their own blood at the scene of
a crime. A suspect, Oliver, is tested and found to have type
`O' blood. The blood groups of the two traces are found to
be of type `O' (a common type in the local population, having frequency
60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data [the traces found at the scene] give evidence
in favor of the proposition that Oliver was one of the people
[who left blood at the scene]?
\end{quote}
To answer this question, we need to think about what it means
for data to give evidence in favor of (or against) a hypothesis.
Intuitively, we might say that data favor a hypothesis if the
hypothesis is more likely in light of the data than it was before.
\index{evidence}
In the cookie problem, the prior odds are $1$, or probability 50\%.
The posterior odds are $3/2$, or probability 60\%.
So the vanilla cookie is evidence in favor of Bowl 1.
Bayes's Rule provides a way to make this intuition more precise. Again
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
Dividing through by \odds{A}, we get:
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
The term on the left is the ratio of the posterior and prior odds.
The term on the right is the likelihood ratio, also called the {\bf Bayes
factor}.
\index{likelihood ratio}
\index{Bayes factor}
If the Bayes factor is greater than 1, that means that the
data were more likely under $A$ than under $B$.
And that means that the odds are greater, in light of the data, than they were before.
If the Bayes factor is less than 1, that means the data were
less likely under $A$ than under $B$, so the odds in
favor of $A$ go down.
Finally, if the Bayes factor is exactly 1, the data are equally
likely under either hypothesis, so the odds do not change.
Let's apply that to the problem at hand. If Oliver is
one of the people who left blood at the crime scene, he
accounts for the `O' sample; in that case, the probability of the data
is the probability that a random member of the population
has type `AB' blood, which is 1\%.
If Oliver did not leave blood at the scene, we have two
samples to account for. If we choose two random people from
the population, what is the chance of finding one with type `O'
and one with type `AB'? Well, there are two ways it might happen:
the first person might have type `O' and the second
`AB', or the other way around. So the total probability is
$2 (0.6) (0.01) = 1.2\%$.
The likelihood of the data is slightly higher if Oliver is
{\it not} one of the people who left blood at the scene, so
the blood data is actually evidence against Oliver's guilt.
\index{evidence}
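Here is the same computation as a short sketch, which makes it easy to see the Bayes factor directly.
The variable names are illustrative; the frequencies come from the problem statement.

```python
p_O, p_AB = 0.6, 0.01   # population frequencies of types O and AB

# If Oliver left the 'O' trace, only the 'AB' trace needs a random donor.
like_oliver = p_AB

# Otherwise two random people must supply one 'O' and one 'AB',
# in either order.
like_not_oliver = 2 * p_O * p_AB

bayes_factor = like_oliver / like_not_oliver
print(round(bayes_factor, 3))
```

The Bayes factor is $5/6$, or about 0.83, which is less than 1, so the odds that Oliver was one of the people go down slightly.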
This example is a little contrived, but it demonstrates
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.
If this result still bothers you, this way of thinking might help: the data consist of a common event, type `O' blood, and a rare event, type `AB' blood.
If Oliver accounts for the common event, that leaves the rare
event unexplained. If Oliver doesn't account for the
`O' blood, we have two chances to find someone in the
population with `AB' blood. And that factor of two makes
the difference.
\section{Addends}
\label{addends}
Suppose you roll two dice and add them up. What is the distribution of the sum?
I'll use the following function to create a \py{Pmf} that represents the outcome of a die:
\begin{code}
def make_die(sides):
    outcomes = np.arange(1, sides+1)
    die = Pmf(1/sides, outcomes)
    return die
\end{code}
On a six-sided die, there are six possible outcomes, 1 through 6, all equally likely.
\begin{code}
die = make_die(6)
\end{code}
If we roll two dice and add them up, there are 11 possible outcomes, 2 through 12, but they are not equally likely.
To compute the distribution of the sum, we can enumerate the possible outcomes.
The following loop enumerates the quantities and probabilities from a \py{Pmf}:
\begin{code}
for q, p in die.items():
    print(q, p)
\end{code}
\py{items} iterates through the quantities and probabilities in the \py{Pmf}.
So this loop enumerates all pairs of quantities and their probabilities:
\begin{code}
for q1, p1 in pmf1.items():
    for q2, p2 in pmf2.items():
        q = q1 + q2
        p = p1 * p2
\end{code}
Each time through the loop \py{q} gets the sum of the pair of quantities, and \py{p} gets the probability of the pair.
Because the same sum might appear more than once, we have to add up the total probability for each sum.
And that's how this function works:
\begin{code}
def add_dist(pmf1, pmf2):
    res = Pmf()
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            q = q1 + q2
            p = p1 * p2
            res[q] = res(q) + p
    return res
\end{code}
The parameters are \py{Pmf} objects representing distributions.
The first line creates an empty \py{Pmf}.
Each time through the loop, we compute \py{q} and \py{p} and then increment the probability associated with \py{q}.
Notice a subtle element of this line:
\begin{code}
res[q] = res(q) + p
\end{code}
I use parentheses on the right side of the assignment, which returns 0 if \py{q} does not appear yet in \py{res}.
I use brackets on the left side of the assignment to create or update an element in \py{res}; using parentheses on the left side would not work.
\py{Pmf} provides a method that does the same thing.
You can call it as a method, like this:
\begin{code}
twice = die.add_dist(die)
\end{code}
Or as a function, like this:
\begin{code}
twice = Pmf.add_dist(die, die)
\end{code}
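If you want to see the same computation without the \py{Pmf} class, here is a minimal sketch that uses plain dictionaries and exact fractions. It is an illustration under that assumption, not the library's implementation; \py{dict.get} plays the role of the parentheses in \py{res(q)}, returning 0 for a sum that has not appeared yet.

```python
from fractions import Fraction

def make_die(sides):
    """Map each outcome 1..sides to probability 1/sides."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def add_dist(pmf1, pmf2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            q = q1 + q2
            # Several pairs can produce the same sum, so accumulate.
            res[q] = res.get(q, 0) + p1 * p2
    return res

die = make_die(6)
twice = add_dist(die, die)
print(twice[7])   # 1/6
```

Six of the 36 equally likely pairs add up to 7, so its probability is 1/6, and the probabilities of all 11 sums add up to 1.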
If we have a sequence of \py{Pmf} objects that represent dice, we can compute the distribution of the sum like this:
\begin{code}
def add_dist_seq(seq):
    total = seq[0]
    for other in seq[1:]:
        total = total.add_dist(other)
    return total
\end{code}
So we can compute the sum of three dice like this:
\begin{code}
dice = [die] * 3
thrice = add_dist_seq(dice)
\end{code}
Figure~\ref{fig05-01} shows what these three distributions look like:
\begin{itemize}
\item The distribution of a single die is uniform from 1 to 6.
\item The sum of two dice has a triangle distribution between 2 and 12.
\item The sum of three dice has a bell-shaped distribution between 3 and 18.
\end{itemize}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig05-01.pdf}}
\caption{Distribution of outcomes for one six-sided die, two dice, and three dice.}
\label{fig05-01}
\end{figure}
As an aside, this example demonstrates the Central Limit Theorem, which says that the distribution of a sum converges on a bell-shaped normal distribution, at least under some conditions.
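To see the shape of the three-dice distribution numerically, here is a small sketch that again uses dictionaries and exact fractions rather than \py{Pmf} objects (an assumption made so the example is self-contained). It folds \py{add_dist} over a sequence, as \py{add_dist_seq} does, and then finds the most likely sums.

```python
from fractions import Fraction
from functools import reduce

def make_die(sides):
    """Map each outcome 1..sides to probability 1/sides."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def add_dist(pmf1, pmf2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            res[q1 + q2] = res.get(q1 + q2, 0) + p1 * p2
    return res

def add_dist_seq(seq):
    """Distribution of the sum of all quantities in seq."""
    return reduce(add_dist, seq)

thrice = add_dist_seq([make_die(6)] * 3)
peak = max(thrice.values())
modes = sorted(q for q, p in thrice.items() if p == peak)
print(modes, peak)   # [10, 11] 1/8
```

The two middle sums, 10 and 11, are the most likely, and the probabilities fall off symmetrically toward 3 and 18.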
\section{Gluten}
In 2015 I read a paper that tested whether people diagnosed with gluten sensitivity (but not celiac disease) were able to distinguish gluten flour from non-gluten flour in a blind challenge (\url{https://onlinelibrary.wiley.com/doi/full/10.1111/apt.13372}).
Out of 35 subjects, 12 correctly identified the gluten flour based on resumption of symptoms while they were eating it. Another 17 wrongly identified the gluten-free flour based on their symptoms, and 6 were unable to distinguish.
The authors conclude, ``Double-blind gluten challenge induces symptom recurrence in just one-third of patients.''
This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.
So here's the question: based on this data, how many of the subjects are sensitive to gluten?
We can use Bayes's Theorem to answer this question, but first we have to make some modeling decisions.
I'll assume:
\begin{itemize}
\item People who are sensitive to gluten have a 95\% chance of correctly identifying gluten flour under the challenge conditions, and
\item People who are not sensitive have a 40\% chance of identifying the gluten flour by chance (and a 60\% chance of either choosing the other flour or failing to distinguish).
\end{itemize}
These particular values are arbitrary, but the results are not sensitive to these choices.
I will solve this problem in two steps. First, assuming that we know how many subjects are sensitive, I will compute the distribution of the data. Then, using the likelihood of the data, I will compute the posterior distribution of the number of sensitive patients.
The first is the {\bf forward problem}; the second is the {\bf inverse problem}.
\section{Forward problem}
Suppose we know that 10 of the 35 subjects are sensitive to gluten. That means that 25 are not:
\begin{code}
n = 35
n_sensitive = 10
n_insensitive = n - n_sensitive
\end{code}
Each sensitive subject has a 95\% chance of identifying the gluten flour, so the number of correct identifications follows a binomial distribution with \py{p=0.95}:
\begin{code}
dist_sensitive = make_binomial(n_sensitive, 0.95)
\end{code}
And similarly for the insensitive subjects:
\begin{code}
dist_insensitive = make_binomial(n_insensitive, 0.4)
\end{code}
\py{make_binomial} returns a \py{Pmf} that represents the distribution of correct identifications.
So we can use \py{add_dist} to compute the total number of correct identifications in both groups:
\begin{code}
dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
\end{code}
Figure~\ref{fig05-02} shows the distribution of correct identifications among sensitive and insensitive subjects, and the total.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig05-02.pdf}}
\caption{Distribution of correct identifications among sensitive and insensitive subjects, and the total.}
\label{fig05-02}
\end{figure}
Of the 10 sensitive subjects, we expect most of them to identify the gluten flour correctly.
Of the 25 insensitive subjects, we expect about 10 to identify the gluten flour by chance.
So we expect about 20 correct identifications in total.
This is the answer to the forward problem: given the number of sensitive subjects, we can compute the distribution of the data.
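As a rough check on those expectations, here is a sketch of the forward problem in plain Python. The \py{make_binomial} and \py{add_dist} below are stand-ins I wrote for this example (lists indexed by the number of correct identifications), not the functions used in the text.

```python
from math import comb

def make_binomial(n, p):
    """List whose k-th entry is the binomial probability of k successes."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def add_dist(pmf1, pmf2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(pmf1) + len(pmf2) - 1)
    for q1, p1 in enumerate(pmf1):
        for q2, p2 in enumerate(pmf2):
            res[q1 + q2] += p1 * p2
    return res

def mean(pmf):
    """Expected value of a PMF stored as a list indexed by count."""
    return sum(q * p for q, p in enumerate(pmf))

n, n_sensitive = 35, 10
dist_sensitive = make_binomial(n_sensitive, 0.95)
dist_insensitive = make_binomial(n - n_sensitive, 0.4)
dist_total = add_dist(dist_sensitive, dist_insensitive)

print(mean(dist_sensitive), mean(dist_insensitive), mean(dist_total))
```

The means are about 9.5, 10, and 19.5, consistent with ``about 20 correct identifications in total.''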
\section{Inverse Problem}
Now let's solve the inverse problem: given the data, we'll compute the posterior distribution of the number of sensitive subjects.
Here's how. I'll loop through the possible values of \py{n_sensitive} and compute the distribution of the data for each:
\begin{code}
table = pd.DataFrame()
for n_sensitive in range(1, n):
    n_insensitive = n - n_sensitive
    dist_sensitive = make_binomial(n_sensitive, 0.95)
    dist_insensitive = make_binomial(n_insensitive, 0.4)
    dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
    table[n_sensitive] = dist_total
\end{code}
I store each distribution as a column in a Pandas DataFrame.
When \py{n_sensitive} is 0 or \py{n}, the distribution of the data is a simple binomial, not the sum of two binomials:
\begin{code}
table[0] = make_binomial(n, 0.4)
table[n] = make_binomial(n, 0.95)
\end{code}
Figure~\ref{fig05-03} shows several columns from this table, corresponding to several hypothetical values of \py{n_sensitive}:
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig05-03.pdf}}
\caption{Distribution of the number of correct identifications for different values of \py{n_sensitive}.}
\label{fig05-03}
\end{figure}
Now we can use this table to compute the likelihood of the data:
\begin{code}
likelihood = table.loc[12]
\end{code}
\py{loc} selects a row from the table.
The row with index 12 contains the probability of 12 correct identifications for each hypothetical value of \py{n_sensitive}.
And that's exactly the likelihood we need to do a Bayesian update.
I'll use a uniform prior, which implies that I would be equally surprised by any value of \py{n_sensitive}:
\begin{code}
hypos = np.arange(n+1)
prior = Pmf(1, hypos)
\end{code}
And here's the update:
\begin{code}
posterior = prior * likelihood
posterior.normalize()
\end{code}
Figure~\ref{fig05-04} shows posterior distributions of \py{n_sensitive} based on the actual data, 12 correct identifications, and another hypothetical outcome, 20 correct identifications.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig05-04.pdf}}
\caption{Posterior distributions of \py{n_sensitive}.}
\label{fig05-04}
\end{figure}
With 12 correct identifications, the most likely conclusion is that none of the subjects are sensitive to gluten.
If there had been 20 correct identifications, the most likely conclusion would be that 11-12 of the subjects were sensitive.
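The whole inverse computation fits in a short, self-contained sketch. It uses plain lists instead of a DataFrame and \py{Pmf} objects, and the helper names (\py{likelihood_of}, \py{posterior_of}, \py{map_estimate}) are hypothetical, chosen for this example. Because a binomial with zero trials puts all of its probability on 0, the cases where \py{n_sensitive} is 0 or \py{n} need no special handling here.

```python
from math import comb

def make_binomial(n, p):
    """List whose k-th entry is the binomial probability of k successes."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def add_dist(pmf1, pmf2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(pmf1) + len(pmf2) - 1)
    for q1, p1 in enumerate(pmf1):
        for q2, p2 in enumerate(pmf2):
            res[q1 + q2] += p1 * p2
    return res

def likelihood_of(k_correct, n=35, p_sens=0.95, p_insens=0.4):
    """P(k_correct identifications | n_sensitive) for n_sensitive = 0..n."""
    like = []
    for n_sensitive in range(n + 1):
        dist_total = add_dist(make_binomial(n_sensitive, p_sens),
                              make_binomial(n - n_sensitive, p_insens))
        like.append(dist_total[k_correct])
    return like

def posterior_of(k_correct):
    """Posterior over n_sensitive under a uniform prior."""
    like = likelihood_of(k_correct)
    total = sum(like)   # with a uniform prior, normalizing is the update
    return [x / total for x in like]

def map_estimate(posterior):
    """Value of n_sensitive with the highest posterior probability."""
    return max(range(len(posterior)), key=posterior.__getitem__)

post12 = posterior_of(12)
post20 = posterior_of(20)
print(map_estimate(post12), map_estimate(post20))
```

With 12 correct identifications the most probable value is 0; with 20 it lands near 11, in line with Figure~\ref{fig05-04}.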
\section{Summary}
This chapter presents two topics that are almost unrelated except that they make the title of the chapter catchy.
The first part of the chapter is about Bayes's Rule, evidence, and how we can quantify the strength of evidence using a likelihood ratio or Bayes factor.
The second part is about functions that compute the distribution of a sum, product, or the result of another binary operation.
We can use these functions to solve forward and inverse problems; that is, given the parameters of a system, we can compute the distribution of the data or, given the data, we can compute the distribution of the parameters.
In the following exercises, you'll have a chance to practice what you learned.
\section{Exercises}
The code for this chapter is in \py{chap05.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap05.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Let's use Bayes's Rule to solve the Elvis problem from Section~\ref{elvis}:
\begin{quote}
Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?
\end{quote}
In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
The question contains two pieces of information we can use to update this prior.
First, Elvis's twin was also male, which is more likely if they were identical twins, with a likelihood ratio of 2.
Also, Elvis's twin died at birth, which is more likely if they were identical twins, with a likelihood ratio of 1.25.
If you are curious about where those numbers come from, I wrote a blog post about it at \url{https://www.allendowney.com/blog/2020/01/28/the-elvis-problem-revisited}.
\end{exercise}
\begin{exercise}
The following is an interview question that appeared on glassdoor.com, attributed to Facebook (\url{https://www.glassdoor.com/Interview/You-re-about-to-get-on-a-plane-to-Seattle-You-want-to-know-if-you-should-bring-an-umbrella-You-call-3-random-friends-of-y-QTN_519262.htm}):
\begin{quote}
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that ``Yes'' it is raining. What is the probability that it's actually raining in Seattle?
\end{quote}
Use Bayes's Rule to solve this problem. As a prior you can assume that it rains in Seattle about 10\% of the time.
\end{exercise}
\begin{exercise}
According to the CDC, people who smoke are about 25 times more likely to develop lung cancer than nonsmokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/effects_cig_smoking/}).
Also according to the CDC, about 14\% of adults in the U.S. are smokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm}).
If you learn that someone has lung cancer, what is the probability they are a smoker?
\end{exercise}
\begin{exercise}
In {\it Dungeons~\&~Dragons}, the amount of damage a goblin can withstand is the sum of two six-sided dice. The amount of damage you inflict with a short sword is determined by rolling one six-sided die.
A goblin is defeated if the total damage you inflict is greater than or equal to the amount it can withstand.
Suppose you are fighting a goblin and you have already inflicted 3 points of damage. What is your probability of defeating the goblin with your next successful attack?
Hint: You can use \py{Pmf.add_dist} to add a constant amount, like 3, to a \py{Pmf}.
\end{exercise}
\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it twice, multiply the outcomes, and report that the product is 12.
What is the probability that I chose the 8-sided die?
Hint: \py{Pmf} provides a function called \py{mul_dist} that takes two \py{Pmf} objects and returns a \py{Pmf} that represents the distribution of the product.
\end{exercise}
\begin{exercise}
{\it Betrayal at House on the Hill} is a strategy game in which characters with different attributes explore a haunted house. Depending on their attributes, the characters roll different numbers of dice. For example, if attempting a task that depends on knowledge, Professor Longfellow rolls 5 dice, Madame Zostra rolls 4, and Ox Bellows rolls 3. Each die yields 0, 1, or 2 with equal probability.
If a randomly chosen character attempts a task three times and rolls a total of 3 on the first attempt, 4 on the second, and 5 on the third, which character do you think it was?
\end{exercise}
\begin{exercise}
There are 538 members of the United States Congress.
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance. How many members of Congress are honest?
\end{exercise}
\chapter{Minima, Maxima, and Mixtures}
In the previous chapter we computed distributions of sums, differences, products, and quotients.
In this chapter, we'll compute distributions of minima and maxima and use them to solve inference problems.
Then we'll look at distributions that are mixtures of other distributions, which will turn out to be particularly useful for making predictions.
But we'll start with a powerful tool for working with distributions, the cumulative distribution function.
\section{Cumulative distribution functions}
So far we have been using probability mass functions to represent distributions.
A useful alternative is the {\bf cumulative distribution function}, or CDF.
As an example, I'll use the posterior distribution from the Euro problem, which we computed in Section~\ref{binomlike}.
\begin{code}
hypos = np.linspace(0, 1, 101)
pmf = Pmf(1, hypos)
data = 140, 250
update_binomial(pmf, data)
\end{code}
The CDF is the cumulative sum of the PMF, so we can compute it like this:
\begin{code}
cumulative = pmf.cumsum()
\end{code}
The result is a Pandas Series, so we can use the bracket operator to select an element:
\begin{code}
cumulative[0.61]
\end{code}
The result is about 0.96, which means that the total probability of all quantities less than or equal to 0.61 is 96\%.
To go the other way --- to look up a probability and get the corresponding quantile --- we can use interpolation:
\begin{code}
from scipy.interpolate import interp1d
ps = cumulative.values
qs = cumulative.index
interp = interp1d(ps, qs)
interp(0.96)
\end{code}
The result is about 0.61, so that confirms that the 96th percentile of this distribution is 0.61.
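The two directions of lookup are easier to see on a small distribution. Here is a sketch that uses a six-sided die rather than the Euro posterior, with exact fractions. Note one difference from the code above: this \py{quantile} is a step-function lookup (the smallest quantity whose cumulative probability reaches \py{p}), not an interpolation.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

qs = list(range(1, 7))               # quantities: the faces of a die
ps = [Fraction(1, 6)] * 6            # PMF: all faces equally likely
cumulative = list(accumulate(ps))    # running total of the PMF

def cdf(q):
    """Total probability of quantities <= q (q must be one of qs)."""
    return cumulative[qs.index(q)]

def quantile(p):
    """Smallest quantity whose cumulative probability is at least p."""
    return qs[bisect_left(cumulative, p)]

print(cdf(4))                      # 2/3
print(quantile(Fraction(1, 2)))    # 3
```

Because the cumulative probabilities are sorted, the inverse lookup can use binary search, which is one reason quantiles are cheap to compute from a CDF.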
\py{empiricaldist} provides a class called \py{Cdf} that represents a cumulative distribution function.
Given a \py{Pmf}, you can compute a \py{Cdf} like this:
\begin{code}
cdf = pmf.make_cdf()
\end{code}
\py{make_cdf} uses \py{np.cumsum} to compute the cumulative sum of the probabilities.
Figure~\ref{fig06-01} shows the PMF and CDF of this distribution.
The range of the CDF is always from 0 to 1, in contrast with the PMF, where the maximum can be any probability.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig06-01.pdf}}
\caption{Posterior distribution from the Euro problem represented as a PMF and CDF.}
\label{fig06-01}
\end{figure}
You can use brackets to select an element from a \py{Cdf}:
\begin{code}
cdf[0.61]
\end{code}
But if you look up a value that's not in the distribution, you get a \py{KeyError}.
You can also call a \py{Cdf} as a function, using parentheses.
If the argument does not appear in the \py{Cdf}, it interpolates between quantities.
\begin{code}
cdf(0.615)
\end{code}
Going the other way, you can use \py{quantile} to look up a cumulative probability and get the corresponding quantity:
\begin{code}
cdf.quantile(0.96)
\end{code}
\py{Cdf} also provides \py{credible_interval}, which computes a credible interval that contains the given probability:
\begin{code}
cdf.credible_interval(0.9)
\end{code}
CDFs and PMFs are equivalent in the sense that they contain the
same information about the distribution, and you can always convert
from one to the other.
Given a \py{Cdf}, you can get the equivalent \py{Pmf} like this:
\begin{code}
pmf = cdf.make_pmf()
\end{code}
\py{make_pmf} uses \py{np.diff} to compute differences between consecutive cumulative probabilities.
One reason \py{Cdf} objects are useful is that they compute quantiles efficiently.
Another is that they make it easy to compute the distribution of a maximum or minimum, as we'll see in the next section.
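The equivalence between the two representations can be checked with a round trip. The following sketch uses dictionaries and exact fractions; its \py{make_cdf} and \py{make_pmf} are simplified stand-ins for the methods of the same names, written for this example.

```python
from fractions import Fraction
from itertools import accumulate

pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 5: Fraction(1, 6)}

def make_cdf(pmf):
    """Cumulative sum of the probabilities, in order of quantity."""
    qs = sorted(pmf)
    return dict(zip(qs, accumulate(pmf[q] for q in qs)))

def make_pmf(cdf):
    """Differences between consecutive cumulative probabilities."""
    res, prev = {}, 0
    for q in sorted(cdf):
        res[q] = cdf[q] - prev
        prev = cdf[q]
    return res

cdf = make_cdf(pmf)
print(cdf[2], make_pmf(cdf) == pmf)   # 5/6 True
```

Taking differences undoes the cumulative sum, so no information is lost in either direction.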
\section{Best Three of Four}
In {\it Dungeons~\&~Dragons}, each character has six attributes: strength, intelligence, wisdom, dexterity, constitution, and charisma.
To generate a new character, players roll four 6-sided dice for each attribute and add up the best three.
For example, if I roll for strength and get 1, 2, 3, 4 on the dice, my character's strength would be 9.
As an exercise, let's figure out the distribution of these attributes.
Then, for each character, we'll figure out the distribution of their best attribute.
In Section~\ref{addends}, we computed the distribution of the sum of three dice like this:
\begin{code}
die = make_die(6)
dice = [die] * 3
pmf_3d6 = add_dist_seq(dice)
\end{code}
The definitions of \py{make_die} and \py{add_dist_seq} are in that section.
But if we roll four dice and add up the best three, computing the distribution of the sum is a bit more complicated.
I'll estimate the distribution by simulating 10,000 rolls.
First I'll create an array of random values from 1 to 6, with 10,000 rows and 4 columns:
\begin{code}
n = 10000
a = np.random.randint(1, 7, size=(n, 4))
\end{code}
To find the best three outcomes in each row, I'll sort along \py{axis=1}, which means across the columns.
\begin{code}
a.sort(axis=1)
\end{code}
Finally, I'll select the last three columns and add them up.
\begin{code}
t = a[:, 1:].sum(axis=1)
\end{code}
Now \py{t} is a one-dimensional array with 10,000 elements, one for each row of \py{a}.
We can compute the PMF of the values in \py{t} like this:
\begin{code}
pmf_4d6 = Pmf.from_seq(t)
\end{code}
Figure~\ref{fig06-02} shows the distribution of the sum of three dice, \py{pmf_3d6}, and the distribution of the best three out of four, \py{pmf_4d6}.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig06-02.pdf}}
\caption{Distributions of the sum of three dice and the best three of four.}
\label{fig06-02}
\end{figure}
As you might expect, choosing the best three out of four tends to yield higher values.
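Because four dice have only $6^4 = 1296$ equally likely outcomes, the simulated distribution can be cross-checked by enumerating all of them. This is an alternative to the simulation, not the approach used above; the function name \py{best_three_of_four} is mine.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def best_three_of_four():
    """Exact PMF of the sum of the best three of four six-sided dice."""
    # Dropping the lowest die is the same as subtracting the minimum.
    counts = Counter(sum(roll) - min(roll)
                     for roll in product(range(1, 7), repeat=4))
    total = sum(counts.values())    # 6**4 = 1296 equally likely rolls
    return {q: Fraction(c, total) for q, c in sorted(counts.items())}

pmf_4d6 = best_three_of_four()
mean_4d6 = sum(q * p for q, p in pmf_4d6.items())
print(float(mean_4d6), pmf_4d6[18])
```

The exact mean is a little over 12.2, compared with 10.5 for the sum of three dice, and the chance of an 18 is 21 in 1296.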
Next we'll find the distribution for the maximum of six attributes, each the sum of the best three of four dice.
\section{Maximum of Six}
To compute the distribution of a maximum or minimum, we can make good use of the cumulative distribution function.
First, I'll compute the \py{Cdf} of the best three of four distribution:
\begin{code}
cdf_4d6 = pmf_4d6.make_cdf()
\end{code}
Recall that \py{Cdf(x)} is the sum of probabilities for quantities less than or equal to \py{x}.
Equivalently, it is the probability that a random value chosen from the distribution is less than or equal to \py{x}.
Now suppose I draw 6 values from this distribution.
The probability that all 6 of them are less than or equal to \py{x} is \py{Cdf(x)} raised to the 6th power, which we can compute like this:
\begin{code}
cdf_4d6**6
\end{code}
If all 6 values are less than or equal to \py{x}, that means that their maximum is less than or equal to \py{x}.
So the result is the CDF of their maximum.
We can convert it to a \py{Cdf} object, like this:
\begin{code}
cdf_max6 = Cdf(cdf_4d6**6)
\end{code}
And compute the equivalent \py{Pmf} like this:
\begin{code}
pmf_max6 = cdf_max6.make_pmf()
\end{code}
Figure~\ref{fig06-03} shows the result.
Most characters have at least one attribute greater than 12; almost 10\% of them have an 18.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig06-03.pdf}}
\caption{Distribution for the minimum and maximum of six attributes.}
\label{fig06-03}
\end{figure}
\py{Pmf} and \py{Cdf} provide \py{max_dist}, which does the same computation.
We can compute the \py{Pmf} of the maximum like this:
\begin{code}
pmf_4d6.max_dist(6)
\end{code}
And the \py{Cdf} of the maximum like this:
\begin{code}
cdf_4d6.max_dist(6)
\end{code}
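To make the steps concrete, here is a self-contained sketch of the maximum-of-six computation. It assumes an exact enumeration of the four dice in place of the simulated \py{pmf_4d6}, and it stores the CDF as a plain list rather than a \py{Cdf} object.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Exact PMF of the best three of four dice, by enumeration.
counts = Counter(sum(r) - min(r) for r in product(range(1, 7), repeat=4))
qs = sorted(counts)
ps = [Fraction(counts[q], 6**4) for q in qs]

cdf_4d6 = list(accumulate(ps))
cdf_max6 = [p**6 for p in cdf_4d6]   # P(all six attributes <= x)

# Differences between consecutive cumulative probabilities give the PMF.
pmf_max6 = [b - a for a, b in zip([0] + cdf_max6, cdf_max6)]

p_18 = pmf_max6[qs.index(18)]
print(float(p_18))
```

The probability that the best attribute is 18 comes out a little over 9\%, which agrees with ``almost 10\%'' above.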
In the next section we'll find the distribution of the minimum.
The process is similar, but a little more complicated.
See if you can figure it out before you go on.
\section{Minimum of Six}
In the previous section we computed the distribution of a character's best attribute.
Now let's compute the distribution of the worst.
To compute the distribution of the minimum, we'll use the {\bf complementary CDF}, which we can compute like this:
\begin{code}
prob_gt = 1 - cdf_4d6
\end{code}
As the variable name suggests, the complementary CDF is the probability that a value from the distribution is greater than \py{x}.
If we draw 6 values from the distribution, the probability that all 6 exceed \py{x} is:
\begin{code}
prob_gt6 = prob_gt**6
\end{code}
If all 6 exceed \py{x}, that means their minimum exceeds \py{x}, so \py{prob_gt6} is the complementary CDF of the minimum.
And that means we can compute the CDF of the minimum like this:
\begin{code}
prob_le6 = 1 - prob_gt6
\end{code}
The result is a Pandas Series that represents the CDF of the minimum of six attributes.
We can put those values in a \py{Cdf} object like this:
\begin{code}
cdf_min6 = Cdf(prob_le6)
\end{code}
Figure~\ref{fig06-03} shows the result.
\py{Pmf} and \py{Cdf} provide \py{min_dist}, which does the same computation.
We can compute the \py{Pmf} of the minimum like this:
\begin{code}
pmf_4d6.min_dist(6)
\end{code}
And the \py{Cdf} of the minimum like this:
\begin{code}
cdf_4d6.min_dist(6)
\end{code}
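Here is the same sequence of steps as a self-contained sketch, again assuming an exact enumeration of the four dice and plain lists in place of \py{Cdf} objects.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Exact PMF of the best three of four dice, by enumeration.
counts = Counter(sum(r) - min(r) for r in product(range(1, 7), repeat=4))
qs = sorted(counts)
ps = [Fraction(counts[q], 6**4) for q in qs]
cdf_4d6 = list(accumulate(ps))

prob_gt = [1 - p for p in cdf_4d6]      # P(one attribute > x)
prob_gt6 = [p**6 for p in prob_gt]      # P(all six > x) = P(minimum > x)
cdf_min6 = [1 - p for p in prob_gt6]    # P(minimum <= x)

# qs[0] is 3, the lowest possible attribute.
print(qs[0], float(cdf_min6[0]))
```

The lowest possible attribute, 3, requires four 1s on one roll, so the chance that the worst of six attributes is a 3 is $1 - (1295/1296)^6$, which is less than half a percent.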
In the exercises at the end of the chapter, you'll use distributions of the minimum and maximum to do Bayesian inference.
But first we'll see what happens when we mix distributions.
\section{Mixtures}
\label{mixture}
Let's do one more example inspired by {\it Dungeons~\&~Dragons}.
Suppose I have a 4-sided die and a 6-sided die.
I choose one of them at random and roll it.
What is the distribution of the outcome?
If you know which die it is, the answer is easy.
A die with \py{n} sides yields a uniform distribution from 1 to \py{n}, including both.
We can compute \py{Pmf} objects to represent the dice, like this:
\begin{code}
d4 = make_die(4)
d6 = make_die(6)
\end{code}
To compute the distribution of the mixture, we can compute the average of the two distributions by adding them and dividing the result by 2:
\begin{code}
total = Pmf.add(d4, d6, fill_value=0) / 2
\end{code}
We have to use \py{Pmf.add} with \py{fill_value=0} because the two distributions don't have the same set of quantities.
If they did, we could use the \py{+} operator.
Now suppose I have a 4-sided die and {\it two} 6-sided dice.
Again, I choose one of them at random and roll it.
What is the distribution of the outcome?
We can solve this problem by computing a weighted average of the distributions, like this:
\begin{code}
total = Pmf.add(d4, 2*d6, fill_value=0) / 3
\end{code}
Finally, suppose we have a box with the following mix:
\begin{verbatim}
1 4-sided die
2 6-sided dice
3 8-sided dice
\end{verbatim}
If I draw a die from this mix at random, we can use a \py{Pmf} to represent the hypothetical number of sides on the die:
\begin{code}
hypos = [4,6,8]
counts = [1,2,3]
pmf_dice = Pmf(counts, hypos)
pmf_dice.normalize()   # convert counts to probabilities
\end{code}
And I'll make a sequence of \py{Pmf} objects to represent the dice:
\begin{code}
dice = [make_die(sides) for sides in hypos]
\end{code}
Now we have to multiply each distribution in \py{dice} by the corresponding probabilities in \py{pmf_dice}.
To express this computation concisely, it is convenient to put the distributions into a Pandas DataFrame:
\begin{code}
pd.DataFrame(dice)
\end{code}
The result is a DataFrame with one row for each distribution and one column for each possible outcome.
Not all rows are the same length, so Pandas fills the extra spaces with the special value \py{NaN}, which stands for ``not a number''.
We can use \py{fillna} to replace the \py{NaN} values with 0:
\begin{code}
pd.DataFrame(dice).fillna(0)
\end{code}
Before we multiply by the probabilities in \py{pmf_dice}, we have to transpose the matrix so the distributions run down the columns rather than across the rows:
\begin{code}
df = pd.DataFrame(dice).fillna(0).transpose()
\end{code}
Now we can multiply by the probabilities:
\begin{code}
df *= pmf_dice.ps
\end{code}
And add up the weighted distributions:
\begin{code}
total = df.sum(axis=1)
\end{code}
The argument \py{axis=1} means we want to sum across the rows.
The result is a Pandas Series.
Putting it all together, here's a function that makes a weighted mixture of distributions.
\begin{code}
def make_mixture(pmf, pmf_seq):
    df = pd.DataFrame(pmf_seq).fillna(0).transpose()
    df *= pmf.ps
    total = df.sum(axis=1)
    return Pmf(total)
\end{code}
The first parameter is a \py{Pmf} that maps from each hypothesis to a probability.
The second parameter is a sequence of \py{Pmf} objects, one for each hypothesis.
We can call it like this:
\begin{code}
mix = make_mixture(pmf_dice, dice)
\end{code}
Figure~\ref{fig06-04} shows the result, which is a mixture of uniform distributions.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig06-04.pdf}}
\caption{Mixture of uniform distributions from three kinds of dice.}
\label{fig06-04}
\end{figure}
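The mixture can also be worked out by hand, which makes a useful check. The sketch below uses dictionaries and exact fractions; unlike the \py{make_mixture} above, this version takes a list of weights (here the counts of each kind of die) and normalizes them itself. That signature is an assumption of the example.

```python
from fractions import Fraction

def make_die(sides):
    """Map each outcome 1..sides to probability 1/sides."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def make_mixture(weights, pmf_seq):
    """Weighted mixture; weights are normalized so they add up to 1."""
    total_weight = sum(weights)
    mix = {}
    for w, pmf in zip(weights, pmf_seq):
        for q, p in pmf.items():
            # Outcomes a die cannot produce simply contribute nothing.
            mix[q] = mix.get(q, 0) + Fraction(w, total_weight) * p
    return mix

hypos = [4, 6, 8]
counts = [1, 2, 3]
dice = [make_die(sides) for sides in hypos]
mix = make_mixture(counts, dice)
print(mix[1], mix[5], mix[8])   # 23/144 17/144 1/16
```

An outcome of 1 can come from any of the dice, so its probability is $\frac{1}{6}\cdot\frac{1}{4} + \frac{2}{6}\cdot\frac{1}{6} + \frac{3}{6}\cdot\frac{1}{8} = \frac{23}{144}$; an 8 can only come from an 8-sided die, so its probability is $\frac{3}{6}\cdot\frac{1}{8} = \frac{1}{16}$.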
\section{Summary}
We have seen two representations of distributions: Pmfs and Cdfs.
These representations are equivalent in the sense that they contain
the same information, so you can convert from one to the other. The
primary difference between them is performance: some operations are
faster and easier with a Pmf; others are faster with a Cdf.
\index{Pmf} \index{Cdf}
In this chapter we used \py{Cdf} objects to compute distributions of maxima and minima; these distributions are useful for inference if we are given a maximum or minimum as data.
We also computed mixtures of distributions, which we will use in the next chapter to make predictions.
\section{Exercises}
The code for this chapter is in \py{chap06.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap06.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
When you generate a {\it Dungeons~\&~Dragons} character, instead of rolling dice, you can use the ``standard array'' of attributes, which is 15, 14, 13, 12, 10, and 8.
Do you think you are better off using the standard array or (literally) rolling the dice?
Compare the distribution of the values in the standard array to the distribution we computed for the best three out of four:
\begin{itemize}
\item Which distribution has higher mean? Use the \py{mean} method.
\item Which distribution has higher standard deviation? Use the \py{std} method.
\item The lowest value in the standard array is 8. For each attribute, what is the probability of getting a value less than 8? If you roll the dice six times, what's the probability that at least one of your attributes is less than 8?
\item The highest value in the standard array is 15. For each attribute, what is the probability of getting a value greater than 15? If you roll the dice six times, what's the probability that at least one of your attributes is greater than 15?
\end{itemize}
\end{exercise}
\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
If I roll the same die again, what is the probability that I get another 1?
Hint: Compute the posterior distribution as we have done before and pass it as one of the arguments to \py{make_mixture}.
\end{exercise}
\begin{exercise}
Suppose I have two boxes of dice:
\begin{itemize}
\item One contains a 4-sided die and a 6-sided die.
\item The other contains a 6-sided die and an 8-sided die.
\end{itemize}
I choose a box at random, choose a die, and roll it 3 times. If I get 2, 4, and 6, which box do you think I chose?
\end{exercise}
\newcommand{\Poincare}{Poincar\'{e}}
\begin{exercise}
Henri \Poincare~was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.
Supposedly \Poincare~suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.
For the next year, \Poincare~continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.
Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving \Poincare~the heavier ones.
To see whether this anecdote is plausible, let's suppose that when the baker sees \Poincare~coming, he hefts \py{n} loaves of bread and gives \Poincare~the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?
\end{exercise}
\begin{exercise}
Two doctors fresh out of medical school are arguing about whose hospital delivers more babies. The first doctor says, ``I've been at Hospital A for two weeks, and already we've had a day when we delivered 20 babies.''
The second doctor says, ``I've only been at Hospital B for one week, but already there's been a 19-baby day.''
Which hospital do you think delivers more babies on average? You can assume that the number of babies born in a day is well modeled by a Poisson distribution with parameter $\lambda$ (see \url{https://en.wikipedia.org/wiki/Poisson_distribution}).
\end{exercise}
\begin{exercise}
This question is related to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.
Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.
There are two traffic lights on the route. As I approach each light, there is a 40\% chance that it is green; in that case, it causes no delay. And there is a 60\% chance it is red; in that case it causes a delay that is uniformly distributed from 0 to 60 seconds.
What is the posterior distribution of the time it would take to drive the route with no delays?
\end{exercise}
\chapter{Poisson Processes}
\label{prediction}
\newcommand{\lam}{\mathtt{\lambda}}
\section{The World Cup Problem}
In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2. Based on this outcome:
\begin{enumerate}
\item How confident should we be that France is the better team?
\item If the same teams played again, what is the chance France would win again?
\end{enumerate}
To answer these questions, we have to make some modeling decisions.
First, I'll assume that for any team against any other team there is some unknown goal-scoring rate, measured in goals per game, which I'll denote
$\lam$.
Second, I'll assume that a goal is equally likely during any minute of a game. So, in a 90-minute game, the probability of scoring during any minute is $\lam / 90$.
Third, I'll assume that a team never scores twice during the same minute.
Of course, none of these assumptions is absolutely true in the real world, but I think they are reasonable simplifications, and as we will see, they allow us to derive some useful results.
As George Box said, ``All models are wrong; some are useful''
(see \url{https://en.wikipedia.org/wiki/All_models_are_wrong}).
My strategy for answering these questions is:
\begin{enumerate}
\item Use statistics from previous games to choose a prior
distribution for $\lam$.
\item Use the score from the game to estimate $\lam$ for each team.
\item Use the posterior distributions of $\lam$ to compute the
distribution of goals for each team and the probability that each team wins
the next game.
\end{enumerate}
\section{Poisson processes}
In mathematical statistics, a {\bf process} is a stochastic model of a
physical system (``stochastic'' means that the model has some kind of
randomness in it).
For example, a {\bf Bernoulli process} is a model of a
sequence of events, called trials, in which each trial has two
possible outcomes, usually called success and failure.
So a Bernoulli process
is a natural model for a series of coin flips, or a series of shots on
goal.
\index{process}
\index{Bernoulli process}
A {\bf Poisson process} is the continuous version of a Bernoulli process,
where an event can occur at any point in time with equal probability.
Poisson processes can be used to model customers arriving in a store,
buses arriving at a bus stop, or goals scored in a soccer game.
\index{Poisson process}
In many real systems the probability of an event changes over time.
Customers are more likely to go to a store at certain times of day,
buses are supposed to arrive at fixed intervals, and goals are more
or less likely at different times during a game.
But all models are based on simplifications, and in this case modeling
a soccer game with a Poisson process is a reasonable choice. Heuer,
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
and come to the same conclusion (see
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}).
The benefit of using this model is that we can compute the distribution
of goals per game efficiently, as well as the distribution of time
between goals. Specifically, if the average number of goals
in a game is $\lam$, the distribution of goals per game is
given by the Poisson PMF:
\index{Poisson distribution}
\[ f(k; \lam) = \lam^k \exp(-\lam) ~/~ k! \]
And the distribution of time between goals is given by the
exponential PDF:
\index{exponential distribution}
\[ f(t; \lam) = \lam \exp(-\lam t) \]
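To connect the Poisson PMF to the modeling assumptions from the start of the chapter, here is a minimal sketch that is not part of the original analysis. If a team can score at most once per minute, with probability $\lam / 90$ each minute, the number of goals in 90 minutes follows a binomial distribution, and the Poisson PMF should be close to it. The helper names \py{binomial_pmf} and \py{poisson_pmf} are hypothetical, and the sketch uses only the standard library:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson PMF: lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 1.4          # goals per game, the value used in this chapter
n = 90             # one trial per minute of play

# Largest pointwise gap between the two PMFs for 0 to 10 goals
max_diff = max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(11))
print(max_diff)
```

With these values the two PMFs differ by less than half a percentage point at every \py{k}, which is one way to see why the per-minute assumptions lead to a Poisson model.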
Let's start with the Poisson distribution.
\section{The Poisson Distribution}
Suppose we know that the goal-scoring rate for one team against another is $\lam = 1.4$ goals per game.
The following function computes the Poisson distribution of \py{k}, the number of goals the team scores in one game.
\begin{code}
from scipy.stats import poisson
def make_poisson_pmf($\lam$, high):
    qs = np.arange(high)
    ps = poisson.pmf(qs, $\lam$)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
The first parameter is the goal-scoring rate.
The second is the upper bound of the distribution.
In theory the Poisson distribution goes to infinity, but we can cut it off when we get to quantities with negligible probability.
As usual, the \py{qs} are the quantities in the distribution and the \py{ps} are their probabilities.
SciPy provides \py{poisson}, which has a function called \py{pmf} that evaluates the PMF of the Poisson distribution.
The return value is a normalized \py{Pmf}.
We can call \py{make_poisson_pmf} like this:
\begin{code}
pmf_goals = make_poisson_pmf($\lam$=1.4, high=10)
\end{code}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig07-01.pdf}}
\caption{Poisson distribution with $\lam=1.4$.}
\label{fig07-01}
\end{figure}
Figure~\ref{fig07-01} shows the result, a Poisson distribution with $\lam=1.4$.
The most likely outcomes are 0, 1, and 2; higher values are possible but increasingly unlikely.
Values above 7 are negligible.
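As a rough check on these statements, the following sketch evaluates the Poisson PMF directly with the standard library rather than with SciPy. The helper \py{poisson\_pmf} and the variable names are hypothetical; the sketch ranks the outcomes and measures how much probability is lost by cutting the distribution off at \py{high}:

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF: lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 1.4
high = 10

ps = [poisson_pmf(k, lam) for k in range(high)]

# The three most probable goal counts, in increasing order
ranked = sorted(range(high), key=lambda k: ps[k], reverse=True)
top3 = sorted(ranked[:3])

# Probability of more than 7 goals, and mass lost by stopping before `high`
tail_above_7 = 1 - sum(ps[:8])
truncated_mass = 1 - sum(ps)
print(top3, tail_above_7, truncated_mass)
```

Because the lost mass is so small, normalizing the truncated distribution changes the probabilities very little.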
If we know the goal-scoring rate, we can predict the number of goals.
Now let's turn it around: given a number of goals, what can we say about the goal-scoring rate?
To answer that, we need to think about the prior distribution of $\lam$.
And for that, I am going to use a gamma distribution.
\section{The Gamma Distribution}
If you have ever seen a soccer game, you have some information about $\lam$.
In most games, teams score a few goals each.
In rare cases, a team might score more than 5 goals, but they almost never score more than 10.
Using data from previous World Cups
I estimate that each team scores about 1.4 goals per game, on average (see \url{https://www.statista.com/statistics/269031/goals-scored-per-game-at-the-fifa-world-cup-since-1930/}). So I'll set the mean of $\lam$ to be 1.4.
For a good team against a bad one, we expect $\lam$ to be higher; for a bad team against a good one, we expect it to be lower.
To model the distribution of goal-scoring rates, I will use a gamma distribution, which I chose because:
\begin{enumerate}
\item The goal-scoring rate is a continuous quantity that cannot be less than 0; the gamma distribution is appropriate for this kind of quantity.
\item With the scale fixed at 1, which is the default in SciPy, the gamma distribution has only one parameter, $\alpha$, which is also the mean. So it's easy to construct a gamma distribution with the mean we want.
\item As we'll see, the shape of the gamma distribution is a reasonable choice, given what we know about soccer.
\end{enumerate}
\end{enumerate}
For more about the gamma distribution, see \url{https://en.wikipedia.org/wiki/Gamma_distribution}.
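For reference, here is the PDF of the gamma distribution in the form used in this chapter, with shape parameter $\alpha$ and the scale fixed at 1:
\[ f(\lam; \alpha) = \lam^{\alpha-1} \exp(-\lam) ~/~ \Gamma(\alpha) \]
The mean of this distribution is $\alpha$, so choosing $\alpha = 1.4$ gives a prior with mean 1.4 goals per game.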
The gamma distribution is continuous, but we'll approximate it with a discrete \py{Pmf}.
SciPy provides \py{gamma}, which provides \py{pdf}, which evaluates the {\bf probability density function} (PDF) of the gamma distribution.
\newcommand{\alf}{\mathtt{\alpha}}
\begin{code}
from scipy.stats import gamma
$\alf$ = 1.4
qs = np.linspace(0, 10, 101)
ps = gamma.pdf(qs, $\alf$)
\end{code}
The \py{qs} are possible values of $\lam$ from 0 to 10.
The \py{ps} are probability densities, which we can think of as unnormalized probabilities.
If we put the densities in a \py{Pmf} and normalize them, like this:
\begin{code}
prior = Pmf(ps, qs)
prior.normalize()
\end{code}
The result is a discrete approximation of a continuous distribution.
Figure~\ref{fig07-02} shows what it looks like.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig07-02.pdf}}
\caption{A gamma prior distribution of goal-scoring rate.}
\label{fig07-02}
\end{figure}
This distribution represents our prior knowledge about goal scoring: $\lam$ is usually less than 2, occasionally as high as 6, and seldom higher than that. And the mean is about 1.4.
As usual, reasonable people could disagree about the details of the prior, but this is good enough to get started.
Let's do an update.
\section{Update}
Now that we have a prior, the next step is to compute the likelihood of the data.
For France, the data is the number of goals scored, 4.
We can use the Poisson distribution to compute the likelihoods:
\begin{code}
$\lam$s = prior.qs
k = 4
likelihood = poisson.pmf(k, $\lam$s)
\end{code}
The result is a NumPy array with the likelihood of the data for each hypothetical value of $\lam$.
So we can do the update like this:
\begin{code}
def update_poisson(pmf, data):
    k = data
    $\lam$s = pmf.qs
    likelihood = poisson.pmf(k, $\lam$s)
    pmf *= likelihood
    pmf.normalize()
\end{code}
The first parameter is the prior; the second is the number of goals.
We can use this function to compute posterior distributions for France and Croatia:
\begin{code}
france = prior.copy()
update_poisson(france, 4)
croatia = prior.copy()
update_poisson(croatia, 2)
\end{code}
Figure~\ref{fig07-03} shows the results.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig07-03.pdf}}
\caption{Posterior distributions of goal-scoring rate for France and Croatia.}
\label{fig07-03}
\end{figure}
Recall that the mean of the prior distribution is 1.4.
After Croatia scores 2 goals, their posterior mean is 1.7, which is near the midpoint of the prior and the data.
Likewise after France scores 4 goals, their posterior mean is 2.7.
These results are typical of a Bayesian update: the location of the posterior distribution is a compromise between the prior and the data.
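As a cross-check on these posterior means, here is a sketch that repeats the grid update with the standard library and compares it with a closed form. It relies on a standard result that this section does not derive: a gamma prior with shape $\alpha$ and scale 1, updated with a Poisson likelihood for \py{k} goals in one game, has posterior mean $(\alpha + k) / 2$. The helper names are hypothetical:

```python
import math

def gamma_pdf(x, alpha):
    """Gamma density with shape alpha and scale 1."""
    return x ** (alpha - 1) * math.exp(-x) / math.gamma(alpha)

def poisson_pmf(k, lam):
    """Poisson PMF: lam^k exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def posterior_mean(alpha, k, qs):
    """Grid update: prior times likelihood, normalized, then averaged."""
    unnorm = [gamma_pdf(q, alpha) * poisson_pmf(k, q) for q in qs]
    total = sum(unnorm)
    return sum(q * u for q, u in zip(qs, unnorm)) / total

qs = [i / 10 for i in range(101)]   # same grid as the prior: 0 to 10
alpha = 1.4

mean_france = posterior_mean(alpha, 4, qs)    # closed form: (1.4 + 4) / 2
mean_croatia = posterior_mean(alpha, 2, qs)   # closed form: (1.4 + 2) / 2
print(mean_france, mean_croatia)
```

In the closed form, the prior mean and the observed number of goals get equal weight, which is why each posterior mean lands at the midpoint.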
\section{Probability of Superiority}
Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?
In the model, ``better'' means having a higher goal-scoring rate against the opponent.
We can use the posterior distributions to compute the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.
One way to do that is to enumerate all pairs of values from the two distributions, adding up the total probability that one value exceeds the other, as in this function:
\begin{code}
def prob_gt(pmf1, pmf2):
    total = 0
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            if q1 > q2:
                total += p1 * p2
    return total
\end{code}
This is similar to the method we use in Section~\ref{addends} to compute the distribution of a sum.
Here's how we use it:
\begin{code}
prob_gt(france, croatia)
\end{code}
\py{Pmf} provides a function that does the same thing, which we can call like this:
\begin{code}
Pmf.prob_gt(france, croatia)
\end{code}
The result is close to 75\%. So, on the basis of this game, we are reasonably confident that France is the better team.
Of course, we should remember that this result is based on the assumption that the goal-scoring rate is constant.
In reality, if a team is down by one goal, they might play more aggressively toward the end of the game, making them more likely to score, but also more likely to give up an additional goal.
As always, the results are only as good as the model.
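To see the enumeration in \py{prob\_gt} at work on a case small enough to check by hand, here is a sketch that uses plain dictionaries in place of \py{Pmf} objects. The dice example is hypothetical and has nothing to do with soccer; it only exercises the same nested loop:

```python
from fractions import Fraction

def prob_gt(pmf1, pmf2):
    """Probability that a draw from pmf1 exceeds a draw from pmf2.

    Each pmf is a dict that maps quantities to probabilities.
    """
    total = 0
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            if q1 > q2:
                total += p1 * p2
    return total

# Hypothetical example: a six-sided die against a four-sided die
d6 = {q: Fraction(1, 6) for q in range(1, 7)}
d4 = {q: Fraction(1, 4) for q in range(1, 5)}

p_gt = prob_gt(d6, d4)      # 14 of the 24 equally likely pairs
p_lt = prob_gt(d4, d6)      # 6 of the 24 pairs
p_eq = 1 - p_gt - p_lt      # the 4 remaining pairs are ties
print(p_gt, p_lt, p_eq)
```

The three probabilities add up to 1, which is a useful sanity check when comparing any two distributions this way.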
\section{The distribution of goals}
Now we can take on the second question: If the same teams played again, what is the chance France would win the rematch?
To answer this question, we'll generate a {\bf posterior predictive distribution} for each team, which is the distribution of the number of goals we expect them to score.
If we knew the goal-scoring rate, $\lam$, the distribution of goals would be a Poisson distribution with parameter $\lam$.
Since we don't know $\lam$, the distribution of goals is a mixture of Poisson distributions with different values of $\lam$.
First I'll generate a sequence of Poisson distributions, one for each hypothetical value of $\lam$:
\begin{code}
pmf_seq = [make_poisson_pmf($\lam$, 12) for $\lam$ in prior.qs]
\end{code}
Now we can use \py{make_mixture} from Section~\ref{mixture} to compute posterior predictive distributions for France and Croatia:
\begin{code}
pred_france = make_mixture(france, pmf_seq)
pred_croatia = make_mixture(croatia, pmf_seq)
\end{code}
Figure~\ref{fig07-04} shows posterior predictive distributions for the number of goals in a rematch.
\begin{figure}
\centerline{\includegraphics[width=5.5in]{figs/fig07-04.pdf}}
\caption{Posterior predictive distributions for the number of goals in a rematch.}
\label{fig07-04}
\end{figure}
These distributions represent two sources of uncertainty: we don't know the actual value of $\lam$, and even if we did, we would not know the number of goals in the next game.
We can use these distributions to compute the probability that France wins, loses, or ties the rematch:
\begin{code}
win = Pmf.prob_gt(pred_france, pred_croatia)
lose = Pmf.prob_lt(pred_france, pred_croatia)
tie = Pmf.prob_eq(pred_france, pred_croatia)
\end{code}
Assuming that France wins half of the ties, their chance of winning the rematch is about 65\%.
This is a bit lower than their probability of superiority, which is 75\%. And that makes sense: even if they are the better team, they might lose the game.
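The mixture and the tie-splitting step can be sketched without \py{Pmf} or \py{make\_mixture}. In the following example, the function names and the two-point posteriors are hypothetical, chosen only to keep the arithmetic small; they are not the posteriors for France and Croatia:

```python
import math

def poisson_pmf(k, lam):
    """Poisson PMF: lam^k exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

def predictive(posterior, high):
    """Mix Poisson PMFs, weighting each rate by its posterior probability.

    posterior: dict that maps each hypothetical rate to its probability
    high: upper bound (exclusive) on the number of goals
    """
    return [sum(p * poisson_pmf(k, lam) for lam, p in posterior.items())
            for k in range(high)]

def prob_win_with_ties_split(pred_a, pred_b):
    """P(a > b) plus half of P(a == b) for independent goal counts."""
    win = sum(pa * pb for i, pa in enumerate(pred_a)
              for j, pb in enumerate(pred_b) if i > j)
    tie = sum(pa * pb for pa, pb in zip(pred_a, pred_b))
    return win + tie / 2

# Hypothetical two-point posteriors for two teams
post_a = {2.0: 0.5, 3.0: 0.5}
post_b = {1.0: 0.5, 2.0: 0.5}

pred_a = predictive(post_a, 30)
pred_b = predictive(post_b, 30)
mean_a = sum(k * p for k, p in enumerate(pred_a))
p_a_wins = prob_win_with_ties_split(pred_a, pred_b)
print(mean_a, p_a_wins)
```

The mean of the predictive distribution equals the posterior mean of the rate, here 2.5, but the distribution is wider than any single Poisson distribution because it also carries the uncertainty about $\lam$.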
\section{The Exponential Distribution}
\label{exponential}
As an exercise at the end of this chapter, you'll have a chance to work on this variation on the World Cup Problem:
\begin{quote}
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
What was the probability that they would score 5 more goals (as, in fact, they did)?
\end{quote}
In this version, notice that the data is not the number of goals in a fixed period of time but the time between goals.
To compute the likelihood of data like this, we can use the theory of Poisson processes again.
In our model of a soccer game, we assume that each team has a goal-scoring rate, $\lam$, in goals per game.
And we assume that $\lam$ is constant, so the chance of scoring a goal is the same at any moment of the game.
Under these assumptions, the time between goals follows an exponential distribution (see \url{https://en.wikipedia.org/wiki/Exponential_distribution}).
If the goal-scoring rate is $\lam$, the probability of seeing an interval between goals of $t$ is proportional to the PDF of the exponential distribution:
\[ f(t; \lam) = \lam \exp(-\lam t) \]
Because $t$ is a continuous quantity, the value of this expression is not a probability; it is a probability density.
However, it is proportional to the probability of the data, so we can use it as a likelihood in a Bayesian update.
The following function computes this PDF:
\begin{code}
def expo_pdf(t, $\lam$):
    return $\lam$ * np.exp(-$\lam$ * t)
\end{code}
To see what exponential distributions look like, let's assume again that $\lam$ is 1.4; we can compute the distribution of $t$ like this:
\begin{code}
$\lam$ = 1.4
qs = np.linspace(0, 4, 101)
ps = expo_pdf(qs, $\lam$)
pmf_time = Pmf(ps, qs)
pmf_time.normalize()
\end{code}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig07-05.pdf}}
\caption{An exponential distribution with $\lam = 1.4$.}
\label{fig07-05}
\end{figure}
Figure~\ref{fig07-05} shows the result.
It is counterintuitive, but true, that the most likely time to score a goal is immediately. After that, the probability of each possible interval is a little lower.
With a goal-scoring rate of 1.4, it is possible that a team will take more than one game to score a goal, but it is unlikely that they will take more than two games.
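To put numbers on that last statement, we can integrate the exponential PDF from $t$ to infinity, which gives the probability of waiting longer than $t$: $\exp(-\lam t)$. The following sketch evaluates it for one and two games; the function name \py{prob\_wait\_longer} is hypothetical:

```python
import math

def prob_wait_longer(t, lam):
    """P(T > t) for an exponential distribution with rate lam."""
    return math.exp(-lam * t)

lam = 1.4                                     # goals per game
p_more_than_one = prob_wait_longer(1, lam)    # about 0.25
p_more_than_two = prob_wait_longer(2, lam)    # about 0.06
print(p_more_than_one, p_more_than_two)
```

So under this model a team goes a full game without scoring about a quarter of the time, and two full games about 6\% of the time.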
\section{Summary}
\section{Exercises}
The code for this chapter is in \py{chap07.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap07.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Finish off the exercise from Section~\ref{exponential}:
\begin{quote}
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
What was the probability that they would score 5 more goals (as, in fact, they did)?
\end{quote}
\end{exercise}
\begin{exercise}
\end{exercise}
\begin{exercise}
In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
Bruins played a best-of-seven championship series against the despised
Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then
won the next two games 8-1 and 4-0. At this point in the series, what
is the probability that Boston will win the next game, and what is
their probability of winning the championship?
To choose a prior distribution, I got some statistics from
\url{http://www.nhl.com}, specifically the average goals per game
for each team in the 2010-11 season. The distribution is well modeled by a gamma distribution with mean 2.8.
\index{National Hockey League}
\index{NHL}
\index{hockey}
\index{Boston Bruins}
\index{Vancouver Canucks}
\end{exercise}
\begin{exercise}
If buses arrive at a bus stop every 20 minutes, and you
arrive at the bus stop at a random time, your wait time until
the bus arrives is uniformly distributed from 0 to 20 minutes.
\index{bus stop problem}
But in reality, there is variability in the time between
buses. Suppose you are waiting for a bus, and you know the historical
distribution of time between buses. Compute your distribution
of wait times.
Hint: Suppose that the time between buses is either
5 or 10 minutes with equal probability. What is the probability
that you arrive during one of the 10 minute intervals?
I solve a version of this problem in the next chapter.
\end{exercise}
\begin{exercise}
Suppose that passengers arriving at the bus stop are well-modeled
by a Poisson process with parameter $\lam$. If you arrive at the
stop and find 3 people waiting, what is your posterior distribution
for the time since the last bus arrived?
\index{Poisson process}
\index{bus stop problem}
I solve a version of this problem in the next chapter.
\end{exercise}
\begin{exercise}
Suppose that you are an ecologist sampling the insect population in
a new environment. You deploy 100 traps in a test area and come back
the next day to check on them. You find that 37 traps have been
triggered, trapping an insect inside. Once a trap triggers, it
cannot trap another insect until it has been reset.
\index{insect sampling problem}
If you reset the traps and come back in two days, how many traps
do you expect to find triggered? Compute a posterior predictive
distribution for the number of traps.
\index{predictive distribution}
\end{exercise}
\begin{exercise}
Suppose you are the manager of an apartment building with
100 light bulbs in common areas. It is your responsibility
to replace light bulbs when they break.
\index{light bulb problem}
On January 1, all 100 bulbs are working. When you inspect
them on February 1, you find 3 light bulbs out. If you
come back on April 1, how many light bulbs do you expect to
find broken?
In the previous exercise, you could reasonably assume that an event is
equally likely at any time. For light bulbs, the likelihood of
failure depends on the age of the bulb. Specifically, old bulbs
have an increasing failure rate due to evaporation of the filament.
This problem is more open-ended than some; you will have to make
modeling decisions. You might want to read about the Weibull
distribution
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
Or you might want to look around for information about
light bulb survival curves.
\index{Weibull distribution}
\end{exercise}
\chapter{Decision Analysis}
\label{decisionanalysis}
In this chapter, we estimate the price of prizes on a game show.
Once we compute a posterior distribution, we'll use it to optimize a decision-making process.
This example demonstrates the real power of Bayesian methods, not just computing posterior distributions, but using them to make better decisions.
\section{The {\it Price is Right} problem}
On November 1, 2007, contestants named Letia and Nathaniel appeared
on {\it The Price is Right}, an American game show. They competed in
a game called {\it The Showcase}, where the objective is to guess the price
of a showcase of prizes. The contestant who comes closest to the
actual price of the showcase, without going over, wins the prizes.
\index{Price is Right}
\index{Showcase}
Nathaniel went first. His showcase included a dishwasher, a wine
cabinet, a laptop computer, and a car. He bid \$26,000.
Letia's showcase included a pinball machine, a video arcade game, a
pool table, and a cruise of the Bahamas. She bid \$21,500.
The actual price of Nathaniel's showcase was \$25,347. His bid
was too high, so he lost.
The actual price of Letia's showcase was \$21,578. She was only
off by \$78, so she won her showcase and, because
her bid was off by less than \$250, she also won Nathaniel's
showcase.
For a Bayesian thinker, this scenario suggests several questions:
\begin{enumerate}
\item Before seeing the prizes, what prior beliefs should the
contestant have about the price of the showcase?
\item After seeing the prizes, how should the contestant update
those beliefs?
\item Based on the posterior distribution, what should the
contestant bid?
\end{enumerate}
The third question demonstrates a common use of Bayesian analysis:
decision analysis. Given a posterior distribution, we can choose
the bid that maximizes the contestant's expected return.
\index{decision analysis}
This problem is inspired by an example in Cameron Davidson-Pilon's
book, {\it Probabilistic Programming and Bayesian Methods for Hackers}
(see \url{http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers}).
\index{Davidson-Pilon, Cameron}
\section{The prior}
To choose a prior distribution of prices, we can take advantage
of data from previous episodes.
Fortunately, fans of the show keep detailed records (see \url{https://web.archive.org/web/20121107204942/http://www.tpirsummaries.8m.com/}).
For this example, I downloaded files containing the price of each showcase from the 2011 and 2012 seasons and the bids offered by the contestants.
This dataset contains the prices for 313 previous showcases, which we can think of as a sample from the population of possible prices.
We can use this sample to estimate the prior distribution of showcase prices.
One way to do that is {\bf kernel density estimation} (KDE), which uses the sample to estimate a smooth distribution.
SciPy provides \py{gaussian_kde}, which takes a sample and returns an object that represents the estimated distribution.
\index{kernel density estimation}
\index{KDE}
The following function takes a sample, makes a KDE, evaluates it at a given sequence of quantities, and returns the result as a normalized \py{Pmf}:
\begin{code}
from scipy.stats import gaussian_kde
def make_kde(qs, sample):
    kde = gaussian_kde(sample)
    ps = kde(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
We can use it to estimate the distribution of total price for Showcase 1:
\begin{code}
qs = np.linspace(0, 80000, 81)
prior1 = make_kde(qs, df['Showcase 1'])
\end{code}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-01.pdf}}
\caption{Distribution of total price for Showcase 1}
\label{fig08-01}
\end{figure}
Figure~\ref{fig08-01} shows the estimated distribution.
The most common price is around
\$28,000, but there might be a second mode near \$50,000.
If you were a contestant on the
show, you could use this distribution to quantify your prior belief
about the price of each showcase (before you see the prizes).
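For reference, here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1, which is the kernel that \py{gaussian_kde} scales and centers on each value in the sample: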
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
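To show how a Gaussian PDF like this one turns a sample into a smooth distribution, here is a hand-rolled sketch of kernel density estimation. It is not how \py{gaussian\_kde} is implemented: SciPy chooses the bandwidth from the data, while this sketch uses a bandwidth picked by hand. The sample, the bandwidth, and the function names are all hypothetical:

```python
import math

def std_normal_pdf(x):
    """PDF of a Gaussian with mean 0 and standard deviation 1."""
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def kde_density(x, sample, bandwidth):
    """Average of Gaussian kernels centered on the sample points."""
    total = sum(std_normal_pdf((x - s) / bandwidth) for s in sample)
    return total / (len(sample) * bandwidth)

# Hypothetical showcase prices and bandwidth, for illustration only
sample = [20000, 25000, 30000, 50000]
bandwidth = 5000

step = 1000
qs = range(0, 80001, step)
densities = [kde_density(q, sample, bandwidth) for q in qs]

# A density should integrate to about 1 over the grid
area = sum(densities) * step
print(area)
```

A larger bandwidth gives a smoother estimate and a smaller one follows the sample more closely, which is why features like a possible second mode should be read with some caution.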
\section{Modeling the contestants}
When the contestants see the prizes, they get information they can use to update their beliefs.
To do that, we have to answer these questions:
\begin{enumerate}
\item What data should we consider and how should we quantify it?
\item Can we compute a likelihood function; that is,
for each hypothetical value of \py{price}, can we compute
the conditional likelihood of the data?
\end{enumerate}
To answer these questions, I model the contestant
as a price-guessing instrument with known error characteristics.
In other words, when the contestant sees the prizes, they
guess the price of each prize---ideally without taking into
consideration the fact that the prize is part of a showcase---and
add up the prices. Let's call this total \py{guess}.
\index{error}
Under this model, the question we have to answer is, ``If the
actual price is \py{price}, what is the likelihood that the
contestant's estimate would be \py{guess}?''
\index{likelihood}
Or if we define \py{error = guess - price}, we can ask, ``What is the likelihood that the contestant's estimate is off by \py{error}?''
To answer this question, I'll use the historical data again.
For each showcase in the dataset, let's look at the difference between the contestant's bid and the actual price:
\begin{code}
sample_diff1 = df['Bid 1'] - df['Showcase 1']
sample_diff2 = df['Bid 2'] - df['Showcase 2']
\end{code}
To visualize the distribution of these differences, we can use KDE again.
\begin{code}
qs = np.linspace(-40000, 20000, 61)
kde_diff1 = make_kde(qs, sample_diff1)
kde_diff2 = make_kde(qs, sample_diff2)
\end{code}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-02.pdf}}
\caption{Distribution of differences for the two contestants.}
\label{fig08-02}
\end{figure}
Figure~\ref{fig08-02} shows the results.
It looks like the bids are too low more often than too high, which makes sense.
Remember that under the rules of the game, you lose if you overbid, so contestants probably underbid to some degree deliberately.
We can use the observed distribution of differences to model the contestant's distribution of errors.
This step is a little tricky because we don't actually know the contestant's guesses; we only know what they bid.
So we have to make some assumptions:
\begin{enumerate}
\item I'll assume that contestants underbid because they are being strategic, and that on average their guesses are accurate. In other words, the mean of their errors is 0.
\item But I'll assume that the spread of the differences reflects the actual spread of their errors. So, I'll use the standard deviation of the differences as the standard deviation of their errors.
\end{enumerate}
Based on these assumptions, I'll make a normal distribution with mean 0 and standard deviation \py{std_diff1}:
\begin{code}
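from scipy.stats import norm
# spread of the observed differences, used as the spread of the errors
std_diff1 = sample_diff1.std()
error_dist1 = norm(0, std_diff1)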
\end{code}
The result is an object that represents the distribution of errors for Player 1.
Among other things, this object can compute the PDF of a normal distribution, which we will use in the next section.
\index{normal distribution}
This model is not perfect because contestants' bids are sometimes strategic; for example, if Player 2 thinks that Player 1
has overbid, Player 2 might make a very low bid.
In that case \py{diff} does not reflect \py{error}.
If this happens a lot, the observed variance in \py{diff} might overestimate the variance in \py{error}.
Nevertheless, I think it is a reasonable modeling decision.
As an alternative, someone preparing to appear on the show could
estimate their own distribution of \py{error} by watching previous shows
and recording their guesses and the actual prices.
\section{Update}
Now we are ready to do the update.
Suppose you are Player 1. You see the prizes in your showcase and your estimate of the total price is \$23,000.
For each hypothetical price in the prior distribution, I'll subtract the price from your guess.
The result is your error under each hypothesis.
\begin{code}
guess1 = 23000
qs = prior1.index
error1 = guess1 - qs
\end{code}
Now suppose you know based on past performance that your estimation error is well modeled by \py{error_dist1}.
Under that assumption we can compute the likelihood of your estimate under each hypothesis.
\begin{code}
likelihood1 = error_dist1.pdf(error1)
\end{code}
And we can use that likelihood to update the prior.
\begin{code}
posterior1 = prior1 * likelihood1
posterior1.normalize()
\end{code}
Figure~\ref{fig08-03} shows this posterior distribution along with the prior.
Because your estimate is in the lower end of the range, the posterior distribution has shifted to the left.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-03.pdf}}
\caption{Prior and posterior distributions for Player 1.}
\label{fig08-03}
\end{figure}
Based on the prior mean, before you saw the prizes you expected to see a showcase with a value close to \$30,000.
After making an estimate of \$23,000, you updated the prior distribution.
Based on the combination of the prior and your estimate, you now expect the actual price to be about \$26,000.
On one level, this result makes sense.
The posterior mean is near the midpoint of your estimate and the prior mean.
On another level, you might find this result strange because it
suggests that if you {\em think} the price is \$23,000, then you
should {\em believe} the price is \$26,000.
To resolve this apparent paradox, remember that you are combining two
sources of information, historical data about past showcases and
guesses about the prizes you see.
We are treating the historical data as the prior and updating it
based on your guesses, but we could equivalently use your guess
as a prior and update it based on historical data.
If you think of it that way, maybe it is less surprising that the
most likely value in the posterior is not your original guess.
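The following sketch illustrates the compromise between the two sources of information in a simplified setting. It assumes, hypothetically, that the historical prices are bell-shaped around \$30,000 and that the estimation error has the same spread; neither number comes from the dataset. With equal spreads, the posterior mean should fall halfway between the prior mean and the guess:

```python
import math

def normal_pdf(x, mu, sigma):
    """PDF of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-z**2 / 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]

# Hypothetical numbers, not the values estimated from the dataset
qs = range(0, 80001, 1000)
history = normalize([normal_pdf(q, 30000, 8000) for q in qs])

guess = 23000
# Likelihood of the guess under each hypothetical price (error = guess - price)
likelihood = [normal_pdf(guess - q, 0, 8000) for q in qs]

# The product does not depend on which factor we call the prior
posterior = normalize([h * like for h, like in zip(history, likelihood)])
posterior_mean = sum(q * p for q, p in zip(qs, posterior))
print(posterior_mean)
```

If the estimation error were smaller than the spread of past prices, the posterior mean would move closer to the guess; if it were larger, closer to the prior mean.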
\section{Strategy}
Now that we have a posterior distribution, let's think about strategy.
\section{Probability of Winning}
First, from the point of view of Player 1, let's compute the probability that Player 2 overbids.
To keep it simple, I'll use only the performance of past players, ignoring the estimated price of the showcase.
The following function takes a sample of past differences (bid minus actual price) and returns the fraction that are overbids.
\begin{code}
def prob_overbid(sample_diff):
    return np.mean(sample_diff > 0)
\end{code}
In the dataset, Player 2 overbids about 30\% of the time.
Now suppose Player 1 underbids by \$5000.
What is the probability that Player 2 underbids by more?
The following function uses past performance to estimate the probability that a player underbids by more than a given amount, \py{diff}:
\begin{code}
def prob_worse_than(diff, sample_diff):
    return np.mean(sample_diff < diff)
\end{code}
Player 2 underbids by more than \$5000 about 40\% of the time.
We can combine these functions to compute the probability that Player 1 wins, given the difference between their bid and the actual price:
\begin{code}
def compute_prob_win(diff, sample_diff):
    # if you overbid you lose
    if diff > 0:
        return 0
    # if the opponent overbids, you win
    p1 = prob_overbid(sample_diff)
    # or if their bid is worse than yours, you win
    p2 = prob_worse_than(diff, sample_diff)
    return p1 + p2
\end{code}
Let's look at this from your point of view as a contestant.
\py{diff} is the difference between your bid and the actual price; if it's greater than 0, you overbid, so you lose.
\py{sample_diff} is a sample of differences for your opponent.
If they overbid (and you didn't) you win.
Otherwise, we have to see whose bid is closer, yours or your opponent's. If their bid is worse than yours, you win.
As an example, you can call it like this:
\begin{code}
compute_prob_win(-5000, sample_diff2)
\end{code}
If Player 1 underbids by \$5000, their chance of winning is about 67\%.
Now let's look at the probability of winning for a range of possible differences.
\begin{code}
xs = np.linspace(-30000, 5000, 121)
ys = [compute_prob_win(x, sample_diff2) for x in xs]
\end{code}
From the point of view of Player 1, Figure~\ref{fig08-04} shows the probability of winning as a function of the difference between their bid and the actual price.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-04.pdf}}
\caption{For Player 1, the probability of winning as a function of the difference between their bid and the actual price.}
\label{fig08-04}
\end{figure}
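Adding the two probabilities in \py{compute\_prob\_win} is valid because the two events cannot both happen: when your difference is 0 or less, an opponent who overbids has a positive difference, and an opponent who underbids by more has a difference below yours. The following sketch checks the arithmetic on a tiny, hypothetical sample of opponent differences, using plain lists in place of the NumPy-based functions above:

```python
def prob_overbid(sample_diff):
    """Fraction of opponent differences that are overbids."""
    return sum(d > 0 for d in sample_diff) / len(sample_diff)

def prob_worse_than(diff, sample_diff):
    """Fraction of opponent differences that underbid by more than diff."""
    return sum(d < diff for d in sample_diff) / len(sample_diff)

def compute_prob_win(diff, sample_diff):
    if diff > 0:            # you overbid, so you lose
        return 0
    return prob_overbid(sample_diff) + prob_worse_than(diff, sample_diff)

# Hypothetical opponent differences (bid minus price), in dollars
sample_diff = [-10000, -6000, -3000, -1000, 500, 2000]

p_exact = compute_prob_win(0, sample_diff)        # 2/6 + 4/6
p_under = compute_prob_win(-5000, sample_diff)    # 2/6 + 2/6
p_over = compute_prob_win(100, sample_diff)       # overbid, so 0
print(p_exact, p_under, p_over)
```

The probability of winning rises as the bid approaches the actual price from below and drops to 0 as soon as the bid goes over.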
\section{Decision Analysis}
In the previous section we computed the probability of winning given that we have underbid by a particular amount.
In reality the contestants don't know how much they have underbid by because they don't know the actual price.
But they do have a posterior distribution that represents their beliefs about the actual price, and they can use that to estimate their probability of winning with a given bid.
The following function takes a possible bid, a posterior distribution of actual prices, and a sample of differences for the opponent.
\begin{code}
def total_prob_win(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        diff = bid - price
        total += prob * compute_prob_win(diff, sample_diff)
    return total
\end{code}
It loops through the hypothetical prices in the posterior distribution and for each price:
\begin{enumerate}
\item Computes the difference between the bid and the hypothetical price.
\item Computes the probability that the player wins, given that difference.
\item Adds up the weighted sum of the probabilities, where the weights are the probabilities in the posterior distribution.
\end{enumerate}
This loop implements the law of total probability:
\[ \p{win} = \sum_{price} \p{price} ~ \p{win ~|~ price} \]
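As a quick check of the formula, the following sketch applies it to a three-point distribution. The prices, their probabilities, and the conditional probabilities of winning are all made up for illustration; only the weighted sum is the point.

```python
def total_prob(posterior, cond_prob):
    """Law of total probability: sum of P(price) * P(win | price)."""
    return sum(prob * cond_prob[price] for price, prob in posterior.items())


# hypothetical posterior over the actual price
toy_posterior = {20000: 0.25, 25000: 0.5, 30000: 0.25}

# hypothetical P(win | price) for a bid of 24000;
# the bid exceeds 20000, so that case is an overbid and a certain loss
toy_prob_win = {20000: 0.0, 25000: 0.5, 30000: 0.25}

p_win = total_prob(toy_posterior, toy_prob_win)
print(p_win)
```

The three terms are $0.25 \times 0$, $0.5 \times 0.5$, and $0.25 \times 0.25$, which add up to 0.3125.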
Now we can loop through a range of possible bids and compute the probability of winning:
\begin{code}
bids = posterior1.index
probs = [total_prob_win(bid, posterior1, sample_diff2)
         for bid in bids]
\end{code}
For Player 1, Figure~\ref{fig08-05} shows the probability of winning as a function of their bid.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-05.pdf}}
\caption{For Player 1, the probability of winning as a function of their bid.}
\label{fig08-05}
\end{figure}
Recall that your estimate was \$23,000.
After using your estimate to compute the posterior distribution, the posterior mean is about \$26,000.
But the bid that maximizes your chance of winning is \$21,000; with that bid, the probability of winning is 52\%.
\section{Expected Gain}
In the previous section we computed the bid that maximizes your chance of winning.
And if that's your goal, the bid we computed is optimal.
But winning isn't everything.
Remember that if your bid is off by \$250 or less, you win both showcases.
So it might be a good idea to increase your bid a little: it increases the chance you overbid and lose, but it also increases the chance of winning both showcases.
Let's see how that works out.
The following function computes how much you will win, on average, given your bid, the actual price, and a sample of errors for your opponent.
\begin{code}
def compute_gain(bid, price, sample_diff):
    diff = bid - price
    prob = compute_prob_win(diff, sample_diff)

    # if you are within 250 dollars, you win both showcases
    if -250 <= diff <= 0:
        return 2 * price * prob
    else:
        return price * prob
\end{code}
For simplicity, I assume that both showcases have the same value.
Since the probability of winning both showcases is small, the effect of this simplification should be small.
As an example, if the actual price is \$35,000
and you bid \$30,000,
you will win about \$23,600 worth of prizes on average.
In reality we don't know the actual price, but we have a posterior distribution that represents what we know about it.
By averaging over the prices and probabilities in the posterior distribution, we can compute the {\bf expected gain} for a particular bid.
\begin{code}
def expected_gain(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        total += prob * compute_gain(bid, price, sample_diff)
    return total
\end{code}
The first argument is your bid; the second is the posterior distribution that represents your belief about the price of the showcase; and \py{sample_diff} is a sample of differences for your opponent.
For the posterior we computed earlier, based on an estimate of \$23,000,
the expected gain for a bid of \$21,000
is about \$16,900.
But can we do better?
To find out, we can loop through a range of bids and find the one that maximizes expected gain.
\begin{code}
bids = posterior1.index
gains = [expected_gain(bid, posterior1, sample_diff2) for bid in bids]
expected_gain_series = pd.Series(gains, index=bids)
\end{code}
Figure~\ref{fig08-06} shows expected gain for a range of possible bids.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig08-06.pdf}}
\caption{Expected gain for a range of possible bids.}
\label{fig08-06}
\end{figure}
Recall that the estimated value of the prizes is \$23,000 and the bid that maximizes the chance of winning is \$21,000.
The bid that maximizes your expected gain is \$22,000; with that bid, your expected gain is about \$17,400.
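One way to read off the best bid from a \py{Series} like \py{expected_gain_series} is to use \py{idxmax}, which returns the index label of the largest value, and \py{max}, which returns the value itself. Here is a sketch with made-up numbers standing in for the real results.

```python
import pandas as pd

# hypothetical gains indexed by bid; toy values, not the results in the figure
toy_gains = pd.Series([1.0, 3.0, 2.0], index=[10, 20, 30])

best_bid = toy_gains.idxmax()   # index label where the gain is largest
best_gain = toy_gains.max()     # the largest gain itself
print(best_bid, best_gain)
```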
\section{Discussion}
One of the features of Bayesian estimation is that the
result comes in the form of a posterior distribution. Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.
\index{distribution}
In this example, we use the posterior distribution
to compute an optimal bid. The return on a given bid is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically. But it is relatively simple to do
computationally.
\index{decision analysis}
Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate. These summaries can be useful, but if that's
all you need, then you probably don't need Bayesian methods in the
first place.
\index{maximum likelihood}
\index{summary statistic}
Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the analysis to perform some
kind of decision analysis, as we did in this chapter, or some kind of
prediction, as we see in the next chapter.
\section{Exercises}
The code for this chapter is in \py{chap08.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap08.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Following the instructions in the notebook, replicate the analysis in this chapter from the point of view of Player 2.
\end{exercise}
\begin{exercise}
This exercise is inspired by a true story. In 2001 I created Green Tea Press to publish my books, starting with {\em Think Python}.
I ordered 100 copies from a short-run printer and made the book available for sale through a distributor. After the first week, the distributor reported that 12 copies were sold. Based on that report, I thought I would run out of copies in about 8 weeks, so I got ready to order more. My printer offered me a discount if I ordered more than 1000 copies, so I went a little crazy and ordered 2000 copies. A few days later, my mother called to tell me that her copies of the book had arrived. Surprised, I asked how many ``copies''. She said ten.
It turned out I had sold only two copies to non-relatives. And it took a lot longer than I expected to sell 2000 copies.
The details of this story are unique, but the general problem is something almost every retailer has to figure out. Based on past sales, how do you predict future sales? And based on those predictions, how do you decide how much to order and when?
Often the cost of a bad decision is complicated. If you place a lot of small orders rather than one big one, your costs are likely to be higher. If you run out of inventory, you might lose customers. And if you order too much, you have to pay the various costs of holding inventory.
So, let's solve a version of the problem I faced. Suppose you start selling books online. During the first week you sell 12 copies (and let's assume that none of the customers are your mother). During the second week you sell 8 copies.
Assuming that the arrival of orders is a Poisson process, we can think of the weekly orders as samples from a Poisson distribution with an unknown rate.
Choose a prior you think is appropriate and use the data to compute the posterior distribution of the order rate.
Then generate a posterior predictive distribution for the number of copies you expect during the next 8 weeks.
\begin{itemize}
\item Suppose the cost of printing the book is \$5 per copy,
\item But if you order 100 or more, it's \$4.50 per copy.
\item For every book you sell, you get \$10.
\item But if you run out of books before the end of 8 weeks, you lose \$50 in future sales for every week you are out of stock.
\item If you have books left over at the end of 8 weeks, you lose \$2 in inventory costs per extra book.
\end{itemize}
For example, suppose you get orders for 10 books per week, every week.
If you order 60 books,
\begin{itemize}
\item The total cost is \$300.
\item You sell all 60 books, so you make \$600.
\item But the book is out of stock for two weeks, so you lose \$100 in future sales.
\end{itemize}
In total, your profit is \$200.
If you order 100 books,
\begin{itemize}
\item The total cost is \$450.
\item You sell 80 books, again, so you make \$800.
\item But you have 20 books left over at the end, so you lose \$40.
\end{itemize}
In total, your profit is \$310.
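The two worked examples can be checked with a small function. This is only a sketch of one way to encode the rules above: the name \py{compute\_profit} is mine, and I am assuming that a week counts as out of stock when the cumulative orders through that week exceed the number of books printed.

```python
from itertools import accumulate


def compute_profit(printed, weekly_orders):
    """Profit for one scenario, given a print run and a sequence of weekly orders."""
    # printing cost, with the discount for orders of 100 or more
    unit_cost = 4.5 if printed >= 100 else 5
    cost = unit_cost * printed

    # you can only sell what you printed
    sold = min(printed, sum(weekly_orders))
    revenue = 10 * sold

    # assumption: a week is out of stock if cumulative orders exceed the print run
    weeks_out = sum(total > printed for total in accumulate(weekly_orders))
    lost_sales = 50 * weeks_out

    # books left over at the end cost 2 dollars each
    inventory_cost = 2 * (printed - sold)

    return revenue - cost - lost_sales - inventory_cost


steady_orders = [10] * 8   # 10 books per week for 8 weeks
profit_60 = compute_profit(60, steady_orders)
profit_100 = compute_profit(100, steady_orders)
print(profit_60, profit_100)
```

With 10 orders per week, this reproduces the profits of \$200 and \$310 computed above.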
Combining these costs with your predictive distribution, how many books should you order to maximize your expected profit?
In the notebook for this chapter, I provide some code to get you started.
\end{exercise}
\chapter{Comparisons}
\label{comparison}
The Elo rating system is a way to quantify the skill of players for games like chess (see \url{https://en.wikipedia.org/wiki/Elo_rating_system}).
It is based on a model of the relationship between the ratings of players and the outcome of a game.
Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function (see \url{https://en.wikipedia.org/wiki/Logistic_function}):
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
The parameters $10$ and $400$ are arbitrary choices that determine the range of the ratings. In chess, the range is from 100 to 2800.
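To get a feel for the formula, here is a short sketch that evaluates it for two simple cases. The function name and the ratings are chosen for illustration.

```python
def prob_a_beats_b(rating_a, rating_b):
    """Elo model: probability that A beats B, given their ratings."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# equal ratings: the exponent is 0, so the probability is 1/2
p_equal = prob_a_beats_b(1500, 1500)

# A is rated 400 points below B: the exponent is 1, so the probability is 1/11
p_underdog = prob_a_beats_b(1400, 1800)
print(p_equal, p_underdog)
```

So a difference of 400 points corresponds to odds of 10 to 1 in favor of the higher-rated player.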
Suppose \py{A} has a current rating of 1600 and \py{B} has a current rating of 1800.
Then \py{A} and \py{B} play and \py{A} wins. How should we update their ratings?
In this chapter I will solve a simpler version of this question; then you will have a chance to finish it off as an exercise.
This chapter introduces {\bf joint distributions}, which represent the distributions of two or more variables and the relationships among them.
We'll extend the Bayesian update process we've seen in previous chapters and apply it to a joint distribution.
But first I will introduce a tool we will use to construct joint distributions and compute likelihoods: outer operations.
\section{Outer operations}
\label{outer-operations}
Many useful operations can be expressed in the form of an {\bf outer operation} of two sequences.
Suppose you have sequences like \py{t1} and \py{t2}:
\begin{code}
t1 = [1,3,5]
t2 = [2,4]
\end{code}
The most common outer operation is the outer product, which computes the product of every pair of values, one from each sequence.
For example, here is the outer product of \py{t1} and \py{t2}:
\begin{code}
a = np.multiply.outer(t1, t2)
\end{code}
The result is a NumPy array, but it's easier to understand what it is if I put it in a DataFrame:
\begin{code}
df = pd.DataFrame(a, index=t1, columns=t2)
\end{code}
Here's the result:
\input{tables/table09-01}
The values from \py{t1} appear along the rows; the values from \py{t2} appear along the columns.
Each element in the array is the product of an element from \py{t1} and an element from \py{t2}.
The outer sum is similar, except that each element is the {\em sum} of an element from \py{t1} and an element from \py{t2}.
\begin{code}
a = np.add.outer(t1, t2)
df = pd.DataFrame(a, index=t1, columns=t2)
\end{code}
Here's the result:
\input{tables/table09-02}
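For reference, here is a self-contained sketch that computes both arrays for \py{t1} and \py{t2}, so you can check the elements by hand.

```python
import numpy as np

t1 = [1, 3, 5]
t2 = [2, 4]

# each row corresponds to an element of t1, each column to an element of t2
product = np.multiply.outer(t1, t2)
total = np.add.outer(t1, t2)

print(product)
print(total)
```

Both results have three rows and two columns; for example, the last row of the outer product is 10 and 20, and the last row of the outer sum is 7 and 9.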
These outer operations work with Python lists and tuples, and NumPy arrays, but not Pandas \py{Series}.
So I'll use the following function, which takes two Pandas \py{Series} and puts the result into a \py{DataFrame}.
\begin{code}
def outer_product(s1, s2):
    a = np.multiply.outer(s1.to_numpy(), s2.to_numpy())
    return pd.DataFrame(a, index=s1.index, columns=s2.index)
\end{code}
It might not be obvious yet why these operations are useful, but we'll see some examples soon.
With that, we are ready to take on a new Bayesian problem.
\section{How tall is A?}
Suppose I choose two people from the population of adult males in the United States, and call them A and B. If we see that A is taller than B, how tall is A?
To answer this question:
\begin{enumerate}
\item I'll use background information about the height of men in the U.S. to form a prior distribution of height,
\item I'll construct a joint distribution of height for A and B (and I'll explain what that is),
\item Then I'll update the prior with the information that A is taller, and
\item From the posterior joint distribution I'll extract the posterior distribution of height for A.
\end{enumerate}
In the U.S. the average height of male adults is 178 cm and the standard deviation is 7.7 cm. The distribution is not exactly normal, because nothing in the real world is, but the normal distribution is a pretty good model of the actual distribution, so we can use it as a prior distribution for A and B.
Here's an array of equally-spaced values from roughly 3 standard deviations below the mean to 3 standard deviations above.
\begin{code}
mean = 178
std = 7.7
qs = np.arange(mean-24, mean+24, 0.5)
\end{code}
SciPy provides a function called \py{norm} that represents a normal distribution with a given mean and standard deviation. It provides \py{pdf}, which evaluates the normal probability density function (PDF); we will use the results as the prior probabilities.
\begin{code}
from scipy.stats import norm
ps = norm(mean, std).pdf(qs)
\end{code}
I'll store the \py{ps} and \py{qs} in a \py{Pmf} that represents the prior distribution.
\begin{code}
prior = Pmf(ps, qs)
prior.normalize()
\end{code}
This distribution represents what we believe about the heights of \py{A} and \py{B} before we take into account the data that \py{A} is taller.
\section{Joint distribution}
The next step is to construct a distribution that represents the probability of every pair of heights, which is called a joint distribution.
The elements of the joint distribution are
$\p{A_y~\mathrm{and}~B_x}$
which is the probability that \py{A} is $y$ cm tall and \py{B} is $x$ cm tall, for all values of $y$ and $x$.
At this point all we know about \py{A} and \py{B} is that they are male residents of the U.S., so their heights are independent; that is, knowing the height of \py{A} provides no additional information about the height of \py{B}.
In that case, we can compute the joint probabilities like this:
$\p{A_y~\mathrm{and}~B_x} = \p{A_y}~\p{B_x}$
Each joint probability is the product of one element from the distribution for \py{A} and one element from the distribution for \py{B}.
So we can compute the joint distribution using \py{outer_product}:
\begin{code}
joint = outer_product(prior, prior)
joint.shape
\end{code}
The result is a \py{DataFrame} with possible heights of \py{A} along the rows, heights of \py{B} along the columns, and the joint probabilities as elements.
The following function uses \py{pcolormesh} to plot the joint distribution.
\begin{code}
def plot_joint(joint):
    plt.pcolormesh(joint.columns, joint.index, joint)
    plt.colorbar()
    decorate(ylabel='A height in cm',
             xlabel='B height in cm')
\end{code}
Recall that \py{outer_product} puts the values of \py{A} along the rows and the values of \py{B} across the columns.
Figure~\ref{fig09-01} shows the results.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig09-01.pdf}}
\caption{Joint prior distribution of height for A and B.}
\label{fig09-01}
\end{figure}
As you might expect, the probability is highest near the mean and drops off away from the mean.
\section{Likelihood}
Now that we have a joint prior distribution, we can update it with the data, which is that \py{A} is taller than \py{B}.
Each element in the joint distribution represents a hypothesis about the heights of \py{A} and \py{B}; for example:
\begin{enumerate}
\item The element \py{(180, 170)} represents the hypothesis that \py{A} is 180 cm tall and \py{B} is 170 cm tall. Under this hypothesis, the probability that \py{A} is taller than \py{B} is 1.
\item The element \py{(170, 180)} represents the hypothesis that \py{A} is 170 cm tall and \py{B} is 180 cm tall. Under this hypothesis, the probability that \py{A} is taller than \py{B} is 0.
\end{enumerate}
To compute the likelihood of every pair of values, we can extract the quantities from the joint prior, like this:
\begin{code}
Y = joint.index.to_numpy()
X = joint.columns.to_numpy()
\end{code}
And then apply the \py{outer} version of \py{np.subtract}, which computes the difference between every element of \py{Y} (height of \py{A}) and every element of \py{X} (height of \py{B}).
\begin{code}
diff = np.subtract.outer(Y, X)
\end{code}
The result is an array of differences. To compute likelihoods, we use \py{np.where}, which puts \py{1} where \py{diff} is greater than 0 and \py{0} elsewhere.
\begin{code}
a = np.where(diff>0, 1, 0)
\end{code}
The result is an array of likelihoods, which I will put in a \py{DataFrame} with the values of \py{Y} in the index and the values of \py{X} in the columns.
\begin{code}
likelihood = pd.DataFrame(a, index=Y, columns=X)
\end{code}
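Here is the same sequence of steps on a grid of just three heights, small enough to check by hand. The heights are chosen for illustration.

```python
import numpy as np

# hypothetical grid of heights in cm, used for both A (rows) and B (columns)
Y = np.array([170, 175, 180])
X = np.array([170, 175, 180])

# diff[i, j] is Y[i] - X[j], the height of A minus the height of B
diff = np.subtract.outer(Y, X)

# likelihood is 1 where A is strictly taller than B, 0 otherwise
toy_likelihood = np.where(diff > 0, 1, 0)
print(toy_likelihood)
```

The ones appear below the diagonal, where the row value exceeds the column value. Because the comparison is strict, hypotheses where \py{A} and \py{B} have the same height get likelihood 0.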
Figure~\ref{fig09-02} shows the likelihood that A is taller than B for each hypothetical pair of heights.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig09-02.pdf}}
\caption{Likelihood that A is taller than B for each hypothetical pair of heights.}
\label{fig09-02}
\end{figure}
We have a prior, we have a likelihood, and we are ready for the update.
\section{The update}
As usual, the unnormalized posterior is the product of the prior and the likelihood.
\begin{code}
posterior = joint * likelihood
\end{code}
I'll use the following function to normalize the posterior:
\begin{code}
def normalize(joint):
    prob_data = joint.to_numpy().sum()
    joint /= prob_data
\end{code}
We have to convert the \py{DataFrame} to a NumPy array before calling \py{sum}. Otherwise, \py{DataFrame.sum} would compute the sums of the columns and return a \py{Series}.
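To see the difference, here is a small sketch with a made-up two-by-two \py{DataFrame}.

```python
import pandas as pd

# toy values, chosen so the sums are easy to check
toy = pd.DataFrame([[1.0, 2.0],
                    [3.0, 4.0]],
                   index=['a', 'b'], columns=['x', 'y'])

col_sums = toy.sum()               # a Series with one sum per column
grand_total = toy.to_numpy().sum() # a single number, the sum of all elements
print(col_sums)
print(grand_total)
```

The column sums are 4 and 6, and the grand total, which is what we need for normalizing, is 10.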
Now we can normalize the posterior:
\begin{code}
normalize(posterior)
\end{code}
Figure~\ref{fig09-03} shows the result.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig09-03.pdf}}
\caption{Joint posterior distribution of height for A and B.}
\label{fig09-03}
\end{figure}
For all hypotheses where \py{A} is not taller than \py{B}, the posterior probability is 0.
\section{The marginals}
\label{marginals}
The joint posterior distribution represents what we believe about the heights of \py{A} and \py{B}, given the prior distributions and the information that \py{A} is taller.
From this joint distribution, we can compute posterior distributions for \py{A} and \py{B}. To see how, let's start with a simpler problem.
Suppose we want to know the probability that \py{B} is 180 cm tall. We can select the column from the joint distribution where \py{X=180}.
\begin{code}
column = posterior[180]
\end{code}
This column contains posterior probabilities for all cases where \py{X=180}; if we add them up, we get the total probability that \py{B} is 180 cm tall.
\begin{code}
column.sum()
\end{code}
Now, to get the posterior distribution of height for \py{B}, we can add up all of the columns, like this:
\begin{code}
column_sums = posterior.sum(axis=0)
\end{code}
The argument \py{axis=0} means we want to sum down the rows, which produces one total for each column; that is, we want to add up the elements in each column.
The result is a \py{Series} that contains every possible height for \py{B} and its probability. In other words, it is the distribution of heights for \py{B}.
We can put it in a \py{Pmf} like this:
\begin{code}
marginal_B = Pmf(column_sums)
\end{code}
When we extract the distribution of a single variable from a joint distribution, the result is called a {\bf marginal distribution}.
The name comes from a common visualization that shows the joint distribution in the middle and the marginal distributions in the margins.
Similarly, we can get the posterior distribution of height for \py{A} by adding up the rows and putting the result in a \py{Pmf}.
\begin{code}
row_sums = posterior.sum(axis=1)
marginal_A = Pmf(row_sums)
\end{code}
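If the \py{axis} argument is hard to keep straight, it might help to try it on a tiny joint distribution. In this sketch the probabilities are made up, with heights of \py{A} along the rows and heights of \py{B} along the columns, and I use plain \py{Series} objects rather than \py{Pmf}.

```python
import pandas as pd

# hypothetical joint distribution: rows are heights of A, columns are heights of B
toy_joint = pd.DataFrame([[0.125, 0.25],
                          [0.125, 0.5]],
                         index=[170, 180], columns=[170, 180])

# axis=0 collapses the rows, leaving one total per column: the marginal for B
toy_marginal_B = toy_joint.sum(axis=0)

# axis=1 collapses the columns, leaving one total per row: the marginal for A
toy_marginal_A = toy_joint.sum(axis=1)

print(toy_marginal_B)
print(toy_marginal_A)
```

Each marginal adds up to 1, as it should if the joint distribution is normalized.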
The following function takes a joint distribution and an axis number, and returns a marginal distribution.
\begin{code}
def marginal(joint, axis):
    return Pmf(joint.sum(axis=axis))
\end{code}
So we can compute the marginal distributions like this.
\begin{code}
marginal_B = marginal(posterior, axis=0)
marginal_A = marginal(posterior, axis=1)
\end{code}
Figure~\ref{fig09-04} shows what they look like.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig09-04.pdf}}
\caption{Prior and posterior distributions for A and B.}
\label{fig09-04}
\end{figure}
As you might expect, the posterior distribution for \py{A} is shifted to the right and the posterior distribution for \py{B} is shifted to the left.
Based on the observation that \py{A} is taller than \py{B}, we are inclined to believe that \py{A} is a little taller than average, and \py{B} is a little shorter.
Notice that the posterior distributions are a little narrower than the prior.
The standard deviations of the posterior distributions are a little smaller, which means we are a little more certain about the heights of \py{A} and \py{B} after we compare them.
\section{Conditional posteriors}
Now suppose we measure \py{B} and find that he is 180 cm tall. What does that tell us about \py{A}?
In the joint distribution, each column corresponds to a possible height for \py{B}. We can select the column that corresponds to height 180 cm like this:
\begin{code}
column_180 = posterior[180]
\end{code}
The result is a \py{Series} that represents possible heights for \py{A} and their relative likelihoods.
These likelihoods are not normalized, but we can normalize them like this:
\begin{code}
cond_A = Pmf(column_180)
cond_A.normalize()
\end{code}
The result is the {\bf conditional distribution} of height for \py{A} given that \py{B} is 180 cm tall.
Figure~\ref{fig09-05} shows what it looks like.
Note that when we make a \py{Pmf} it copies the data by default, so we can modify \py{cond_A} without affecting \py{column_180} or \py{posterior}.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig09-05.pdf}}
\caption{Conditional distribution of height for A given that B is 180 cm tall.}
\label{fig09-05}
\end{figure}
The conditional distribution is cut off at 180 cm, because we have established that \py{A} is taller than \py{B} and \py{B} is 180 cm.
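The same two steps, selecting a column and normalizing it, can be tried on a small joint posterior. The numbers below are hypothetical, and I use a plain \py{Series} in place of a \py{Pmf}.

```python
import pandas as pd

# hypothetical joint posterior: rows are heights of A, columns are heights of B;
# it is zero wherever A is not taller than B
toy_posterior = pd.DataFrame([[0.0, 0.0, 0.0],
                              [0.4, 0.0, 0.0],
                              [0.2, 0.4, 0.0]],
                             index=[170, 175, 180],
                             columns=[170, 175, 180])

# select the column where B is 170 cm
column_170 = toy_posterior[170]

# dividing by the total makes a new, normalized Series
toy_cond_A = column_170 / column_170.sum()
print(toy_cond_A)
```

Given that \py{B} is 170 cm, the probability that \py{A} is 170 cm is 0, and the remaining probability is split between 175 cm and 180 cm in the ratio 2 to 1.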
\section{Dependence and independence}
When we constructed the joint prior distribution, I said that the heights of \py{A} and \py{B} were independent, which means that knowing one of them provides no information about the other.
In other words, the conditional probability $\p{A_y~|~B_x}$ is the same as the unconditioned probability $\p{A_y}$.
That's why we can compute an element of the joint prior, $\p{A_y~\mathrm{and}~B_x}$, by rewriting it in terms of conditional probability, $\p{B_x}~\p{A_y~|~B_x}$, and using the independence of $A$ and $B$ to replace the conditional probability.
Putting it all together, we have
$\p{A_y~\mathrm{and}~B_x} = \p{B_x}~\p{A_y}$
But remember, that's only true if $A$ and $B$ are independent.
In the posterior distribution, they are not.
We know that \py{A} is taller than \py{B}, so if we know how tall \py{B} is, that gives us information about \py{A}.
The conditional distribution we just computed demonstrates this dependence.
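Another way to see the difference is to test whether a joint distribution equals the outer product of its own marginals, which is true only under independence. The following sketch does that for a small prior and posterior; the function \py{is\_independent} and the three-point prior are my own, for illustration.

```python
import numpy as np
import pandas as pd


def is_independent(joint):
    """Check whether a joint distribution equals the outer product of its marginals."""
    marginal_rows = joint.sum(axis=1).to_numpy()
    marginal_cols = joint.sum(axis=0).to_numpy()
    product = np.multiply.outer(marginal_rows, marginal_cols)
    return bool(np.allclose(joint.to_numpy(), product))


# hypothetical three-point prior for the height of each person
heights = [170, 175, 180]
toy_prior = pd.Series([0.25, 0.5, 0.25], index=heights)

# joint prior, assuming independence
a = np.multiply.outer(toy_prior.to_numpy(), toy_prior.to_numpy())
joint_prior = pd.DataFrame(a, index=heights, columns=heights)

# update with the data that A (rows) is taller than B (columns)
likelihood = np.where(np.subtract.outer(heights, heights) > 0, 1, 0)
joint_post = joint_prior * likelihood
joint_post /= joint_post.to_numpy().sum()

print(is_independent(joint_prior), is_independent(joint_post))
```

The prior passes the test by construction; the posterior does not, because the update sets some combinations of heights to 0.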
\section{Summary}
In this chapter I started with the ``outer'' operations, like outer product, which we used to construct a joint distribution.
In general, you cannot construct a joint distribution from two marginal distributions, but in the special case where the distributions are independent, you can.
We extended the Bayesian update process we've seen in previous chapters and applied it to a joint distribution. Then from the posterior joint distribution we extracted posterior marginal distributions and posterior conditional distributions.
As an exercise, you'll have a chance to apply the same process to a slightly more difficult problem, updating Elo ratings based on the outcome of a chess game.
\section{Exercises}
The code for this chapter is in \py{chap09.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap09.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise}
Based on the results of the previous example, compute the posterior conditional distribution for \py{B} given that \py{A} is 190 cm.
\end{exercise}
\begin{exercise}
Suppose we have established that \py{A} is taller than \py{B}, but we don't know how tall \py{B} is.
Now we choose a random woman, \py{C}, and find that she is shorter than \py{A} by at least 15 cm. Compute posterior distributions for the heights of \py{A} and \py{C}.
The average height for women in the U.S. is 163 cm; the standard deviation is 7.3 cm.
\end{exercise}
\begin{exercise}
At the beginning of this chapter, I introduced
the Elo rating system, which is used to quantify the skill level of players for games like chess.
It is based on a model of the relationship between the ratings of players and the outcome of a game. Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function:
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
Suppose \py{A} has a current rating of 1600, but we are not sure it is accurate. We could describe their true rating with a normal distribution with mean 1600 and standard deviation 100, to indicate our uncertainty.
And suppose \py{B} has a current rating of 1800, with the same level of uncertainty.
Then \py{A} and \py{B} play and \py{A} wins. How should we update their ratings?
To answer this question:
\begin{enumerate}
\item Construct prior distributions for \py{A} and \py{B}.
\item Use them to construct a joint distribution, assuming that the prior distributions are independent.
\item Use the logistic function above to compute the likelihood of the outcome under each joint hypothesis.
\item Use the joint prior and likelihood to compute the joint posterior.
\item Extract and plot the marginal posteriors for \py{A} and \py{B}.
\item Compute the posterior means for \py{A} and \py{B}. How much should their ratings change based on this outcome?
\end{enumerate}
\end{exercise}
\chapter{Classification}
\label{classification}
Classification might be the best-known application of Bayesian
methods, made famous as the basis of the first generation of spam
filters in the 1990s (see \url{https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering}).
In this chapter, I'll demonstrate Bayesian classification using data
collected and made available by Dr.~Kristen Gorman at the Palmer
Long-Term Ecological Research Station in Antarctica. We'll use this data
to classify penguins by species.
This dataset was published to support this article: Gorman, Williams,
and Fraser, ``Ecological
Sexual Dimorphism and Environmental Variability within a Community of
Antarctic Penguins (Genus \emph{Pygoscelis})'', March 2014, which you can read at \url{https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081}.
The dataset contains one row for each penguin and one column for each
variable, including the measurements we will use for classification.
We can read it into a \py{DataFrame} like this:
\begin{code}
df = pd.read_csv('penguins_raw.csv')
\end{code}
Three species of penguins are represented in the dataset: Adelie,
Chinstrap and Gentoo.
The measurements we'll use to classify them are:
\begin{itemize}
\item
Body Mass in grams (g).
\item
Flipper Length in millimeters (mm).
\item
Culmen Length in millimeters.
\item
Culmen Depth in millimeters.
\end{itemize}
If you are not familiar with the word ``culmen'', it refers to the
top margin of the beak (see \url{https://en.wikipedia.org/wiki/Bird_measurement\#Culmen}).
\section{Distributions of measurements}
\label{distributions-of-measurements}
These measurements will be most useful for classification if there are
substantial differences between species and small variation within
species. To see whether that is true, and to what degree, I will plot
cumulative distribution functions (CDFs) of each measurement for each
species.
The following function takes the \py{DataFrame} and
a column name, and returns a dictionary that maps from each species name
to a \py{Cdf} of the values in the given column.
\begin{code}
def make_cdf_map(df, varname, by='Species2'):
    cdf_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        cdf_map[species] = Cdf.from_seq(group, name=species)
    return cdf_map
\end{code}
Figure~\ref{fig10-01} shows the CDFs of each measurement for each species.
\begin{figure}
\centerline{\includegraphics[width=5.5in]{figs/fig10-01.pdf}}
\caption{CDFs of the four measurements for each species.}
\label{fig10-01}
\end{figure}
It looks like we can use culmen length to identify Adelie penguins, but
the distributions for the other two species almost entirely overlap.
Using flipper length, we can distinguish Gentoo penguins from the other
two species. So with just these two features, it seems like we should be
able to classify penguins with some accuracy.
Culmen depth and body mass distinguish Gentoo penguins from the other
two species, but these features might not add a lot of additional
information, beyond flipper length and culmen length.
All of these CDFs show the sigmoid shape characteristic of the normal
distribution; I will take advantage of that observation in the next
section.
\section{Normal models}
\label{normal-models}
Now let's use these features to classify penguins. I'll proceed in the
usual Bayesian way:
\begin{enumerate}
\item
I'll define a prior distribution that represents a hypothesis for each
species and a prior probability.
\item
I'll compute the likelihood of the data under each hypothesis, and
then
\item
Compute the posterior probability of each hypothetical species.
\end{enumerate}
To compute the likelihood of the data under each hypothesis, I will use
the data to estimate the parameters of a normal distribution for each
feature and each species.
The following function takes a \py{DataFrame} and a
column name; it returns a dictionary that maps from each species name to
a \py{norm} object. \py{norm}
is defined in SciPy; it represents a normal distribution with a given
mean and standard deviation.
\begin{code}
from scipy.stats import norm

def make_norm_map(df, varname, by='Species2'):
    norm_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        mean = group.mean()
        std = group.std()
        norm_map[species] = norm(mean, std)
    return norm_map
\end{code}
For example, here's how we estimate the distributions of flipper length
for the three species.
\begin{code}
flipper_map = make_norm_map(df, 'Flipper Length (mm)')
\end{code}
As usual I will use a \py{Pmf} to represent the
prior distribution. For simplicity, I'll assume that the three species
are equally likely.
\begin{code}
hypos = flipper_map.keys()
prior = Pmf(1/3, hypos)
prior
\end{code}
Now suppose we measure a penguin and find that its flipper is 210 mm.
What is the probability of that measurement under each hypothesis?
The \py{norm} object provides
\py{pdf}, which computes the probability density
function (PDF) of the normal distribution. We can use it to compute the
likelihood of the observed data in a given distribution.
\begin{code}
data = 210
flipper_map['Adelie'].pdf(data)
\end{code}
The result is a probability density, so we can't interpret it as a
probability. But it is proportional to the likelihood of the data, so we
can use it to update the prior.
Here's how we compute the likelihood of the data in each distribution.
\begin{code}
likelihood = [flipper_map[hypo].pdf(data) for hypo in hypos]
\end{code}
Now we can do the update in the usual way.
\begin{code}
posterior = prior * likelihood
posterior.normalize()
\end{code}
And here are the results:
\input{tables/table10-01}
A penguin with a 210 mm flipper has an 80\% chance of being a Gentoo and
about a 19\% chance of being a Chinstrap (assuming that the three
species were equally likely before the measurement).
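Here is a minimal, self-contained sketch of the same kind of update that uses plain dictionaries instead of a \py{Pmf}. The means and standard deviations are made-up placeholders, not estimates from the penguin data, so the numbers it prints will not match the table.

```python
from scipy.stats import norm

# Hypothetical flipper-length models (mean, std) in mm; illustrative only
params = {'Adelie': (190, 6.5), 'Chinstrap': (196, 7.0), 'Gentoo': (217, 6.5)}
dists = {name: norm(mean, std) for name, (mean, std) in params.items()}

prior = {name: 1 / 3 for name in dists}  # equal prior probabilities
data = 210                               # observed flipper length in mm

# Unnormalized posterior: prior times the probability density of the data
unnorm = {name: prior[name] * dists[name].pdf(data) for name in dists}
total = sum(unnorm.values())
posterior = {name: p / total for name, p in unnorm.items()}

for name, p in posterior.items():
    print(f'{name}: {p:.3f}')
```

Because the densities are divided by their total, only their ratios matter; that is why a probability density can stand in for a likelihood.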
The following function encapsulates the steps we just ran. It takes a
\py{Pmf} representing the prior distribution, the
observed data, and a map from each hypothesis to the distribution of the
feature.
\begin{code}
def update_penguin(prior, data, norm_map):
    hypos = prior.qs
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
    posterior = prior * likelihood
    posterior.normalize()
    return posterior
\end{code}
The return value is the posterior distribution.
As we saw in the CDFs, flipper length does not distinguish strongly
between Adelie and Chinstrap penguins. If a penguin has a 190 mm
flipper, it is almost certainly not a Gentoo, but it is almost equally
likely to be Adelie or Chinstrap.
\begin{code}
posterior2 = update_penguin(prior, 190, flipper_map)
\end{code}
But culmen length \emph{can} make this distinction. We can estimate
distributions of culmen length for each species like this:
\begin{code}
culmen_map = make_norm_map(df, 'Culmen Length (mm)')
\end{code}
A penguin with culmen length 38 mm is almost certainly an Adelie.
\begin{code}
posterior3 = update_penguin(prior, 38, culmen_map)
\end{code}
With culmen length 48 mm, it is probably not an Adelie, but it's about
equally likely to be a Chinstrap or Gentoo.
\begin{code}
posterior4 = update_penguin(prior, 48, culmen_map)
\end{code}
Using one feature at a time, sometimes we can classify penguins with
high confidence; sometimes we can't. We can do better using multiple
features.
\section{Naive Bayesian classification}
\label{naive-bayesian-classification}
To make it easier to do multiple updates, I'll use the following
function, which takes a prior \py{Pmf}, a sequence of
measurements, and a corresponding sequence of dictionaries containing
estimated distributions.
\begin{code}
def update_naive(prior, data_seq, norm_maps):
    posterior = prior.copy()
    for data, norm_map in zip(data_seq, norm_maps):
        posterior = update_penguin(posterior, data, norm_map)
    return posterior
\end{code}
The return value is a posterior \py{Pmf}.
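Updating with one feature at a time is equivalent to a single update with the product of the per-feature densities. The following sketch checks that with the standard library's \py{NormalDist}; the distribution parameters are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical (mean, std) per species for two features; illustrative only
culmen = {'Adelie': NormalDist(39, 2.7), 'Chinstrap': NormalDist(49, 3.3),
          'Gentoo': NormalDist(47.5, 3.1)}
flipper = {'Adelie': NormalDist(190, 6.5), 'Chinstrap': NormalDist(196, 7.0),
           'Gentoo': NormalDist(217, 6.5)}

def normalize(d):
    """Scale the values of a dictionary so they add up to 1."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def update(prior, x, dist_map):
    """One Bayesian update with a single measurement."""
    return normalize({h: p * dist_map[h].pdf(x) for h, p in prior.items()})

prior = {h: 1 / 3 for h in culmen}

# Sequential updates, one feature at a time
sequential = update(update(prior, 48, culmen), 210, flipper)

# Single update with the product of the per-feature densities
batch = normalize({h: prior[h] * culmen[h].pdf(48) * flipper[h].pdf(210)
                   for h in prior})
```

Multiplying the densities is only valid if the features are independent within each species, which is the ``naive'' assumption.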
I'll use the same features we looked at in the previous section: culmen
length and flipper length.
\begin{code}
varnames = ['Culmen Length (mm)', 'Flipper Length (mm)']
norm_maps = [culmen_map, flipper_map]
\end{code}
Now suppose we find a penguin with culmen length 48 mm and flipper
length 210 mm. Here's the update:
\begin{code}
data_seq = 48, 210
posterior = update_naive(prior, data_seq, norm_maps)
\end{code}
It's most likely to be a Gentoo.
I'll loop through the dataset and classify each penguin with these two
features.
\begin{code}
df['Classification'] = np.nan
for i, row in df.iterrows():
    data_seq = row[varnames]
    posterior = update_naive(prior, data_seq, norm_maps)
    df.loc[i, 'Classification'] = posterior.max_prob()
\end{code}
The result is a new column in the \py{DataFrame}.
So let's see how many we got right.
There are 344 penguins in the dataset, but two of them are missing
measurements, so we have 342 valid cases.
Of those, 324 are classified correctly, which is almost 95\%.
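One way to compute this proportion is sketched below. It is an assumption about how such a helper might look, not necessarily the function used in the notebook; it counts only the rows that have a classification and is demonstrated on a toy \py{DataFrame}.

```python
import pandas as pd

def accuracy(df, actual='Species2', predicted='Classification'):
    """Fraction of rows with a prediction where the prediction is correct."""
    valid = df[predicted].notna()
    correct = df.loc[valid, actual] == df.loc[valid, predicted]
    return correct.sum() / valid.sum()

# Toy data: the third row has no classification and is excluded
toy = pd.DataFrame({
    'Species2': ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap'],
    'Classification': ['Adelie', 'Gentoo', None, 'Adelie'],
})
toy_accuracy = accuracy(toy)     # 2 correct out of 3 valid cases
reported = 324 / 342             # the proportion reported in the text
```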
The classifier we used in this section is called ``naive'' because it
ignores correlations between the features. To see why that matters, I'll
make a less naive classifier: one that takes into account the joint
distribution of the features.
\section{Joint distributions}
\label{joint-distributions}
Let's see what the joint distribution looks like.
I'll start by making a scatter plot of the data.
\begin{code}
def scatterplot(df, var1, var2):
    grouped = df.groupby('Species2')
    for species, group in grouped:
        plt.plot(group[var2], group[var1], 'o',
                 alpha=0.4, label=species)
    decorate(ylabel=var1, xlabel=var2)
\end{code}
Figure~\ref{fig01-02} shows a scatter plot of culmen length and flipper length for the three
species.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig10-02.pdf}}
\caption{Scatter plot of culmen length and flipper length for the three species.}
\label{fig01-02}
\end{figure}
Within each species, there is a clear correlation between culmen length
and flipper length.
If we ignore these correlations, we are assuming that the features are
independent. To see what that looks like, I'll make a joint distribution
for each species assuming independence.
The following function makes a discrete \py{Pmf}
that approximates a normal distribution.
It takes a \py{norm} object as a parameter; \py{sigmas} is the number of standard deviations to include above and below the mean; \py{n} is the number of points in the result.
\begin{code}
def make_pmf(dist, sigmas=3, n=101):
    mean, std = dist.mean(), dist.std()
    low = mean - sigmas * std
    high = mean + sigmas * std
    qs = np.linspace(low, high, n)
    ps = dist.pdf(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
We can use it, along with \py{outer_product} from Section~\ref{outer-operations}, to make a joint distribution of culmen length and
flipper length for each species.
\begin{code}
joint_map = {}
for species in hypos:
    pmf1 = make_pmf(culmen_map[species])
    pmf2 = make_pmf(flipper_map[species])
    joint_map[species] = outer_product(pmf1, pmf2)
\end{code}
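To make the independence assumption concrete, here is a small NumPy sketch of an outer product of two marginal distributions. The probabilities are hypothetical, and the array layout (rows for the first feature) is my choice for this example; the \py{outer_product} helper returns a \py{DataFrame} and may be organized differently.

```python
import numpy as np

# Two small marginal distributions (hypothetical values)
p_x = np.array([0.2, 0.5, 0.3])   # probabilities for feature 1
p_y = np.array([0.6, 0.4])        # probabilities for feature 2

# Under independence, P(x, y) = P(x) * P(y) for every pair
joint = np.outer(p_x, p_y)        # shape (3, 2): rows follow x, columns follow y

# Summing out one variable recovers the other marginal
marginal_x = joint.sum(axis=1)
marginal_y = joint.sum(axis=0)
```

A joint distribution built this way has no correlation between the features, whatever the data look like.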
And we can use the joint distribution to generate a contour plot.
\begin{code}
def plot_contour(joint, **options):
    plt.contour(joint.columns, joint.index, joint, **options)
\end{code}
Figure~\ref{fig10-03} compares the data to joint distributions that
assume independence.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig10-03.pdf}}
\caption{Scatter plot of the data with contours of joint distributions that assume independence.}
\label{fig10-03}
\end{figure}
The contours of a joint normal distribution form ellipses.
In this example, because the model assumes the features are uncorrelated,
the ellipses are aligned with the axes. But they are not well aligned
with the data.
We can make a better model of the data, and use it to compute better
likelihoods, with a multivariate normal distribution.
\section{Multivariate normal distribution}
\label{multivariate-normal-distribution}
As we have seen, a univariate normal distribution is characterized by
its mean and standard deviation or variance (where variance is the
square of standard deviation).
A multivariate normal distribution is characterized by the means of the
features and the \textbf{covariance matrix}, which contains the
variances, which quantify the spread of the features, and the
covariances, which quantify the relationships among them.
We can use the data to estimate the means and covariance matrix for the
population of penguins. First I'll select the columns we want.
\begin{code}
features = df[[var1, var2]]
features.head()
\end{code}
And compute the means.
\begin{code}
mean = features.mean()
mean
\end{code}
\begin{code}
# convert to a DataFrame and write as a table
mean_df = pd.DataFrame(mean, columns=['mean'])
write_table(mean_df, 'table10-04')
\end{code}
The result is a \py{Series} containing the mean
culmen length and flipper length.
We can also compute the covariance matrix:
\begin{code}
cov = features.cov()
write_table(cov, 'table10-05')
cov
\end{code}
The result is a \py{DataFrame} with one row and
one column for each feature. The elements on the diagonal are the
variances; the elements off the diagonal are covariances.
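The following sketch illustrates the structure of a covariance matrix with synthetic data (not the penguin measurements). It checks that the diagonal holds the variances and shows how a covariance relates to a correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(17)

# Synthetic, positively correlated features (not the penguin data)
x = rng.normal(40, 3, size=500)
y = 150 + 1.0 * x + rng.normal(0, 5, size=500)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)        # 2x2 sample covariance matrix (ddof=1)

variances = data.var(axis=0, ddof=1)    # should match the diagonal
# Dividing a covariance by both standard deviations gives the correlation
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
```

The matrix is symmetric, so the two off-diagonal elements are equal.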
SciPy provides a \py{multivariate_normal} object
we can use to represent a multivariate normal distribution. It takes a
sequence of means and a covariance matrix as parameters:
\begin{code}
from scipy.stats import multivariate_normal
multinorm = multivariate_normal(mean, cov)
multinorm
\end{code}
The following function makes a
\py{multivariate_normal} object for each species.
\begin{code}
def make_multinorm_map(df, varnames):
    multinorm_map = {}
    grouped = df.groupby('Species2')
    for species, group in grouped:
        features = group[varnames]
        mean = features.mean()
        cov = features.cov()
        multinorm_map[species] = multivariate_normal(mean, cov)
    return multinorm_map
\end{code}
And here's how we use it.
\begin{code}
multinorm_map = make_multinorm_map(df, [var1, var2])
\end{code}
In the next section we'll see what the multivariate normal distributions
look like.
Then we'll use them to classify penguins, and we'll see if the results
are more accurate than those of the naive Bayesian classifier.
\section{Visualizing a multivariate normal distribution}
\label{visualizing-a-multivariate-normal-distribution}
This section uses some NumPy magic to generate contour plots for
multivariate normal distributions. If that's interesting for you, great!
Otherwise, feel free to skip to the results. In the next section we'll
do the actual classification, which turns out to be easier than the
visualization.
I'll start by making a contour map for the distribution of features
among Adelie penguins.
Here are the univariate distributions for the two features we'll use and
the multivariate distribution we just computed.
\begin{code}
norm1 = culmen_map['Adelie']
norm2 = flipper_map['Adelie']
multinorm = multinorm_map['Adelie']
\end{code}
I'll make a discrete \py{Pmf} approximation for
each of the univariate distributions.
\begin{code}
pmf1 = make_pmf(norm1)
pmf2 = make_pmf(norm2)
\end{code}
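And use them to make a mesh that contains all pairs of values.
\begin{code}
X, Y = np.meshgrid(pmf1.qs, pmf2.qs, indexing='ij')
\end{code}
The mesh is represented by two arrays, one containing the values of the
first feature for each pair, the other containing the values of the
second feature. With \py{indexing='ij'}, rows correspond to
\py{pmf1.qs} and columns correspond to \py{pmf2.qs}.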
In order to evaluate the multivariate distribution for each pair of
values, we have to ``stack'' the arrays.
\begin{code}
pos = np.dstack((X, Y))
\end{code}
The result is a 3-D array that you can think of as a 2-D array of pairs.
When we pass this array to \py{multinorm.pdf}, it
evaluates the probability density function of the distribution for each
pair of values.
\begin{code}
a = multinorm.pdf(pos)
\end{code}
The result is an array of probability densities. If we put them in a
\py{DataFrame} and normalize them, the result is a
discrete approximation of the joint distribution of the two features.
\begin{code}
joint = pd.DataFrame(a, index=pmf1.qs, columns=pmf2.qs)
normalize(joint)
\end{code}
Which we can plot with \py{plot_contour}:
\begin{code}
plot_contour(joint)
\end{code}
Figure~\ref{fig10-04} shows a scatter plot of the data along with the
contours of the multivariate normal distribution for each species.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig10-04.pdf}}
\caption{Scatter plot of the data with contours of the multivariate normal distribution for each species.}
\label{fig10-04}
\end{figure}
The contours of a multivariate normal distribution are still ellipses,
but now that we have taken into account the correlation between the
features, the ellipses are no longer aligned with the axes.
Because it takes the correlations into account, the multivariate normal
distribution is a better model for the data. And there is less overlap
in the contours of the three distributions, which suggests that they
should yield better classifications.
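For readers who skipped the details, here is a compact sketch of the mesh-and-stack technique with a hypothetical two-feature distribution. The mean vector and covariance matrix are made up; the grid sizes differ on purpose so the array shapes are easy to read.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical means and covariance matrix with positive correlation
mean = [0.0, 0.0]
cov = [[1.0, 0.8],
       [0.8, 1.0]]
dist = multivariate_normal(mean, cov)

xs = np.linspace(-3, 3, 5)           # values of the first feature
ys = np.linspace(-3, 3, 7)           # values of the second feature
X, Y = np.meshgrid(xs, ys)           # default layout: rows follow ys
pos = np.dstack((X, Y))              # shape (7, 5, 2): a grid of (x, y) pairs
densities = dist.pdf(pos)            # shape (7, 5)

# With positive correlation, (1.5, 1.5) is more likely than (1.5, -1.5)
same_sign = dist.pdf([1.5, 1.5])
opposite = dist.pdf([1.5, -1.5])
```

With NumPy's default mesh layout, rows of the result correspond to the second argument of \py{meshgrid}, which matters when labeling the rows and columns of a \py{DataFrame} built from the densities.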
\section{A less naive classifier}
\label{a-less-naive-classifier}
In a previous section we used \py{update_penguin}
to update a prior \py{Pmf} based on observed data
and a collection of \py{norm} objects that model
the distribution of observations under each hypothesis. Here it is
again:
\begin{code}
def update_penguin(prior, data, norm_map):
    hypos = prior.qs
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
    posterior = prior * likelihood
    posterior.normalize()
    return posterior
\end{code}
I wrote this function with \py{norm} objects in
mind, but it also works if the distributions in
\py{norm_map} are
\py{multivariate_normal} objects. So we can call
it like this:
\begin{code}
data = 38, 190
update_penguin(prior, data, multinorm_map)
\end{code}
A penguin with culmen length 38 and flipper length 190 is almost
certainly an Adelie.
\begin{code}
data = 48, 195
update_penguin(prior, data, multinorm_map)
\end{code}
A penguin with culmen length 48 and flipper length 195 is almost
certainly a Chinstrap.
\begin{code}
data = 48, 215
update_penguin(prior, data, multinorm_map)
\end{code}
And a penguin with culmen length 48 and flipper length 215 is almost
certainly a Gentoo.
Let's see if this classifier does any better than the naive Bayesian
classifier. I'll apply it to each penguin in the dataset:
\begin{code}
df['Classification'] = np.nan
for i, row in df.iterrows():
    data = row[varnames]
    posterior = update_penguin(prior, data, multinorm_map)
    df.loc[i, 'Classification'] = posterior.idxmax()
\end{code}
And compute the accuracy:
\begin{code}
accuracy(df)
\end{code}
It turns out to be only a little better: the accuracy is 95.3\%,
compared to 94.7\% for the naive Bayesian classifier.
In one way, that's disappointing. After all that work, it would have
been nice to see a bigger difference.
But in another way, it's good news. In general, a naive Bayesian
classifier is easier to implement and requires less computation. If it
works nearly as well as a more complex algorithm, it might be a good
choice for practical purposes.
But speaking of practical purposes, you might have noticed that this
example isn't very useful. If we want to identify the species of a
penguin, there are easier ways than measuring its flippers and beak.
However, there are valid scientific uses for this type of
classification. One of them is the subject of the research paper we
started with: sexual dimorphism
(\url{https://en.wikipedia.org/wiki/Sexual_dimorphism}), that is,
differences in shape between male and female animals.
In some species, like angler fish, males and females look very
different. In other species, like mockingbirds, they are difficult to
tell apart. And dimorphism is worth studying because it provides insight
into social behavior, sexual selection, and evolution.
One way to quantify the degree of sexual dimorphism in a species is to
use a classification algorithm like the one in this chapter. If you can
find a set of features that makes it possible to classify individuals by
sex with high accuracy, that's evidence of high dimorphism.
As an exercise, you can use the dataset from this chapter to classify
penguins by sex and see which of the three species is the most
dimorphic.
\section{Exercises}
The code for this chapter is in \py{chap10.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap10.ipynb}.
The notebook provides space where you can work on the following problems.
\begin{exercise} In my example I used culmen length and flipper length
because they seemed to provide the most power to distinguish the three
species. But maybe we can do better by using more features.
Make a naive Bayesian classifier that uses all four measurements in the
dataset: culmen length and depth, flipper length, and body mass. Is it
more accurate than the model with two features?
\end{exercise}
\begin{exercise}
One of the reasons the penguin dataset was collected
was to quantify sexual dimorphism in different penguin species, that is,
physical differences between male and female penguins. One way to
quantify dimorphism is to use measurements to classify penguins by sex.
If a species is more dimorphic, we expect to be able to classify them
more accurately.
As an exercise, pick a species and use a Bayesian classifier (naive or
not) to classify the penguins by sex. Which features are most useful?
What accuracy can you achieve?
\end{exercise}
\chapter{Inference}
Whenever people compare Bayesian inference with conventional
approaches, one of the questions that comes up most often is something
like, ``What about p-values?'' And one of the most common examples is
the comparison of two groups to see if there is a difference in their
means.
In classical statistical inference, the usual tool for this scenario is
Student's \textit{t}-test
(\url{https://en.wikipedia.org/wiki/Student\%27s_t-test}), and the
result is a p-value (\url{https://en.wikipedia.org/wiki/P-value}).
This process is an example of ``null hypothesis significance testing''.
A Bayesian alternative is to compute the posterior distribution of the
difference between the groups. Then we can use that distribution to
answer whatever questions we are interested in, including the most
likely size of the difference, a credible interval that's likely to
contain the true difference, the probability of superiority, or the
probability that the difference exceeds some threshold.
To demonstrate this process, I'll solve a standard problem from a
statistics textbook: evaluating the effect of an educational
``treatment'' compared to a control.
\section{Improving Reading Ability}
We'll use data from a Ph.D.~dissertation in educational psychology
written in 1987
(\url{https://docs.lib.purdue.edu/dissertations/AAI8807671/}), which was
used as an example in a statistics textbook from 1989
(\url{https://books.google.com/books/about/Introduction_to_the_practice_of_statisti.html?id=pGBNhajABlUC})
and published on DASL
(\url{https://web.archive.org/web/20000603124754/http://lib.stat.cmu.edu/DASL/Datafiles/DRPScores.html}),
a web page that collects data stories.
Here's the description from DASL:
\begin{quote}
An educator conducted an experiment to test whether new directed reading
activities in the classroom will help elementary school pupils improve
some aspects of their reading ability. She arranged for a third grade
class of 21 students to follow these activities for an 8-week period. A
control classroom of 23 third graders followed the same curriculum
without the activities. At the end of the 8 weeks, all students took a
Degree of Reading Power (DRP) test, which measures the aspects of
reading ability that the treatment is designed to improve.
\end{quote}
The data are in the repository for this book.
I'll use Pandas to load the data into a \py{DataFrame}:
\begin{code}
import pandas as pd
df = pd.read_csv('drp_scores.csv', skiprows=21, delimiter='\t')
\end{code}
And \py{groupby} to separate the data for the
\py{Treated} and \py{Control}
groups:
\begin{code}
grouped = df.groupby('Treatment')
responses = {}
for name, group in grouped:
    responses[name] = group['Response']
\end{code}
Figure~\ref{fig11-01} shows the cumulative distributions of the scores for the two groups, and here are their summary statistics.
\begin{stdout}
Group n mean std
----- -- ---- ---
Control 23 41.5 17.1
Treated 21 51.5 11.0
\end{stdout}
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig11-01.pdf}}
\caption{CDF of test scores for treated group and control group.}
\label{fig11-01}
\end{figure}
The distribution of scores is not exactly normal for either group, but
it is close enough that the normal model is a reasonable choice.
So I'll assume that in the entire population of students (not just the
ones in the experiment), the distribution of scores is well modeled by a
normal distribution with unknown mean and standard deviation. I'll use
\py{mu} and \py{sigma} to
denote these unknown population parameters.
And we'll do a Bayesian update to estimate what they are.
\section{Estimating parameters}
As always, we need a prior distribution for the parameters.
Since there are two parameters, it will be a joint distribution.
I'll construct it by choosing marginal distributions for each parameter
and computing their outer product.
As a simple starting place, I'll assume that the prior distributions for
\py{mu} and \py{sigma} are
uniform.
\begin{code}
mus = np.linspace(20, 80, 101)
prior_mu = Pmf(1, mus, name='mean')
sigmas = np.linspace(5, 30, 101)
prior_sigma = Pmf(1, sigmas, name='std')
\end{code}
Assuming that the parameters are independent, we can use \py{outer_product} from Section~\ref{outer-operations} to construct the joint prior distribution.
\begin{code}
from utils import outer_product
prior = outer_product(prior_mu, prior_sigma)
\end{code}
Now, we would like to know the probability of each score in the dataset
for each hypothetical pair of values, \py{mu} and
\py{sigma}. I'll do that by making a 3-dimensional
grid with values of \py{sigma} on the first axis,
values of \py{mu} on the second axis, and the
scores from the control group on the third axis.
\begin{code}
data = responses['Control']
sigmas, mus, data_mesh = np.meshgrid(prior.columns,
                                     prior.index,
                                     data)
\end{code}
Now we can use \py{norm.pdf} to compute the
probability density of each score for each hypothetical pair of
parameters.
\begin{code}
from scipy.stats import norm
densities = norm.pdf(data_mesh, mus, sigmas)
\end{code}
The result is a 3-D array. To compute likelihoods, I'll compute the
product of these densities along the third axis, that is,
\py{axis=2}:
\begin{code}
likelihood = densities.prod(axis=2)
likelihood.shape
\end{code}
The result is a 2-D array that contains the likelihood of the entire
dataset for each hypothetical pair of parameters.
We can use this array as part of a Bayesian update, as in this function:
\begin{code}
from utils import normalize
def update_norm(prior, data):
    X, Y, Z = np.meshgrid(prior.columns, prior.index, data)
    likelihood = norm.pdf(Z, Y, X).prod(axis=2)
    posterior = prior * likelihood
    normalize(posterior)
    return posterior
\end{code}
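The same computation can be sketched with plain NumPy arrays, using broadcasting in place of \py{meshgrid}. This is a variation for illustration, not the function above: the dataset is made up, the grids are coarser, and the prior is uniform so the posterior is just the normalized likelihood.

```python
import numpy as np
from scipy.stats import norm

# Small made-up dataset, for illustration only
data = np.array([38.0, 45.0, 52.0, 41.0, 49.0, 44.0])

mus = np.linspace(20, 80, 61)        # hypothetical means (step 1)
sigmas = np.linspace(2, 30, 57)      # hypothetical standard deviations (step 0.5)

# Broadcast to a 3-D grid: axis 0 = mu, axis 1 = sigma, axis 2 = data
M = mus[:, None, None]
S = sigmas[None, :, None]
densities = norm.pdf(data[None, None, :], M, S)   # shape (61, 57, 6)

likelihood = densities.prod(axis=2)               # likelihood of the whole dataset
posterior = likelihood / likelihood.sum()         # uniform prior, so just normalize

# Most probable pair of parameters on the grid
i, j = np.unravel_index(posterior.argmax(), posterior.shape)
map_mu, map_sigma = mus[i], sigmas[j]
```

With a uniform prior, the most probable pair should land near the sample mean and the sample standard deviation of the data.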
Here are the updates for the control and treatment groups:
\begin{code}
data = responses['Control']
posterior_control = update_norm(prior, data)
data = responses['Treated']
posterior_treated = update_norm(prior, data)
\end{code}
Figure~\ref{fig11-02} shows what the joint posterior distributions look like.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig11-02.pdf}}
\caption{Joint posterior distributions for the treated and control groups.}
\label{fig11-02}
\end{figure}
Along the vertical axis, it looks like the mean score for the treated
group is higher. Along the horizontal axis, it looks like the standard
deviation for the control group is higher.
If we think the treatment causes these differences, the data suggest
that the treatment increases the mean of the scores and decreases their
spread.
We can see these differences more clearly by looking at the marginal
distributions for \py{mu} and
\py{sigma}.
\section{Posterior marginal distributions}
I'll use \py{marginal}, which we saw in Section~\ref{marginals},
to extract the posterior marginal distributions for the population means.
\begin{code}
from utils import marginal
pmf_mean_control = marginal(posterior_control, 1)
pmf_mean_treated = marginal(posterior_treated, 1)
\end{code}
Figure~\ref{fig11-03} shows what they look like.
It seems like we are pretty sure that the population mean in the treated
group is higher.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig11-03.pdf}}
\caption{Posterior marginal distributions of the population mean for the treated and control groups.}
\label{fig11-03}
\end{figure}
We can use \py{prob_gt} to
compute the probability of superiority:
\begin{code}
Pmf.prob_gt(pmf_mean_treated, pmf_mean_control)
\end{code}
There is a 98\% chance that the mean in the treated group is higher.
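Conceptually, the probability of superiority adds up the probability of every pair of values where the first exceeds the second. Here is a sketch of that computation with NumPy arrays; the function is a stand-in written for this example, it assumes the two distributions are independent, and it is checked against a case where the answer is known.

```python
import numpy as np

def prob_gt(qs1, ps1, qs2, ps2):
    """P(X1 > X2) for independent discrete distributions."""
    greater = np.subtract.outer(qs1, qs2) > 0     # True where q1 > q2
    weights = np.outer(ps1, ps2)                  # P(q1) * P(q2) for each pair
    return (greater * weights).sum()

# Check with two fair six-sided dice: 15 of the 36 pairs have die1 > die2
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)
p = prob_gt(faces, probs, faces, probs)
```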
We can use \py{sub_dist} to compute the
distribution of the difference.
\begin{code}
diff = Pmf.sub_dist(pmf_mean_treated, pmf_mean_control)
\end{code}
But there are two things to be careful about when we use methods like
\py{sub_dist}.
The first is that the result usually contains more elements than the
original \py{Pmf}.
In this example, the original distributions have the same quantities, so
the size increase is moderate.
But in the worst case, the size of the result can be the product of the
sizes of the originals.
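The following sketch shows both cases with small distributions stored as dictionaries. The function is a simplified stand-in for \py{sub_dist}, written for this example, and the quantities are integers chosen so the counts are easy to verify.

```python
from collections import defaultdict

def sub_dist(pmf1, pmf2):
    """Distribution of X1 - X2 for independent discrete distributions (dicts)."""
    result = defaultdict(float)
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            result[q1 - q2] += p1 * p2
    return dict(result)

# Same equally spaced quantities: many differences coincide
a = {q: 1 / 5 for q in [0, 1, 2, 3, 4]}
same_grid = sub_dist(a, a)            # 2 * 5 - 1 = 9 distinct differences

# Quantities chosen so that every difference is distinct (worst case)
b = {q: 1 / 3 for q in [0, 1, 2]}
c = {q: 1 / 3 for q in [0, 3, 6]}
worst = sub_dist(b, c)                # 3 * 3 = 9 distinct differences
```

With floating-point quantities, differences that should coincide can differ by rounding error, which is another way the result can grow.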
The other thing to be aware of is that plotting a
\py{Pmf} does not always work well. In this
example, if we plot the distribution of differences, the result is
pretty noisy.
There are two ways to work around that limitation. One is to plot the
CDF, which smooths out the noise.
The other option is to use kernel density estimation (KDE) to make a
smooth approximation of the PDF on an equally-spaced grid.
The following function takes a \py{Pmf} and the number of points on the grid, and returns a smooth \py{Pmf}, ready for plotting.
\begin{code}
from scipy.stats import gaussian_kde
def make_kde(pmf, n=101):
    kde = gaussian_kde(pmf.qs, weights=pmf.ps)
    qs = np.linspace(pmf.qs.min(), pmf.qs.max(), n)
    ps = kde.evaluate(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
Figure~\ref{fig11-04} shows what it looks like.
The mean of the difference is almost 10 points, which is substantial.
Finally, we can use \py{credible_interval} to
compute a 90\% credible interval.
\begin{code}
diff.credible_interval(0.9)
\end{code}
Based on the data, we are pretty sure the treatment improves test scores
by 2.4 to 17.4 points.
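A central credible interval can be read off the cumulative probabilities. The sketch below is one way to do it, assuming sorted quantities; it is a hypothetical helper, not the implementation behind \py{credible_interval}, and the example distribution is made up.

```python
import numpy as np

def credible_interval(qs, ps, p):
    """Central interval containing probability p, from a discrete distribution.

    qs must be sorted in increasing order; ps must sum to 1.
    """
    cdf = np.cumsum(ps)
    tail = (1 - p) / 2
    low = qs[np.searchsorted(cdf, tail)]          # first q with CDF >= tail
    high = qs[np.searchsorted(cdf, 1 - tail)]     # first q with CDF >= 1 - tail
    return low, high

# Example: a discretized normal distribution with mean 10 and std 4.5
qs = np.linspace(-10, 30, 401)
ps = np.exp(-0.5 * ((qs - 10) / 4.5) ** 2)
ps /= ps.sum()
low, high = credible_interval(qs, ps, 0.9)
```

For a 90\% interval, 5\% of the probability lies below the lower bound and 5\% lies above the upper bound.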
\section{Using summary statistics}
In this example the dataset is not very big, so it doesn't take too long
to compute the probability of every score under every hypothesis. But
the result is a 3-D array; for larger datasets, it might be too big to
compute practically.
Also, with larger datasets the likelihoods get very small, sometimes so
small that we can't compute them with normal floating-point arithmetic.
That's because we are computing the probability of a particular dataset;
the number of possible datasets is astronomically big, so the
probability of any of them is very small.
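To see the floating-point problem directly, the following sketch multiplies the densities of 2000 made-up scores. The product underflows to zero, while the sum of the log-densities remains a usable number. Working with logarithms is a common workaround, although it is not the approach taken in this chapter.

```python
import math
from statistics import NormalDist

dist = NormalDist(mu=40, sigma=17)

# 2000 made-up scores that cycle through a few values
data = [25 + 5 * (i % 7) for i in range(2000)]

# Multiplying many small densities underflows to exactly 0.0
likelihood = 1.0
for x in data:
    likelihood *= dist.pdf(x)

# Adding log-densities stays within floating-point range
log_likelihood = sum(math.log(dist.pdf(x)) for x in data)
```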
An alternative is to compute a summary of the dataset and compute the
likelihood of the summary. For example, if we compute the sample mean of
the data and the sample standard deviation, we could compute the
likelihood of those summary statistics under each hypothesis.
As an example, suppose we know that the population mean is 40 and the
standard deviation is 17. We can make a \py{norm}
object that represents a normal distribution with these parameters:
\begin{code}
mu = 40
sigma = 17
dist = norm(mu, sigma)
\end{code}
Now suppose we draw 1000 samples from this distribution with sample size
\py{n=20}. I'll use \py{rvs},
which generates a random sample, to simulate this experiment.
\begin{code}
n = 20
samples = dist.rvs((1000, n))
samples.shape
\end{code}
The result is an array with 1000 rows, each containing a sample of 20
values.
If we compute the mean of each row, the result is an array that contains
1000 sample means; that is, each value is the mean of a sample with
\py{n=20}.
\begin{code}
sample_means = samples.mean(axis=1)
sample_means.shape
\end{code}
Now, we would like to know what the distribution of these sample means
is. Using the properties of the normal distribution
(\url{https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables}),
we can show that their distribution is normal with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$:
\begin{code}
dist_m = norm(mu, sigma/np.sqrt(n))
\end{code}
\py{dist_m} represents the ``sampling distribution
of the mean''.
In the notebook for this chapter, you'll see that the random sample means follow the theoretical
distribution closely, as expected.
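A quick version of that check, as a sketch: simulate many samples and compare the standard deviation of their means with $\sigma/\sqrt{n}$. I use 10,000 samples here, more than in the text, to reduce random noise; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 40, 17, 20

# 10,000 simulated samples of size n (more than in the text, to reduce noise)
samples = rng.normal(mu, sigma, size=(10_000, n))
sample_means = samples.mean(axis=1)

observed_std = sample_means.std()
expected_std = sigma / np.sqrt(n)     # standard error of the mean
```

The two values should agree to within a few percent.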
We can also compute standard deviations for each row in
\py{samples}.
\begin{code}
sample_stds = samples.std(axis=1)
sample_stds.shape
\end{code}
The result is an array of sample standard deviations. We might wonder
what the distribution of these values is. The derivation
(\url{https://en.wikipedia.org/wiki/Normal_distribution\#Sample_variance})
is not as easy, but if we transform the sample standard deviations like
this:
\[ t = n s^2 / \sigma^2 \]
where $n$ is the sample size, $s$ is the sample standard deviation,
and $\sigma$ is the population standard deviation, the transformed
values follow a chi-square distribution
(\url{https://en.wikipedia.org/wiki/Chi-square_distribution})
with $n-1$ degrees of freedom.
Here are the transformed values.
\begin{code}
transformed = n * sample_stds**2 / sigma**2
\end{code}
And I'll create a \py{chi2} object that represents
a chi-square distribution.
\begin{code}
from scipy.stats import chi2
dist_s = chi2(n-1)
\end{code}
In the notebook you'll see that the distribution of transformed sample standard deviations agrees with
the theoretical distribution.
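Here is a sketch of that check, which also illustrates the point about implementation details. NumPy's \py{std} divides by $n$ by default, which is the convention the transformation above assumes. A standard deviation computed with \py{ddof=1}, which is the default in Pandas, needs a factor of $n-1$ instead. The seed and the number of samples are arbitrary.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
mu, sigma, n = 40, 17, 20
samples = rng.normal(mu, sigma, size=(10_000, n))

# NumPy's default (ddof=0) divides by n, so n * s**2 is the sum of squared deviations
s0 = samples.std(axis=1)
t0 = n * s0**2 / sigma**2

# With ddof=1 (the pandas default), the matching statistic uses n - 1
s1 = samples.std(axis=1, ddof=1)
t1 = (n - 1) * s1**2 / sigma**2

dist_s = chi2(n - 1)                  # theoretical distribution of both statistics
```

Both versions of the statistic are the same numbers, and their average should be close to the mean of the chi-square distribution, which is $n-1$.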
I think it is useful to check theoretical results like this, for a few
reasons:
\begin{itemize}
\item
It confirms that my understanding of the theory is correct,
\item
It confirms that the conditions where I am applying the theory are
conditions where the theory holds,
\item
It confirms that the implementation details are correct. For many
distributions, there is more than one way to specify the parameters.
If you use the wrong specification, this kind of testing will help you
catch the error.
\end{itemize}
Before we move on, I'll mention one other theoretical result we will
use: Basu's theorem (\url{https://en.wikipedia.org/wiki/Basu\%27s_theorem}),
which implies that, for samples from a normal distribution, the sample
mean and sample standard deviation are independent.
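A simulation can't prove independence, but it can show that the result is plausible. In the following sketch, the correlation between sample means and sample standard deviations is near zero for normal data. For contrast, I include samples from an exponential distribution, where the two statistics are clearly related. The parameters and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.normal(40, 17, size=(10_000, 20))

sample_means = samples.mean(axis=1)
sample_stds = samples.std(axis=1)

# For normal data the correlation should be close to 0
corr_normal = np.corrcoef(sample_means, sample_stds)[0, 1]

# For contrast, a skewed (exponential) population shows clear dependence
skewed = rng.exponential(17, size=(10_000, 20))
corr_skewed = np.corrcoef(skewed.mean(axis=1), skewed.std(axis=1))[0, 1]
```

This independence is what justifies multiplying the likelihood of the sample mean by the likelihood of the sample standard deviation.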
\section{Update with summary statistics}
Now we're ready to do an update. I'll compute summary statistics for the
two groups.
\begin{code}
summary = {}
for name, response in responses.items():
    summary[name] = (len(response),
                     response.mean(),
                     response.std())
\end{code}
The result is a dictionary that maps from group name to a tuple that
contains the sample size, \py{n}, the sample mean,
\py{m}, and the sample standard deviation
\py{s}, for each group.
I'll demonstrate the update with the summary statistics from the control
group.
\begin{code}
n, m, s = summary['Control']
\end{code}
I'll make a mesh with hypothetical values of
\py{mu} on the vertical axis and values of
\py{sigma} on the horizontal axis.
\begin{code}
sigmas, mus = np.meshgrid(prior.columns, prior.index)
sigmas.shape
\end{code}
Now we can compute the likelihood of seeing the sample mean,
\py{m}, for each pair of parameters.
\begin{code}
like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
\end{code}
And use it to update the prior.
\begin{code}
posterior1 = prior * like1
normalize(posterior1)
\end{code}
Next we compute the likelihood of seeing the sample standard deviation, \py{s}, for each pair of parameters.
\begin{code}
like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)
\end{code}
And here's the second update:
\begin{code}
posterior2 = posterior1 * like2
normalize(posterior2)
\end{code}
The following function does both updates, using the sample mean and
standard deviation.
\begin{code}
def update_norm_summary(prior, data):
    n, m, s = data
    sigmas, mus = np.meshgrid(prior.columns, prior.index)
    like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
    like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)
    posterior = prior * like1 * like2
    normalize(posterior)
    return posterior
\end{code}
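For reference, here is a self-contained sketch of the same update with plain NumPy arrays and a uniform prior. It uses the rounded summary statistics for the control group from the table earlier in the chapter, so its results will differ slightly from the ones computed with the unrounded values.

```python
import numpy as np
from scipy.stats import norm, chi2

# Rounded summary statistics for the control group, from the table in the text
n, m, s = 23, 41.5, 17.1

mus = np.linspace(20, 80, 101)
sigmas = np.linspace(5, 30, 101)
S, M = np.meshgrid(sigmas, mus)       # rows follow mus, columns follow sigmas

like_mean = norm.pdf(m, M, S / np.sqrt(n))          # density of the sample mean
like_std = chi2.pdf(n * s**2 / S**2, n - 1)         # density of the transformed std
posterior = like_mean * like_std                    # uniform prior
posterior /= posterior.sum()

marginal_mu = posterior.sum(axis=1)   # sum over sigma
post_mean_mu = (mus * marginal_mu).sum()
```

The posterior mean of \py{mu} should be close to the sample mean, since the prior is uniform and the grid extends well beyond the plausible values.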
Here are the updates for the two groups.
\begin{code}
data = summary['Control']
posterior_control2 = update_norm_summary(prior, data)
data = summary['Treated']
posterior_treated2 = update_norm_summary(prior, data)
\end{code}
You can see the results in the notebook for this chapter.
Visually, these posterior joint distributions are similar to the ones we
computed using the entire datasets, not just the summary statistics.
But they are not exactly the same, as we'll see by comparing the marginal
distributions.
\section{Comparing marginals}
Again, let's extract the marginal posterior distributions.
\begin{code}
pmf_mean_control2 = marginal(posterior_control2, 1)
pmf_mean_treated2 = marginal(posterior_treated2, 1)
\end{code}
And compare them to results we got using the entire dataset.
Figure~\ref{fig11-05} shows the results.
\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig11-05.pdf}}
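\caption{Posterior marginal distributions of the means, computed from the entire dataset and from summary statistics.}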
\label{fig11-05}
\end{figure}
For both groups, the distribution of \py{mu} is a little wider when we use only the summary statistics; that is, we are a little less certain about the values of the means.
If we compute the posterior distribution of the difference in means,
the mean difference is nearly the same, but the credible interval is a bit wider.
That's because the update we did is based on the implicit assumption
that the distribution of the data is actually normal, but it's not.
As a result, when we replace the dataset with the summary statistics, we lose some information about the true distribution of the data. With less
information, we are less certain about the parameters.
\section{Summary}
In this chapter we used a joint distribution to represent prior
probabilities for the parameters of a normal distribution,
\py{mu} and \py{sigma}.
And we updated that distribution two ways: first using the entire
dataset and the normal PDF; then using summary statistics, the normal
PDF, and the chi-square PDF.
Using summary statistics is computationally more efficient, but it loses
some information in the process.
Normal distributions, and other distributions that are well approximated
by them, appear in many domains, so the
methods in this chapter are broadly applicable. The exercises at the end
of the chapter will give you a chance to apply them.
\section{Exercises}
\begin{exercise}
Looking again at the posterior joint distribution of
\py{mu} and \py{sigma}, it
seems like the standard deviation of the treated group might be lower;
if so, that would suggest that the treatment is more effective for
students with lower scores.
But before we speculate too much, we should estimate the size of the
difference and see whether it might actually be 0.
As we did with the values of \py{mu} in the
previous section, extract the posterior marginal distributions of
\py{sigma} for the two groups. What is the
probability that the standard deviation is higher in the control group?
Compute the distribution of the difference in
\py{sigma} between the two groups. What is the mean
of this difference? What is the 90\% credible interval?
\end{exercise}
\begin{exercise}
An ``effect size'' is a statistic intended to quantify the magnitude of a phenomenon (see \url{http://en.wikipedia.org/wiki/Effect_size}).
If the phenomenon is a difference in means between two groups, a common way to quantify it is Cohen's effect size, denoted $d$.
If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
parameters for Group 2 are $(\mu_2, \sigma_2)$, Cohen's
effect size is
\[ d = \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
Use the joint posterior distributions for the two groups to compute the posterior distribution for Cohen's effect size.
Then compute the mean and 90\% credible interval.
Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.
\end{exercise}
\begin{exercise}
This exercise is inspired by a question that appeared on Reddit
(\url{https://www.reddit.com/r/statistics/comments/hcvl2j/q_reverse_empirical_distribution_rule_question/}).
An instructor announces the results of an exam like this, ``The average
score on this exam was 81. Out of 25 students, 5 got more than 90, and I
am happy to report that no one failed (got less than 60).''
Based on this information, what do you think the standard deviation of
scores was?
You can assume that the distribution of scores is approximately normal.
And let's assume that the sample mean, 81, is actually the population
mean, so we only have to estimate \py{sigma}.
Hint: To compute the probability of a score greater than 90, you can use
\py{norm.sf}, which computes the survival function,
also known as the complementary CDF, or
\py{1 - cdf(x)}.
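As a small illustration of the hint, the following sketch checks that \py{norm.sf} agrees with \py{1 - cdf(x)}. The value of \py{sigma} here is hypothetical, chosen only to show the call; it is not the answer to the exercise.

```python
from scipy.stats import norm

mean = 81
sigma = 10   # hypothetical value, used only to demonstrate the function call

p_gt_90 = norm.sf(90, mean, sigma)         # P(score > 90)
p_check = 1 - norm.cdf(90, mean, sigma)    # the same quantity via the CDF
print(p_gt_90, p_check)
```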
\end{exercise}
\begin{exercise}
I have a soft spot for crank science, so this
exercise is about the Variability Hypothesis
(\url{http://en.wikipedia.org/wiki/Variability_hypothesis}), which
\begin{quote}
``originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men. Because he
considered males to be the 'superior animal,' Meckel concluded that
females' lack of variation was a sign of inferiority.''
\end{quote}
I particularly like that last part because I suspect that if it turned
out that women were \emph{more} variable, Meckel would have taken that
as a sign of inferiority, too.
Nevertheless, the Variability Hypothesis suggests an exercise we can use
to practice the methods in this chapter. Let's look at the distribution
of heights for men and women in the U.S. and see who is more variable.
I used 2018 data from the CDC's Behavioral
Risk Factor Surveillance System (BRFSS,
\url{https://www.cdc.gov/brfss/annual_data/annual_2018.html}), which includes self-reported
heights from 154407 men and 254722 women.
Here's what I found:
\begin{itemize}
\item
The average height for men is 178 cm; the average height for women is
163 cm. So men are taller on average; no surprise there.
\item
For men the standard deviation is 8.27 cm; for women it is 7.75 cm. So
in absolute terms, men's heights are more variable.
\end{itemize}
But to compare variability between groups, it is more meaningful to use
the coefficient of variation (CV), which is the standard deviation
divided by the mean
(see \url{https://en.wikipedia.org/wiki/Coefficient_of_variation}).
It is a dimensionless measure of variability relative to scale.
For men CV is 0.0465; for women it is 0.0475. The coefficient of
variation is higher for women, so this dataset provides evidence against
the Variability Hypothesis. But we can use Bayesian methods to make that
conclusion more precise.
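These values can be reproduced from the summary statistics above. The following sketch only does the arithmetic; the variable names are chosen for this example.

```python
# Summary statistics quoted above (heights in cm).
mean_m, std_m = 178, 8.27
mean_f, std_f = 163, 7.75

# Coefficient of variation: standard deviation divided by the mean.
cv_m = std_m / mean_m
cv_f = std_f / mean_f

print(round(cv_m, 4), round(cv_f, 4))   # 0.0465 0.0475
```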
Use these summary statistics to compute the posterior distribution of
\py{mu} and \py{sigma} for the
distributions of male and female height. Use
\py{Pmf.div_dist} to compute posterior
distributions of CV. Based on this dataset and the assumption that the
distribution of height is normal, what is the probability that the
coefficient of variation is higher for men? What is the most likely
ratio of the CVs and what is the 90\% credible interval for that ratio?
Hint: Use different prior distributions for the two groups, and choose
them so they cover all parameters with non-negligible probability.
\end{exercise}
\chapter{Observer Bias}
\label{observer}
\section{The Red Line problem}
In Massachusetts, the Red Line is a subway that connects
Cambridge and Boston. When I was working in Cambridge I took the Red
Line from Kendall Square to South Station and caught the commuter rail
to Needham. During rush hour Red Line trains run every 7--8
minutes, on average.
\index{Red Line problem}
\index{Boston}
When I arrived at the station, I could estimate the time until
the next train based on the number of passengers on the platform.
If there were only a few people, I inferred that I just missed
a train and expected to wait about 7 minutes. If there were
more passengers, I expected the train to arrive sooner. But if
there were a large number of passengers, I suspected that
trains were not running on schedule, so I would go back to the
street level and get a taxi.
While I was waiting for trains, I thought about how Bayesian
estimation could help predict my wait time and decide when I
should give up and take a taxi. This chapter presents the
analysis I came up with.
This chapter is based on a project by Brendan Ritter and
Kai Austin, who took a class with me at Olin College.
The code in this chapter is available from
\url{http://thinkbayes.com/redline.py}. The code I used
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
For more information
see Section~\ref{download}.
\index{Olin College}
\section{The model}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
\caption{PMF of gaps between trains, based on collected data,
smoothed by KDE. \py{z} is the actual distribution; \py{zb}
is the biased distribution seen by passengers. }
\label{fig.redline0}
\end{figure}
Before we get to the analysis, we have to make some
modeling decisions. First, I will treat passenger arrivals as
a Poisson process, which means I assume that passengers are equally
likely to arrive at any time, and that they arrive at an unknown
rate, $\lam$, measured in passengers per minute. Since I
observe passengers during a short period of time, and at the same
time every day, I assume that $\lam$ is constant.
\index{Poisson process}
On the other hand, the arrival process for trains is not Poisson.
Trains to Boston are supposed to leave from the end of the line
(Alewife station) every 7--8 minutes during peak times, but by the time
they get to Kendall Square, the time between trains varies between 3
and 12 minutes.
To gather data on the time between trains, I wrote a script that
downloads real-time data from
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
trains arriving at Kendall Square, and records their arrival times
in a database. I ran the script from 4pm to 6pm every weekday
for 5 days, and recorded about 15 arrivals per day. Then
I computed the time between consecutive arrivals; the distribution
of these gaps is shown in Figure~\ref{fig.redline0}, labeled \py{z}.
If you stood on the platform from 4pm to 6pm and recorded the time
between trains, this is the distribution you would see. But if you
arrive at some random time (without regard to the train schedule) you
would see a different distribution. The average time
between trains, as seen by a random passenger, is substantially
higher than the true average.
Why? Because a passenger is more likely to arrive during a
large interval than a small one. Consider a simple example:
suppose that the time between trains is either 5 minutes
or 10 minutes with equal probability. In that case
the average time between
trains is 7.5 minutes.
But a passenger is more likely to arrive during a 10 minute gap
than a 5 minute gap; in fact, twice as likely. If we surveyed
arriving passengers, we would find that 2/3 of them arrived during
a 10 minute gap, and only 1/3 during a 5 minute gap. So the
average time between trains, as seen by an arriving passenger,
is 8.33 minutes.
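Here is the same arithmetic as a short sketch, using exact fractions. The variable names are chosen for this example and are not part of the \py{thinkbayes} library.

```python
from fractions import Fraction

# Actual distribution of gaps: 5 or 10 minutes with equal probability.
gaps = {5: Fraction(1, 2), 10: Fraction(1, 2)}
mean_actual = sum(gap * p for gap, p in gaps.items())

# A passenger lands in a gap with probability proportional to its length.
weights = {gap: gap * p for gap, p in gaps.items()}
total = sum(weights.values())
biased = {gap: w / total for gap, w in weights.items()}
mean_biased = sum(gap * p for gap, p in biased.items())

print(biased[5], biased[10])                   # 1/3 2/3
print(float(mean_actual), float(mean_biased))  # about 7.5 and 8.33
```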
This kind of {\bf observer bias} appears in many contexts. Students
think that classes are bigger than they are because more of them are
in the big classes. Airline passengers think that planes are fuller
than they are because more of them are on full flights.
\index{observer bias}
In each case, values from the actual distribution are
oversampled in proportion to their value. In the Red Line example,
a gap that is twice as big is twice as likely to be observed.
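This relationship can be written compactly. As a sketch, suppose the actual gaps have probabilities $p(z)$, and write $p_b$ and $z_b$ for the biased distribution and the gap seen by a passenger (notation introduced here for illustration). Then $p_b(z) = z \, p(z) / \mathrm{E}[z]$, and the mean gap seen by passengers is
\[ \mathrm{E}[z_b] = \sum_z z \, \frac{z \, p(z)}{\mathrm{E}[z]}
 = \frac{\mathrm{E}[z^2]}{\mathrm{E}[z]}
 = \mathrm{E}[z] + \frac{\mathrm{Var}(z)}{\mathrm{E}[z]} \]
In the example with gaps of 5 and 10 minutes, $\mathrm{E}[z] = 7.5$ and $\mathrm{E}[z^2] = 62.5$, so $\mathrm{E}[z_b] = 62.5 / 7.5 \approx 8.33$ minutes. The more variable the gaps, the larger the difference between the two means.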
So given the actual distribution of gaps, we can compute the
distribution of gaps as seen by passengers. \py{BiasPmf}
does this computation:
\begin{code}
def BiasPmf(pmf):
    new_pmf = pmf.Copy()
    for x, p in pmf.Items():
        new_pmf.Mult(x, x)
    new_pmf.Normalize()
    return new_pmf
\end{code}
\py{pmf} is the actual distribution; \verb"new_pmf" is the
biased distribution. Inside the loop, we multiply the
probability of each value, \py{x}, by the likelihood it will
be observed, which is proportional to \py{x}. Then we
normalize the result.
Figure~\ref{fig.redline0} shows the actual distribution of gaps,
labeled \py{z}, and the distribution of gaps seen by passengers,
labeled \py{zb} for ``z biased''.
\section{Wait times}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
\caption{CDF of \py{z}, \py{zb}, and the wait time seen
by passengers, \py{y}. }
\label{fig.redline2}
\end{figure}
Wait time, which I call \py{y}, is the time between the arrival
of a passenger and the next arrival of a train. Elapsed time, which I
call \py{x}, is the time between the arrival of the previous
train and the arrival of a passenger. I chose these definitions
so that \py{zb = x + y}.
Given the distribution of \py{zb}, we can compute the distribution of
\py{y}. I'll start with a simple case and then generalize.
Suppose, as in the previous example, that \py{zb} is either 5 minutes
with probability 1/3, or 10 minutes with probability 2/3.
If we arrive at a random time during a 5 minute gap,
\py{y} is uniform from 0 to 5 minutes. If we arrive during a 10
minute gap, \py{y} is uniform from 0 to 10. So the overall
distribution is a mixture of uniform distributions weighted
according to the probability of each gap.
\index{uniform distribution}
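For this simple case, the mixture can be computed directly. The following is a minimal sketch using NumPy on a one-minute grid; the grid spacing and the variable names are assumptions made for the example, and it does not use \py{thinkbayes}.

```python
import numpy as np

# Biased gap distribution from the example: gap in minutes -> probability.
pmf_zb = {5: 1/3, 10: 2/3}

# Possible wait times on a 1-minute grid (the spacing is an assumption).
ys = np.arange(0, 11)
pmf_y = np.zeros(len(ys))

for gap, prob in pmf_zb.items():
    # Uniform distribution over 0..gap, both ends included.
    uniform = np.where(ys <= gap, 1.0, 0.0)
    uniform /= uniform.sum()
    # Add this component, weighted by the probability of the gap.
    pmf_y += prob * uniform

mean_y = np.sum(ys * pmf_y)
print(pmf_y.sum(), mean_y)
```

The mean wait time comes out to about 4.17 minutes, which is half the mean of the biased gap distribution, 8.33 minutes.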
The following function takes the distribution of \py{zb} and
computes the distribution of \py{y}:
\begin{code}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)
    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{code}
\py{PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
distribution to its probability. Then it uses \py{MakeMixture},
which we saw in Section~\ref{mixture}, to compute the mixture.
\index{mixture}
\index{MakeMixture}
\index{meta-Pmf}
\py{PmfOfWaitTime} also uses \py{MakeUniformPmf}, defined here:
\begin{code}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{code}
\py{low} and \py{high} are the bounds of the uniform distribution
(both ends included). Finally, \py{MakeUniformPmf} uses
\py{MakeRange}, defined here:
\begin{code}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{code}
\py{MakeRange} defines a set of possible values for wait time
(expressed in seconds). By default it divides the range into
10 second intervals.
To encapsulate the process of computing these distributions, I
created a class called \py{WaitTimeCalculator}:
\begin{code}
class WaitTimeCalculator(object):
    def __init__(self, pmf_z):
        self.pmf_z = pmf_z
        # bias the actual distribution to get the one seen by passengers
        self.pmf_zb = BiasPmf(pmf_z)
        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{code}
The parameter, \verb"pmf_z", is the unbiased distribution of \py{z}.
\verb"pmf_zb" is the biased distribution of gap time, as seen by
passengers.
\verb"pmf_y" is the distribution of wait time. \verb"pmf_x" is the
distribution of elapsed time, which is the same as the distribution of
wait time. To see why, remember that for a particular value of
\py{zb}, the distribution of \py{y} is uniform from 0 to \py{zb}.
Also
\begin{code}
x = zb - y
\end{code}
So the distribution of \py{x} is also uniform from 0 to \py{zb}.
Figure~\ref{fig.redline2} shows the distribution of \py{z}, \py{zb},
and \py{y} based on the data I collected from the Red Line web site.
To present these distributions, I am switching from Pmfs to Cdfs.
Most people are more familiar with Pmfs, but I think Cdfs are easier
to interpret, once you get used to them. And if you want to plot
several distributions on the same axes, Cdfs are the way to go.
\index{Cdf}
\index{cumulative distribution function}
The mean of \py{z} is 7.8 minutes. The mean of \py{zb} is 8.8
minutes, about 13\% higher. The mean of \py{y} is 4.4, half
the mean of \py{zb}.
As an aside, the Red Line schedule reports that trains run every
9 minutes during peak times. This is close to the average of
\py{zb}, but higher than the average of \py{z}. I exchanged email
with a representative of the MBTA, who confirmed that the reported
time between trains is deliberately conservative in order to
account for variability.
\section{Predicting wait times}
\label{elapsed}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
\caption{Prior and posterior of \py{x} and predicted \py{y}. }
\label{fig.redline3}
\end{figure}
Let's get back to the motivating question: suppose that when
I arrive at the platform I see 10 people waiting.
How long should I expect to wait until the next train arrives?
As always, let's start with the easiest version of the problem
and work our way up. Suppose we are given the actual distribution of
\py{z}, and we know that the passenger arrival rate,
$\lam$, is 2 passengers per minute.
In that case we can:
\begin{enumerate}
\item Use the distribution of \py{z} to compute
the prior distribution of \py{zb}, the time between trains
as seen by a passenger.
\item Use the number of passengers to estimate the distribution
of \py{x}, the elapsed time since the last train.
\item Finally, use the relation \py{y = zb - x} to get the
distribution of \py{y}.
\end{enumerate}
The first step is to create a \py{WaitTimeCalculator} that
encapsulates the distributions of \py{zb}, \py{x},
and \py{y}, prior to taking into account the number of
passengers.
\begin{code}
wtc = WaitTimeCalculator(pmf_z)
\end{code}
\verb"pmf_z" is the given distribution of gap times.
The next step is to make an \py{ElapsedTimeEstimator} (defined
below), which encapsulates the posterior distribution of \py{x} and
the predictive distribution of \py{y}.
\index{predictive distribution}
\begin{code}
ete = ElapsedTimeEstimator(wtc,
                           lam=2.0/60,
                           num_passengers=15)
\end{code}
The parameters are the \py{WaitTimeCalculator}, the passenger
arrival rate, \py{lam} (expressed in passengers per second),
and the observed number of passengers, let's say 15.
Here is the definition of \py{ElapsedTimeEstimator}:
\begin{code}
class ElapsedTimeEstimator(object):
    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)
        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))
        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{code}
\verb"prior_x" and \verb"post_x" are the prior and
posterior distributions of elapsed time. \verb"pmf_y" is
the predictive distribution of wait time.
\py{ElapsedTimeEstimator} uses \py{Elapsed} and \py{PredictWaitTime},
defined below.
\py{Elapsed} is a Suite that represents the hypothetical
distribution of \py{x}. The prior distribution of \py{x}
comes straight from the \py{WaitTimeCalculator}. Then we
use the data, which consists of the arrival rate, \py{lam},
and the number of passengers on the platform, to compute
the posterior distribution.
Here's the definition of \py{Elapsed}:
\begin{code}
class Elapsed(thinkbayes.Suite):
    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{code}
As always, \py{Likelihood} takes a hypothesis and data, and
computes the likelihood of the data under the hypothesis.
In this case \py{hypo} is the elapsed time since the last train
and \py{data} is a tuple of \py{lam} and the number of
passengers.
\index{likelihood}
The likelihood of the data is the probability of getting
\py{k} arrivals in \py{x} time, given arrival rate
\py{lam}. We compute that using the PMF of the Poisson
distribution.
\index{Poisson distribution}
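To get a feel for this likelihood, here is a sketch that evaluates it over a range of hypothetical elapsed times, using \py{scipy.stats.poisson} in place of \py{thinkbayes.EvalPoissonPmf}. The grid of elapsed times (10 to 1200 seconds) is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import poisson

lam = 2.0 / 60                 # arrival rate in passengers per second
k = 15                         # passengers observed on the platform
xs = np.arange(10, 1210, 10)   # hypothetical elapsed times in seconds

# Probability of k arrivals in x seconds, for each hypothetical x.
likes = poisson.pmf(k, lam * xs)

best = xs[np.argmax(likes)]
print(best / 60)               # elapsed time, in minutes, with the highest likelihood
```

With 15 passengers and an arrival rate of 2 per minute, the likelihood peaks at an elapsed time of 7.5 minutes, where the expected number of arrivals equals the observed number.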
Finally, here's the definition of \py{PredictWaitTime}:
\begin{code}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{code}
\verb"pmf_zb" is the distribution of gaps between trains;
\verb"pmf_x" is the distribution of elapsed time, based on
the observed number of passengers. Since \py{y = zb - x},
we can compute
\begin{code}
pmf_y = pmf_zb - pmf_x
\end{code}
The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
all pairs of \py{zb} and \py{x}, computes the differences, and adds
the results to \verb"pmf_y".
The resulting Pmf includes some negative values, which we know are
impossible. For example, if you arrive during a gap of 5 minutes, you
can't wait more than 5 minutes. \py{RemoveNegatives} removes the
impossible values from the distribution and renormalizes.
\begin{code}
def RemoveNegatives(pmf):
    # copy the values so we can remove items while iterating
    for val in list(pmf.Values()):
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{code}
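To make the subtraction and truncation concrete, here is a self-contained sketch that uses plain dictionaries in place of \py{thinkbayes.Pmf}. The function names and the toy distributions are hypothetical, chosen for illustration. Like the code above, it treats the two distributions as independent and then discards the impossible negative differences.

```python
def sub_pmfs(pmf_a, pmf_b):
    """Distribution of a - b for independent a and b, given as dicts."""
    result = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            result[a - b] = result.get(a - b, 0) + pa * pb
    return result

def remove_negatives(pmf):
    """Drop negative values and renormalize; returns a new dict."""
    kept = {val: p for val, p in pmf.items() if val >= 0}
    total = sum(kept.values())
    return {val: p / total for val, p in kept.items()}

# Toy distributions in minutes; these are not the Red Line data.
pmf_zb = {5: 1/3, 10: 2/3}
pmf_x = {0: 0.25, 4: 0.5, 8: 0.25}

diff = sub_pmfs(pmf_zb, pmf_x)
pmf_y = remove_negatives(diff)
print(sorted(pmf_y.items()))
```

In this toy example the only negative difference is $5 - 8 = -3$, with probability 1/12, so the remaining probabilities are scaled up by a factor of 12/11.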
Figure~\ref{fig.redline3} shows the results. The prior distribution
of \py{x} is the same as the distribution of \py{y} in
Figure~\ref{fig.redline2}. The posterior distribution of \py{x}
shows that, after seeing 15 passengers on the platform, we believe
that the time since the last train is probably 5--10 minutes. The
predictive distribution of \py{y} indicates that we expect the next
train in less than 5 minutes, with about 80\% confidence.
\index{predictive distribution}
\section{Estimating the arrival rate}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
\caption{Prior and posterior distributions of \py{lam} based
on five days of passenger data. }
\label{fig.redline1}
\end{figure}
The analysis so far has been based on the assumption that we know (1)
the distribution of gaps and (2) the passenger arrival rate. Now we
are ready to relax the second assumption.
Suppose that you just moved to Boston, so you don't know much about
the passenger arrival rate on the Red Line. After a few days of
commuting, you could make a guess, at least qualitatively. With
a little more effort, you could estimate $\lam$ quantitatively.
\index{arrival rate}
Each day when you arrive at the platform, you should note the
time and the number of passengers waiting (if the platform is too
big, you could choose a sample area). Then you should record your
wait time and the
number of new arrivals while you are waiting.
After five days, you might have data like this:
\begin{code}
k1      y    k2
--     ---   --
17     4.6    9
22     1.0    0
23     1.4    4
18     5.4   12
 4     5.8   11
\end{code}
where \py{k1} is the number of passengers waiting when you arrive,
\py{y} is your wait time in minutes, and \py{k2} is the number of
passengers who arrive while you are waiting.
Over the course of one week, you waited 18 minutes and saw 36
passengers arrive, so you would estimate that the arrival rate is
2 passengers per minute. For practical purposes that estimate is
good enough, but for the sake of completeness I
will compute a posterior distribution for $\lam$ and show how
to use that distribution in the rest of the analysis.
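The point estimate is just the total number of arrivals divided by the total wait time. As a quick sketch of that arithmetic, using the data from the table (the variable names are chosen for this example):

```python
# (k1, y, k2) tuples from the table; y is in minutes.
data = [(17, 4.6, 9), (22, 1.0, 0), (23, 1.4, 4), (18, 5.4, 12), (4, 5.8, 11)]

total_wait = sum(y for k1, y, k2 in data)        # about 18.2 minutes
total_arrivals = sum(k2 for k1, y, k2 in data)   # 36 passengers

rate = total_arrivals / total_wait               # passengers per minute
print(total_wait, total_arrivals, rate)
```

The result is a little under 2 passengers per minute.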
\py{ArrivalRate} is a \py{Suite} that represents hypotheses about
$\lam$. As always, \py{Likelihood} takes a hypothesis and data,
and computes the likelihood of the data under the hypothesis.
In this case the hypothesis is a value of $\lam$. The data is a
pair, \py{y, k}, where \py{y} is a wait time and \py{k} is the
number of passengers that arrived.
\begin{code}
class ArrivalRate(thinkbayes.Suite):
    def Likelihood(self, data, hypo):
        lam = hypo
        y, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * y)
        return like
\end{code}
This \py{Likelihood} might look familiar; it
is almost identical to \py{Elapsed.Likelihood} in
Section~\ref{elapsed}. The difference is that in
\py{Elapsed.Likelihood} the hypothesis is \py{x}, the elapsed time; in
\py{ArrivalRate.Likelihood} the hypothesis is \py{lam}, the arrival
rate. But in both cases the likelihood is the probability of seeing
\py{k} arrivals in some period of time, given \py{lam}.
\py{ArrivalRateEstimator} encapsulates the process of estimating
$\lam$. The parameter, \verb"passenger_data", is a list
of \py{k1, y, k2} tuples, as in the table above.
\index{numpy}
\begin{code}
class ArrivalRateEstimator(object):
    def __init__(self, passenger_data):
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60
        self.prior_lam = ArrivalRate(hypos)
        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{code}
\verb"__init__" builds
\py{hypos}, which is a sequence of hypothetical values for \py{lam},
then builds the prior distribution, \verb"prior_lam".
The \py{for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".
Figure~\ref{fig.redline1} shows
the prior and posterior distributions. As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute. But the spread of the posterior distribution captures our
uncertainty about $\lam$ based on a small sample.
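The same update can be sketched with NumPy arrays instead of a \py{Suite}. This version is an illustration under two assumptions: it keeps \py{lam} in passengers per minute, to match the table, rather than converting to seconds as \py{ArrivalRateEstimator} does; and it drops the constant factor of the Poisson PMF, which cancels when we normalize.

```python
import numpy as np

# (k1, y, k2) tuples from the table; y is in minutes.
data = [(17, 4.6, 9), (22, 1.0, 0), (23, 1.4, 4), (18, 5.4, 12), (4, 5.8, 11)]

lams = np.linspace(0, 5, 51)      # hypothetical rates, passengers per minute
posterior = np.ones(len(lams))    # uniform prior

for k1, y, k2 in data:
    mu = lams * y                 # expected arrivals during the wait
    # Poisson likelihood of k2 arrivals, up to a constant factor.
    posterior *= mu**k2 * np.exp(-mu)

posterior /= posterior.sum()
post_mean = np.sum(lams * posterior)
print(post_mean)
```

The posterior mean should land close to the point estimate of 2 passengers per minute.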
\section{Incorporating uncertainty}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of \py{y} for possible values
of \py{lam}. }
\label{fig.redline4}
\end{figure}
Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}
\begin{enumerate}
\item Implement the analysis based on a deterministic value of the
uncertain parameter (in this case $\lam$).
\item Compute the distribution of the uncertain parameter.
\item Run the analysis for each value of the parameter, and generate a
set of predictive distributions.
\index{predictive distribution}
\item Compute a mixture of the predictive distributions, using the
weights from the distribution of the parameter.
\index{mixture}
\end{enumerate}
We have already done steps (1) and (2). I wrote a class
called \py{WaitMixtureEstimator} to handle steps (3) and (4).
\begin{code}
class WaitMixtureEstimator(object):
    def __init__(self, wtc, are, num_passengers=15):
        self.metapmf = thinkbayes.Pmf()
        for lam, prob in sorted(are.post_lam.Items()):
            ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
            self.metapmf.Set(ete.pmf_y, prob)
        self.mixture = thinkbayes.MakeMixture(self.metapmf)
\end{code}
\py{wtc} is the \py{WaitTimeCalculator} that contains the
distribution of \py{zb}. \py{are} is the \py{ArrivalRateEstimator}
that contains the distribution of \py{lam}.
The first line makes a meta-Pmf that maps from each possible
distribution of \py{y} to its probability. For each value
of \py{lam}, we use \py{ElapsedTimeEstimator} to
compute the corresponding distribution of
\py{y} and store it in the meta-Pmf. Then
we use \py{MakeMixture} to compute the mixture.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}
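To illustrate steps (3) and (4) in isolation, here is a sketch with a deliberately simple stand-in analysis: predicting the number of passengers who arrive in one minute. The three values of \py{lam} and their probabilities are hypothetical, and the sketch uses \py{scipy.stats.poisson} rather than the classes above.

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical posterior over the arrival rate (passengers per minute).
lams = np.array([1.5, 2.0, 2.5])
probs = np.array([0.25, 0.5, 0.25])

ks = np.arange(0, 30)   # possible numbers of arrivals in one minute

# Step 3: one predictive distribution per value of the parameter.
predictives = [poisson.pmf(ks, lam) for lam in lams]

# Step 4: mixture weighted by the probability of each parameter value.
mixture = sum(p * pred for p, pred in zip(probs, predictives))

print(mixture.sum(), np.sum(ks * mixture))
```

The mixture is itself a probability distribution, and its mean is the weighted average of the component means.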
Figure~\ref{fig.redline4} shows the results. The shaded lines
in the background are the distributions of \py{y} for each value
of \py{lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.
In this case we could get a very similar result using a single point
estimate of \py{lam}. So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.
In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output. In this case, posterior variability
in \py{lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}
\section{Decision analysis}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform. }
\label{fig.redline5}
\end{figure}
At this point we can use the number of passengers on the platform
to predict the distribution of wait times. Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}
Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail. Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.
In that case I would like to know the probability that \py{y} exceeds
15 minutes as a function of \verb"num_passengers". It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".
But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.
I only have data from one week,
and the longest delay I observed was 15 minutes. So I can't
estimate the frequency of longer delays accurately.
However, I can use previous observations to make at least a coarse
estimate. When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop. So I estimate that there are about
3 major delays per year.
But remember that my observations are biased. I am more likely
to observe long delays because they affect a large number
of passengers. So we should treat my observations as a sample
of \py{zb} rather than \py{z}. Here's how we can do that.
\index{observer bias}
During my year of commuting, I took the Red Line home about 220
times. So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:
\begin{code}
n = 220
cdf_z = thinkbayes.MakeCdfFromList(gap_times)
sample_z = cdf_z.Sample(n)
pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{code}
Next I bias \verb"pmf_z" to get the distribution of
\py{zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):
\begin{code}
cdf_zb = BiasPmf(pmf_z).MakeCdf()
sample_zb = cdf_zb.Sample(n) + [1800, 2400, 3000]
\end{code}
\py{Cdf.Sample} is more efficient than \py{Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.
Next I use the sample of \py{zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:
\begin{code}
pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
xs = MakeRange(low=60)
pmf_zb = pdf_zb.MakePmf(xs)
\end{code}
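For readers who want to see the idea without \py{thinkbayes}, here is a minimal sketch of the same two steps, using \py{scipy.stats.gaussian\_kde} as a stand-in for \py{EstimatedPdf}. The sample is made up for illustration (it is not my data), and the grid of gap times is an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Made-up gap times in seconds, plus three long delays.
typical = rng.normal(480, 120, size=220)
sample = np.concatenate([typical, [1800, 2400, 3000]])

kde = gaussian_kde(sample)       # estimate a smooth density from the sample
xs = np.arange(60, 3610, 10)     # grid of gap times in seconds (assumed)
ps = kde(xs)                     # evaluate the density on the grid
pmf = ps / ps.sum()              # normalize to get a discrete distribution
```

Evaluating the density on a grid and normalizing is what turns the continuous estimate into a discrete distribution we can use in the rest of the analysis.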
Finally I unbias the distribution of \py{zb} to get the
distribution of \py{z}, which I use to create the
\py{WaitTimeCalculator}:
\begin{code}
pmf_z = UnbiasPmf(pmf_zb)
wtc = WaitTimeCalculator(pmf_z)
\end{code}
This process is complicated, but
all of the steps are operations we have seen before.
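One step deserves a closer look: \py{UnbiasPmf} is not shown in this chapter. Presumably it inverts \py{BiasPmf} by dividing each probability by its value and renormalizing. Here is a minimal sketch of that idea with plain dictionaries; the function names are hypothetical, and this is not the library's implementation.

```python
def bias_pmf(pmf):
    """Weight each probability by its value, then normalize."""
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}

def unbias_pmf(pmf):
    """Divide each probability by its value, then normalize."""
    weighted = {x: p / x for x, p in pmf.items()}   # assumes every x > 0
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}

# Round trip on the example with gaps of 5 and 10 minutes.
pmf_z = {5: 0.5, 10: 0.5}
pmf_zb = bias_pmf(pmf_z)
recovered = unbias_pmf(pmf_zb)
print(pmf_zb, recovered)
```

Biasing and then unbiasing recovers the original distribution, up to floating-point error.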
Now we are ready to compute the probability of a long wait.
\begin{code}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{code}
Given the number of passengers on the platform,
\py{ProbLongWait}
makes an \py{ElapsedTimeEstimator},
extracts the distribution of wait time, and
computes
the probability that wait time
exceeds \py{minutes}.
Figure~\ref{fig.redline5} shows the result. When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.
If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.
Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.
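As a rough sketch of that last step, suppose that waiting costs nothing unless we miss the connection, and that a taxi always makes the connection. Both are simplifying assumptions, and the dollar amounts below are hypothetical. Under those assumptions, we should take a taxi when the expected cost of waiting exceeds the fare.

```python
def should_take_taxi(prob_long_wait, cost_miss, cost_taxi):
    """Return True if the expected cost of waiting exceeds the taxi fare."""
    expected_cost_waiting = prob_long_wait * cost_miss
    return expected_cost_waiting > cost_taxi

# Hypothetical costs in dollars.
cost_miss = 100
cost_taxi = 20

for prob in [0.05, 0.1, 0.3]:
    print(prob, should_take_taxi(prob, cost_miss, cost_taxi))
```

With these numbers the break-even probability is 20/100 = 0.2, so the threshold on the number of passengers would be the point where the probability of a long wait crosses 20\%.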
\section{Discussion}
The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day. For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions. For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of \py{lam} would be too low, so the
estimates of \py{x} and \py{y} would be too high.
If special events are as common as major delays, it would
be important to include them in the model. We could do that by
extending the distribution of \py{lam} to include occasional
large values.
We started with the assumption that we know the
distribution of \py{z}.
As an alternative, a passenger could estimate \py{z}, but it would
not be easy.
As a passenger, you
observe only your own wait time, \py{y}. Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, \py{z}.
However, we could make some inferences about \py{zb}. If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, \py{x}. Then we observe
\py{y}. If we add the posterior distribution of \py{x} to
the observed \py{y}, we get a distribution that represents
our posterior belief about the observed value of \py{zb}.
We can use this distribution to update our beliefs about the
distribution of \py{zb}. Finally, we can compute the
inverse of \py{BiasPmf} to get from the distribution of \py{zb}
to the distribution of \py{z}.
I leave this analysis as an exercise for the
reader. One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.
\section{Exercises}
\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}
\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lambda$. Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm. $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm. What is the posterior distribution of $\lambda$?
\end{quote}
You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.
\end{exercise}
\chapter{Hypothesis Testing}
\label{hypotest}
\section{Back to the Euro problem}
In Section~\ref{euro} I presented a problem from MacKay's {\it Information
Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}
\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:
\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}
But do these data give evidence that the coin is biased rather than fair?
\end{quote}
We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}
In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}
In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}
If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}. In fact, we already wrote the function
that does it.
\begin{code}
def Likelihood(self, data, hypo):
    x = hypo / 100.0
    heads, tails = data
    like = x**heads * (1-x)**tails
    return like
\end{code}
To use it we can
create a \py{Euro} suite and invoke
\py{Likelihood}:
\begin{code}
suite = Euro()
likelihood = suite.Likelihood(data, 50)
\end{code}
\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.
It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.
One possibility is to cheat and look at the data before we define
the hypothesis. In that case we would say that ``biased'' means that
the probability of heads is 140/250.
\begin{code}
actual_percent = 100.0 * 140 / 250
likelihood = suite.Likelihood(data, actual_percent)
\end{code}
This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1. So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}
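As a check on these numbers, here is a self-contained sketch that computes both likelihoods directly, without the \py{Euro} class. The function name \py{likelihood} is hypothetical; the counts are the ones from the article.

```python
def likelihood(heads, tails, x):
    """Probability of one particular sequence with these counts, if P(heads) = x."""
    return x**heads * (1 - x)**tails

heads, tails = 140, 110

like_fair = likelihood(heads, tails, 0.5)
like_cheat = likelihood(heads, tails, 140.0 / 250)
ratio = like_cheat / like_fair

print(like_fair)    # about 5.5e-76
print(ratio)        # about 6.1
```

The individual likelihoods are tiny, but their ratio is an ordinary number, which is why the ratio is the quantity worth reporting.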
But using the data to formulate the hypothesis
is obviously bogus. By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}
\section{Making a fair comparison}
\label{suitelike}
To make a legitimate comparison, we have to define $B$ without looking
at the data. So let's try a different definition. If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side. You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely. So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''
We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses. We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.
\begin{code}
like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
likelihood = 0.5 * like40 + 0.5 * like60
\end{code}
The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}
More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$. In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.
\begin{code}
b_uniform = Euro(xrange(0, 101))
b_uniform.Remove(50)
b_uniform.Normalize()
\end{code}
I initialize \verb"b_uniform" with values from 0 to 100 and then
remove the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair. Whether you remove it or not
has almost no effect on the result.
To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.
\begin{code}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{code}
The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}
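The same weighted average can be sketched without the Suite class, which makes it easy to check the Bayes factors for \verb"B_two" and \verb"b_uniform". In this sketch a prior is a plain dictionary that maps each percentage to its probability; the function names are hypothetical.

```python
def likelihood(heads, tails, x):
    """Probability of one particular sequence with these counts, if P(heads) = x."""
    return x**heads * (1 - x)**tails

def suite_likelihood(prior, heads, tails):
    """Average the likelihood over sub-hypotheses, weighted by prior probability.

    prior maps each hypothetical percentage (0-100) to its probability.
    """
    total = 0.0
    for percent, prob in prior.items():
        total += prob * likelihood(heads, tails, percent / 100.0)
    return total

heads, tails = 140, 110
like_fair = likelihood(heads, tails, 0.5)

# B_two: x is either 40% or 60%, with equal probability
b_two = {40: 0.5, 60: 0.5}

# b_uniform: every percentage from 0 to 100 except 50, equally likely
values = [v for v in range(0, 101) if v != 50]
b_uniform = dict((v, 1.0 / len(values)) for v in values)

ratio_two = suite_likelihood(b_two, heads, tails) / like_fair
ratio_uniform = suite_likelihood(b_uniform, heads, tails) / like_fair
```

With these priors the two ratios come out near 1.3 and 0.47.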
If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update. To refresh your memory, here's the \py{Update} function:
\begin{code}
def Update(self, data):
    for hypo in self.Values():
        like = self.Likelihood(data, hypo)
        self.Mult(hypo, like)
    return self.Normalize()
\end{code}
And here's \py{Normalize}:
\begin{code}
def Normalize(self):
    total = self.Total()
    factor = 1.0 / total
    for x in self.d:
        self.d[x] *= factor
    return total
\end{code}
The return value from \py{Normalize} is the total of the
probabilities in the Suite before normalization. After \py{Update}
has multiplied each prior probability by a likelihood, that total is
the average of the likelihoods for the sub-hypotheses, weighted by
the prior probabilities. And \py{Update} passes this value along, so
instead of using \verb"SuiteLikelihood", we could compute the likelihood of
\verb"b_uniform" like this:
\begin{code}
likelihood = b_uniform.Update(data)
\end{code}
\section{The triangle prior}
In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%. If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}
\begin{code}
b_triangle = TrianglePrior()
likelihood = b_triangle.Update(data)
\end{code}
The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}
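To see that the normalizing constant really is the likelihood of the suite, here is a sketch with a dictionary standing in for the Suite. It assumes the triangle prior gives weight $v$ to each percentage $v$ up to 50 and weight $100-v$ above 50; the names \verb"triangle_prior" and \py{update} are hypothetical.

```python
def likelihood(heads, tails, x):
    """Probability of one particular sequence with these counts, if P(heads) = x."""
    return x**heads * (1 - x)**tails

def triangle_prior():
    """Triangle-shaped prior over percentages 0..100, peaked at 50 (assumed shape)."""
    weights = {}
    for v in range(0, 51):
        weights[v] = float(v)
    for v in range(51, 101):
        weights[v] = float(100 - v)
    total = sum(weights.values())
    return dict((v, w / total) for v, w in weights.items())

def update(suite, heads, tails):
    """Update the dict in place and return the normalizing constant."""
    for v in suite:
        suite[v] *= likelihood(heads, tails, v / 100.0)
    total = sum(suite.values())
    for v in suite:
        suite[v] /= total
    return total

heads, tails = 140, 110
b_triangle = triangle_prior()
like_triangle = update(b_triangle, heads, tails)
ratio_triangle = like_triangle / likelihood(heads, tails, 0.5)
```

The value returned by \py{update} is the prior-weighted average likelihood, and the ratio comes out near 0.84.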
The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}
\begin{tabular}{|l|r|r|}
\hline
Hypothesis & Likelihood & Bayes \\
& $\times 10^{-76}$ & Factor \\
\hline
$F$ & 5.5 & -- \\
\verb"B_cheat" & 34 & 6.1 \\
\verb"B_two" & 7.4 & 1.3 \\
\verb"B_uniform" & 2.6 & 0.47 \\
\verb"B_triangle" & 4.6 & 0.84 \\
\hline
\end{tabular}
Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.
In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means. This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.
My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.
\section{Discussion}
The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$. In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}
Part of the answer is historical. Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:
\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}
In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.'' Other authors have suggested variations on the
wording. To avoid arguing about adjectives, we could think about odds
instead.
If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1. In terms of probability,
the data changed your degree of belief from 50\% to about 67\%. For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.
On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, which corresponds to
a probability of more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.
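The conversion from odds to probability is simple enough to write down. This sketch assumes odds are expressed as a single number (so 2:1 is 2.0); the function name is hypothetical.

```python
def posterior_probability(prior_odds, bayes_factor):
    """Convert prior odds and a Bayes factor into a posterior probability."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

p_weak = posterior_probability(1.0, 2.0)      # 2/3
p_strong = posterior_probability(1.0, 100.0)  # 100/101
```

Thinking in these terms makes it easier to judge whether a given Bayes factor matters for the decision at hand.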
\section{The beta distribution}
\label{beta}
\index{beta distribution}
There is one more optimization that solves this problem
even faster.
So far we have used a Pmf object to represent a discrete set of
values for \py{x}. Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}
The beta distribution is defined on the interval from 0 to 1
(including both), so it is a natural choice for describing
proportions and probabilities. But wait, it gets better.
It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}. That means that if the prior
distribution for \py{x} is a beta distribution, the posterior is also
a beta distribution. But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}
The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or \py{alpha} and \py{beta}. If the prior
is a beta distribution with parameters \py{alpha} and \py{beta}, and
we see data with \py{h} heads and \py{t} tails, the posterior is a
beta distribution with parameters \py{alpha+h} and \py{beta+t}. In
other words, we can do an update with two additions.
\index{parameter}
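To see why the update reduces to two additions, it helps to write out the proportionality, ignoring normalizing constants. Here $x$ is the probability of heads, and $h$ and $t$ are the observed numbers of heads and tails:
\begin{align*}
\text{prior:} \quad & p(x) \propto x^{\alpha-1} (1-x)^{\beta-1} \\
\text{likelihood:} \quad & p(D \mid x) \propto x^{h} (1-x)^{t} \\
\text{posterior:} \quad & p(x \mid D) \propto x^{\alpha+h-1} (1-x)^{\beta+t-1}
\end{align*}
The posterior has the same form as the prior, with $\alpha$ replaced by $\alpha+h$ and $\beta$ replaced by $\beta+t$.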
So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior. Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match. The
beta distribution with \py{alpha=1} and \py{beta=1} is uniform from
0 to 1.
Let's see how we can take advantage of all this.
\py{thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}
\begin{code}
class Beta(object):
    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{code}
By default \verb"__init__" makes a uniform distribution.
\py{Update} performs a Bayesian update:
\begin{code}
def Update(self, data):
    heads, tails = data
    self.alpha += heads
    self.beta += tails
\end{code}
\py{data} is a pair of integers representing the number of
heads and tails.
So we have yet another way to solve the Euro problem:
\begin{code}
beta = thinkbayes.Beta()
beta.Update((140, 110))
print beta.Mean()
\end{code}
\py{Beta} provides \py{Mean}, which
computes a simple function of \py{alpha}
and \py{beta}:
\begin{code}
def Mean(self):
    return float(self.alpha) / (self.alpha + self.beta)
\end{code}
For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.
\py{Beta} also provides \py{EvalPdf}, which evaluates
the probability density
function (PDF) of the beta distribution:
\index{probability density function}
\index{PDF}
\begin{code}
def EvalPdf(self, x):
    # unnormalized density: omits the constant 1 / B(alpha, beta)
    return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{code}
Finally, \py{Beta} provides \py{MakePmf}, which
uses \py{EvalPdf} to generate a discrete approximation
of the beta distribution.
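Here is a sketch of how such a discrete approximation might work. It is a simplified stand-in, not the code in \py{thinkbayes.py}: the method names are lowercase to mark them as hypothetical, and the result is a plain dictionary rather than a Pmf.

```python
class Beta(object):
    """Minimal stand-in for the Beta class described above."""

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta

    def update(self, data):
        heads, tails = data
        self.alpha += heads
        self.beta += tails

    def eval_pdf(self, x):
        # unnormalized density; the constant cancels when we normalize below
        return x**(self.alpha - 1) * (1 - x)**(self.beta - 1)

    def make_pmf(self, steps=101):
        """Evaluate the density on an evenly spaced grid and normalize."""
        xs = [i / (steps - 1.0) for i in range(steps)]
        probs = [self.eval_pdf(x) for x in xs]
        total = sum(probs)
        return dict((x, p / total) for x, p in zip(xs, probs))

beta = Beta()
beta.update((140, 110))
pmf = beta.make_pmf()
pmf_mean = sum(x * p for x, p in pmf.items())
```

Because \verb"eval_pdf" omits the normalizing constant, the probabilities are normalized after evaluation. For the Euro data, the mean of the approximation is close to the exact posterior mean, $141/252$.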
\section{Exercises}
\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}
What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not? Or are you
more skeptical about it? Write down your prior odds.
Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?
Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$. In the Euro problem,
if you are convinced that \py{x} is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}
This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}
Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.'' For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}
\end{exercise}
\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.
Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP. Would that change your mind?
If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's article, ``Of
Miracles,'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}
\end{exercise}
\chapter{Evidence}
\label{evidence}
\section{Interpreting SAT scores}
Suppose you are the Dean of Admission at a small engineering
college in Massachusetts, and you are considering two candidates,
Alice and Bob, whose qualifications are similar in many ways,
with the exception that Alice got a higher score on the Math
portion of the SAT, a standardized test intended to measure
preparation for college-level work in mathematics.
\index{SAT}
\index{standardized test}
If Alice got 780 and Bob got 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}
Now in reality, both scores are very good, and both
candidates are probably well prepared for college math. So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students. But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''
To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model. I'll pretend, temporarily, that
all SAT questions are equally difficult. Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}
But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly. This
simplification makes it easy to compute the likelihood of a given
score.
\section{The scale}
In order to understand SAT scores, we have to understand the scoring
and scaling process. Each test-taker gets a raw score based on the
number of correct and incorrect questions. The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}
In 2009, there were 54 questions on the math SAT. The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.
The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores. I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}
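To make the two lookups concrete, here is a sketch of a piecewise-linear interpolator. It is an illustration under my own assumptions, not necessarily the code in \py{sat.py}: the method names are hypothetical, and the scale below is made up for the example and is not the College Board's table.

```python
import bisect

class Interpolator(object):
    """Piecewise-linear map between two strictly increasing sequences."""

    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def lookup(self, x):
        """Forward lookup: map a value of x to y."""
        return self._interpolate(x, self.xs, self.ys)

    def reverse(self, y):
        """Reverse lookup: map a value of y back to x."""
        return self._interpolate(y, self.ys, self.xs)

    def _interpolate(self, value, source, target):
        # clamp values that fall outside the table
        if value <= source[0]:
            return target[0]
        if value >= source[-1]:
            return target[-1]
        i = bisect.bisect(source, value)
        frac = float(value - source[i - 1]) / (source[i] - source[i - 1])
        return target[i - 1] + frac * (target[i] - target[i - 1])

# made-up scale for illustration only; NOT the College Board's table
raw_scores = [0, 20, 40, 54]
scaled_scores = [200, 450, 650, 800]
scale = Interpolator(raw_scores, scaled_scores)

scaled = scale.lookup(47)     # halfway between the last two table entries
raw = scale.reverse(725)      # maps back to the same raw score
```

Because both sequences are increasing, the same interpolation routine serves in both directions with the roles of the two lists swapped.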
You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.
\section{The prior}
The College Board also publishes the distribution of scaled scores
for all test-takers. If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of \verb"p_correct".
So we can use the distribution of raw scores to model the
prior distribution of \verb"p_correct".
Here is the code that reads and processes the data:
\begin{code}
class Exam(object):
    def __init__(self):
        self.scale = ReadScale()
        scores = ReadRanks()
        score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
        self.raw = self.ReverseScale(score_pmf)
        self.max_score = max(self.raw.Values())
        self.prior = DivideValues(self.raw, self.max_score)
\end{code}
\py{Exam} encapsulates the information we have about the exam.
\py{ReadScale} and \py{ReadRanks} read files and return
objects that contain the data:
\py{self.scale} is the \py{Interpolator} that converts
from raw to scaled scores and back; \py{scores} is a list
of (score, frequency) pairs.
\verb"score_pmf" is the Pmf of
scaled scores. \py{self.raw} is the Pmf of raw scores, and
\py{self.prior} is the Pmf of \verb"p_correct".
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of \py{p\_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}
Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct". This distribution is approximately Gaussian, but it
is compressed at the extremes. By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}
For each test-taker, I define a Suite called \py{Sat} that
represents the distribution of \verb"p_correct". Here's the definition:
\begin{code}
class Sat(thinkbayes.Suite):
    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)
        self.exam = exam
        self.score = score
        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)
        # update based on an exam score
        self.Update(score)
\end{code}
\verb"__init__" takes an Exam object and a scaled score. It makes a
copy of the prior distribution and then updates itself based on the
exam score.
As usual, we inherit \py{Update} from \py{Suite} and provide
\py{Likelihood}:
\begin{code}
def Likelihood(self, data, hypo):
    p_correct = hypo
    score = data
    k = self.exam.Reverse(score)
    n = self.exam.max_score
    like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
    return like
\end{code}
\py{hypo} is a hypothetical
value of \verb"p_correct", and \py{data} is a scaled score.
To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers. With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}
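For reference, here is a sketch of the binomial probability itself. It is not the implementation of \py{EvalBinomialPmf}; the function name is hypothetical, it assumes $k$ is an integer, and the value of $k$ below is invented for the example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    coeff = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
    return coeff * p**k * (1 - p)**(n - k)

n = 54            # number of questions on the 2009 math SAT
k = 50            # hypothetical number of correct answers

# likelihood of that score under three hypothetical values of p_correct
likes = dict((p, binomial_pmf(k, n, p)) for p in [0.80, 0.90, 0.95])
```

The likelihood is highest for values of \verb"p_correct" near $k/n$ and falls off on either side, which is what drives the update.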
\section{Posterior}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of \py{p\_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}
Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.
Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?'' We can use the
posterior distributions of \verb"p_correct" to answer this question.
To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:
\begin{itemize}
\item $A$: \verb"p_correct" is higher for Alice than for Bob.
\item $B$: \verb"p_correct" is higher for Bob than for Alice.
\end{itemize}
To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.
So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:
\begin{code}
class TopLevel(thinkbayes.Suite):
    def Update(self, data):
        a_sat, b_sat = data
        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)
        a_like += c_like / 2
        b_like += c_like / 2
        self.Mult('A', a_like)
        self.Mult('B', b_like)
        self.Normalize()
\end{code}
Usually when we define a new Suite, we inherit \py{Update}
and provide \py{Likelihood}. In this case I override \py{Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.
The data passed to \py{Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".
\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.
\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values. If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero. So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".
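The enumeration of pairs and the handling of ties can be sketched with plain dictionaries. The function names and the two small distributions below are hypothetical; they stand in for the posterior distributions of \verb"p_correct".

```python
def prob_greater(pmf1, pmf2):
    """Probability that a value drawn from pmf1 exceeds one drawn from pmf2."""
    return sum(p1 * p2
               for v1, p1 in pmf1.items()
               for v2, p2 in pmf2.items()
               if v1 > v2)

def prob_equal(pmf1, pmf2):
    """Probability that independent draws from the two PMFs are equal."""
    return sum(p1 * pmf2.get(v1, 0.0) for v1, p1 in pmf1.items())

# hypothetical posteriors over a coarse grid of p_correct values
alice = {0.8: 0.2, 0.9: 0.5, 1.0: 0.3}
bob = {0.8: 0.4, 0.9: 0.4, 1.0: 0.2}

a_like = prob_greater(alice, bob)
b_like = prob_greater(bob, alice)
c_like = prob_equal(alice, bob)

# treat ties as round-off error and split them evenly
a_like += c_like / 2
b_like += c_like / 2
```

On a grid this coarse the ties carry a lot of probability; with a finer grid their share shrinks, as described above.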
Here is the code that creates \py{TopLevel} and updates it:
\begin{code}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)
top = TopLevel('AB')
top.Update((a_sat, b_sat))
top.Print()
\end{code}
The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21. The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions. If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}
\section{A better model}
Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.
But how big is the modeling error? If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough. If it's large,
we need a better model.
\index{modeling error}
In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small. So if
you are satisfied with the simple model, you can skip to the next
chapter. If you want to see how the more realistic model works,
read on. The model is based on three assumptions:
\begin{itemize}
\item Assume that each test-taker has some
degree of \py{efficacy}, which measures their
ability to answer SAT questions.
\index{efficacy}
\item Assume that each question has some level of
\py{difficulty}.
\item Finally, assume that the chance that a test-taker answers a
question correctly is related to \py{efficacy} and \py{difficulty}
according to this function:
\begin{code}
import math
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{code}
\end{itemize}
This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}. {\tt
efficacy} and \py{difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}
When \py{efficacy} and \py{difficulty} are equal, the
probability of getting the question right is 50\%. As
\py{efficacy} increases, this probability approaches 100\%.
As it decreases (or as \py{difficulty} increases), the
probability approaches 0\%.
Given the distribution of \py{efficacy} across test-takers
and the distribution of \py{difficulty} across questions, we
can compute the expected distribution of raw scores. We'll do that
in two steps. First, for a person with given \py{efficacy},
we'll compute the distribution of raw scores.
\begin{code}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])
    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{code}
\py{difficulties} is a list of difficulties, one for each question.
\py{ps} is a list of probabilities, and \py{pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:
\begin{code}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{code}
\py{dist} is the sum of these Pmfs. Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums. In order to use Python's \py{sum} to add up Pmfs,
we have to provide \py{pmf0} which is the identity for Pmfs,
so \py{pmf + pmf0} is always \py{pmf}.
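To show what adding Pmfs amounts to, here is a sketch that uses plain dictionaries in place of Pmf objects. The function names and the three probabilities are hypothetical.

```python
def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of independent draws from two dict-based PMFs."""
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            result[v1 + v2] = result.get(v1 + v2, 0.0) + p1 * p2
    return result

def binary_pmf(p):
    """Two-valued PMF: 1 (correct) with probability p, otherwise 0."""
    return {1: p, 0: 1 - p}

def pmf_correct(ps):
    """Distribution of the number of correct answers, one probability per question."""
    dist = {0: 1.0}          # identity: all probability on a sum of zero
    for p in ps:
        dist = add_pmfs(dist, binary_pmf(p))
    return dist

# hypothetical probabilities of answering three questions correctly
dist = pmf_correct([0.9, 0.5, 0.2])
```

The starting value \verb"{0: 1.0}" plays the role of \py{pmf0}: adding it to any distribution leaves that distribution unchanged.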
If we know a person's efficacy, we can compute their distribution
of raw scores. For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture. Here's the code
that computes the mixture:
\begin{code}
# class Exam:
    def MakeRawScoreDist(self, efficacies):
        pmfs = thinkbayes.Pmf()
        for efficacy, prob in efficacies.Items():
            scores = PmfCorrect(efficacy, self.difficulties)
            pmfs.Set(scores, prob)
        mix = thinkbayes.MakeMixture(pmfs)
        return mix
\end{code}
\py{MakeRawScoreDist} takes \py{efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers. I assume
it is Gaussian with mean 0 and standard deviation 1.5. This
choice is mostly arbitrary. The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly. \index{Gaussian distribution}
\py{pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy, and maps each one to the fraction of test-takers at that level.
\py{MakeMixture} takes the meta-Pmf and computes the distribution of the
mixture (see Section~\ref{mixture}). \index{meta-Pmf}
\index{MakeMixture}
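As a reminder of what the mixture computes, here is a sketch with dictionaries. Because dictionaries cannot be used as keys, it takes a list of (distribution, weight) pairs instead of a meta-Pmf; the function name and all numbers are hypothetical.

```python
def make_mixture(components):
    """Mix dict-based PMFs; components is a list of (pmf, weight) pairs."""
    mix = {}
    for pmf, weight in components:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0.0) + weight * prob
    return mix

# hypothetical raw-score distributions for two levels of efficacy
low = {0: 0.5, 1: 0.4, 2: 0.1}
high = {0: 0.1, 1: 0.3, 2: 0.6}

# 70% of test-takers at the low level, 30% at the high level
mix = make_mixture([(low, 0.7), (high, 0.3)])
```

Each raw score gets the probability it has at each level of efficacy, weighted by the fraction of test-takers at that level.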
\section{Calibration}
If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}
I assume that the distribution of difficulty is uniform with
parameters \py{center} and \py{width}. \py{MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}
\begin{code}
import numpy
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{code}
By trying out a few combinations, I found that
\py{center=-0.05} and \py{width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}
So, assuming that the distribution of difficulty is uniform,
its range is approximately
\py{-1.85} to \py{1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}
The following table shows the range of \py{ProbCorrect} for
test-takers at different levels of efficacy:
\begin{tabular}{|r|r|r|r|}
\hline
& \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy & -1.85 & -0.05 & 1.75 \\
\hline
3.00 & 0.99 & 0.95 & 0.78 \\
1.50 & 0.97 & 0.82 & 0.44 \\
0.00 & 0.86 & 0.51 & 0.15 \\
-1.50 & 0.59 & 0.19 & 0.04 \\
-3.00 & 0.24 & 0.05 & 0.01 \\
\hline
\end{tabular}
Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of answering the easiest questions on
the exam correctly, and a 78\% chance of answering the hardest correctly.
On the other end of the range, someone two standard deviations below the mean
has only a 24\% chance of answering the easiest questions correctly.
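The entries in the table can be reproduced with a few lines of code. This sketch restates the logistic function with a lowercase, hypothetical name so that it runs on its own, and evaluates it at the difficulties and efficacies shown in the table.

```python
import math

def prob_correct(efficacy, difficulty, a=1):
    """Logistic curve in the difference between efficacy and difficulty."""
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))

difficulties = [-1.85, -0.05, 1.75]       # easiest, middle, hardest
efficacies = [3.0, 1.5, 0.0, -1.5, -3.0]

table = {}
for efficacy in efficacies:
    table[efficacy] = [round(prob_correct(efficacy, d), 2) for d in difficulties]
    print(efficacy, table[efficacy])
```

Each row depends only on the differences between efficacy and difficulty, which is why the units of efficacy can be chosen freely and the difficulties calibrated to match.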
\section{Posterior distribution of efficacy}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}
Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob. Here is a version of the
Sat class that uses the new model:
\begin{code}
class Sat2(thinkbayes.Suite):
    def __init__(self, exam, score):
        self.exam = exam
        self.score = score
        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)
        # update based on an exam score
        self.Update(score)
\end{code}
\verb"Update" invokes
\verb"Likelihood", which computes the likelihood of a given test score
for a hypothetical level of efficacy.
\begin{code}
def Likelihood(self, data, hypo):
    efficacy = hypo
    score = data
    raw = self.exam.Reverse(score)
    pmf = self.exam.PmfCorrect(efficacy)
    like = pmf.Prob(raw)
    return like
\end{code}
\py{pmf} is the distribution of raw scores for a test-taker
with the given efficacy; \py{like} is the probability of
the observed score.
Figure~\ref{fig.satposterior2} shows the posterior distributions
of efficacy for Alice and Bob. As expected, the location
of Alice's distribution is farther to the right, but again there
is some overlap.
Using \py{TopLevel} again, we compare $A$, the
hypothesis that Alice's efficacy is higher, and $B$, the
hypothesis that Bob's is higher. The likelihood ratio is
3.4, a bit smaller than what we got from the simple model (3.8).
So this model indicates that the data are evidence in favor
of $A$, but a little weaker than the previous estimate.
If our prior belief is that $A$ and $B$ are equally likely,
then in light of this evidence we would give $A$ a posterior
probability of 77\%, leaving a 23\% chance that Bob's efficacy
is higher.
\section{Predictive distribution}
The analysis we have done so far generates estimates for
Alice and Bob's efficacy, but since efficacy is not directly
observable, it is hard to validate the results.
\index{predictive distribution}
To give the model predictive power, we can use it to answer
a related question: ``If Alice and Bob take the math SAT
again, what is the chance that Alice will do better again?''
We'll answer this question in two steps:
\begin{itemize}
\item We'll use the posterior distribution of efficacy to
generate a predictive distribution of raw score for each test-taker.
\item We'll compare the two predictive distributions to compute
the probability that Alice gets a higher score again.
\end{itemize}
We already have most of the code we need. To compute
the predictive distributions, we can use \verb"MakeRawScoreDist" again:
\begin{code}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)
a_pred = exam.MakeRawScoreDist(a_sat)
b_pred = exam.MakeRawScoreDist(b_sat)
\end{code}
Then we can find the likelihood that Alice does better on the second
test, Bob does better, or they tie:
\begin{code}
a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{code}
The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.
Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test. The posterior odds are 3:1 that
Alice's efficacy is higher, but only 2:1 that Alice will do better on
the next exam.
\section{Discussion}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of \py{p_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}
We started this chapter with the question,
``How strong is the evidence that Alice is better prepared
than Bob?'' On the face of it, that sounds like we want to
test two hypotheses: either Alice is more prepared or Bob is.
But in order to compute likelihoods for these hypotheses, we
have to solve an estimation problem. For each test-taker
we have to find the posterior distribution of either
\verb"p_correct" or \verb"efficacy".
Values like this are called {\bf nuisance parameters} because
we don't care what they are, but we have
to estimate them to answer the question we care about.
\index{nuisance parameter}
One way to visualize the analysis we did in this chapter is
to plot the space of these parameters. \verb"thinkbayes.MakeJoint"
takes two Pmfs, computes their joint distribution, and returns
a joint pmf of each possible pair of values and its probability.
\begin{code}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{code}
This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}
Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob. The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob. To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.
In \py{TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line. For the
cells that fall on the line, we add up the total mass and split it
between $A$ and $B$.
The process we used in this chapter---estimating nuisance
parameters in order to evaluate the likelihood of competing
hypotheses---is a common Bayesian approach to problems like this.
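To make the bookkeeping concrete, here is a minimal, self-contained
sketch of the sum over the joint distribution. It uses plain
dictionaries in place of \py{thinkbayes.Joint}, it assumes that mass
on the diagonal is split evenly, and the two small distributions are
made up for illustration; they are not the posteriors from this
chapter.
\begin{code}
def MakeJointDict(pmf1, pmf2):
    """Joint distribution of two independent Pmfs, as a dict."""
    joint = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            joint[v1, v2] = p1 * p2
    return joint

def SplitMass(joint):
    """Adds up the mass on each side of the diagonal.

    Mass on the diagonal is split evenly between A and B.
    """
    a_like = b_like = 0.0
    for (v1, v2), p in joint.items():
        if v1 > v2:
            a_like += p
        elif v1 < v2:
            b_like += p
        else:
            a_like += p / 2
            b_like += p / 2
    return a_like, b_like

# made-up distributions of p_correct for two test-takers
alice = {0.7: 0.2, 0.8: 0.5, 0.9: 0.3}
bob = {0.6: 0.3, 0.7: 0.4, 0.8: 0.3}

a_like, b_like = SplitMass(MakeJointDict(alice, bob))
ratio = a_like / b_like
\end{code}
Because the two totals account for all of the mass, they add up to 1,
and their ratio is the likelihood ratio for $A$ relative to $B$.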
\chapter{Simulation}
In this chapter I describe my solution to a problem posed
by a patient with a kidney tumor. I think the problem is
important and relevant to patients with these tumors
and doctors treating them.
And I think the solution is interesting because, although it
is a Bayesian approach to the problem, the use of Bayes's theorem
is implicit. I present the solution and my code; at the end
of the chapter I will explain the Bayesian part.
If you want more technical detail than I present here, you can
read my paper on this work at \url{http://arxiv.org/abs/1203.6890}.
\section{The Kidney Tumor problem}
\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader and occasional contributor to the online statistics
forum at \url{http://reddit.com/r/statistics}. In November 2011, I read
the following message:
\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
cancer formed before I retired from the military. ... Given the
dates of retirement and detection is it possible to determine when
there was a 50/50 chance that I developed the disease? Is it
possible to determine the probability on the retirement date? My
tumor was 15.5 cm x 15 cm at detection. Grade II.''
\end{quote}
I contacted the author of the message and got more information; I learned
that veterans get different benefits if it is ``more likely than not''
that a tumor formed while they were in military service (among other
considerations).
Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated. As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times. Several papers have reported these
growth rates.
I collected data from a paper by Zhang et al\footnote{Zhang et al,
Distribution of Renal Tumor Growth Rates Determined by Using Serial
Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
137-144.}. I contacted the authors to see if I could get raw data,
but they refused on grounds of medical privacy. Nevertheless, I was
able to extract the data I needed by printing one of their graphs and
measuring it with a ruler.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}
They report growth rates in reciprocal doubling time (RDT),
which is in units of doublings per year. So a tumor with $RDT=1$
doubles in volume each year; with $RDT=2$ it quadruples in the same
time, and with $RDT=-1$, it halves. Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}
The squares are the data points from the paper; the line is a model I
fit to the data. The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}
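As a quick check on the units of RDT, the following sketch converts a
growth rate to the factor by which volume changes, and a doubling time
in days to RDT. Both function names are illustrations, not part of the
code for this chapter.
\begin{code}
def GrowthFactor(rdt, years=1.0):
    """Factor by which volume is multiplied after the given time.

    rdt: reciprocal doubling time in doublings per year
    """
    return 2.0 ** (rdt * years)

def RdtFromDoublingTime(days):
    """Converts a volume doubling time in days to RDT."""
    return 365.0 / days

# one year of growth at RDT = 1, 2, and -1
factors = [GrowthFactor(rdt) for rdt in [1, 2, -1]]
\end{code}
The three factors are 2, 4, and 0.5, matching the doubling,
quadrupling, and halving described above.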
\section{A simple model}
It is usually a good idea to start with a simple model before
trying something more challenging. Sometimes the simple model is
sufficient for the problem at hand, and if not, you can use it
to validate the more complex model.
\index{modeling}
For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied by
eight.
I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median
rate, how big would it have been at the date of discharge?''
The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times longer.
\begin{code}
# time between discharge and diagnosis, in days
interval = 3291.0
# doubling time in linear measure is doubling time in volume * 3
dt = 811.0 * 3
# number of doublings since discharge
doublings = interval / dt
# how big was the tumor at time of discharge (diameter in cm)
d1 = 15.5
d0 = d1 / 2.0 ** doublings
\end{code}
You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}. For more information
see Section~\ref{download}.
The result, \py{d0}, is about 6 cm. So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate. Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.
In addition, I computed the growth rate that would be implied
if this tumor had formed after the date of discharge. If we
assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:
\begin{code}
import math
# assume an initial linear measure of 0.1 cm
d0 = 0.1
d1 = 15.5
# how many doublings would it take to get from d0 to d1
doublings = math.log(d1 / d0, 2)
# what linear doubling time does that imply?
dt = interval / doublings
# compute the volumetric doubling time and RDT
vdt = dt / 3
rdt = 365 / vdt
\end{code}
\py{dt} is linear doubling time, so \py{vdt} is volumetric
doubling time, and \py{rdt} is reciprocal doubling
time.
The number of doublings, in linear measure, is 7.3, which implies
an RDT of 2.4. In the data from Zhang et al, only 20\% of tumors
grew this fast during a period of observation. So again,
I concluded that it is ``more likely than not'' that the tumor
formed prior to the date of discharge.
These calculations are sufficient to answer the question as
posed, and on behalf of my correspondent, I wrote a letter explaining
my conclusions to the Veterans Benefits Administration.
\index{Veterans Benefits Administration}
Later I told a friend, who is an oncologist, about my results. He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors. He suggested that the
results might be interesting to researchers and doctors.
But in order to make them useful, I wanted a more general model
of the relationship between age and size.
\section{A more general model}
Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before
any given date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}
To find it, I run simulations of tumor growth to get the
distribution of size conditioned on age. Then we can use
a Bayesian approach to get the
distribution of age conditioned on size.
\index{conditional distribution}
The simulation starts with a small tumor and runs these steps:
\begin{enumerate}
\item Choose a growth rate from the distribution of RDT.
\item Compute the size of the tumor at the end of an interval.
\item Record the size of the tumor at each interval.
\item Repeat until the tumor exceeds the maximum relevant size.
\end{enumerate}
For the initial size I chose 0.3 cm, because carcinomas smaller than
that are less likely to be invasive and less likely to have the blood
supply needed for rapid growth (see
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
\index{carcinoma}
I chose an interval of 245 days (about 8 months) because that is the
median time between measurements in the data source.
For the maximum size I chose 20 cm. In the data source, the range of
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond
the observed range at each end, but not by far, and not in a way
likely to have a strong effect on the results.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
\caption{Simulations of tumor growth, size vs. time.}
\label{fig.kidney4}
\end{figure}
The simulation is based on one big simplification:
the growth rate is chosen independently during each interval,
so it does not depend on age, size, or growth rate during
previous intervals.
\index{independence}
In Section~\ref{serial} I review these assumptions and
consider more detailed models. But first let's look at some
examples.
Figure~\ref{fig.kidney4} shows
the size of simulated tumors as a function of
age. The dashed line at 10 cm shows the range of ages for tumors at
that size: the fastest-growing tumor gets there in 8 years; the
slowest takes more than 35.
I am presenting results in terms of linear measurements, but the
calculations are in terms of volume. To convert from one to the
other, again, I use the volume of a sphere with the given
diameter.
\index{volume}
\index{sphere}
\section{Implementation}
Here is the kernel of the simulation:
\index{simulation}
\begin{code}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0
    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break
    return seq
\end{code}
\verb"rdt_seq" is an iterator that yields
random values from the CDF of growth rate.
\py{v0} is the initial volume in mL. \py{interval} is the time step
in years. \py{vmax} is the final volume corresponding to a linear
measurement of 20 cm.
\index{iterator}
\py{Volume} converts from linear measurement in cm to volume
in mL, based on the simplification that the tumor is a sphere:
\begin{code}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{code}
\py{ExtendSequence} computes the volume of the tumor at the
end of the interval.
\begin{code}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)
    return final, new_seq
\end{code}
\py{age} is the age of the tumor at the end of the interval.
\py{seq} is a tuple that contains the volumes so far. \py{rdt} is
the growth rate during the interval, in doublings per year.
\py{interval} is the size of the time step in years.
The return values are \py{final}, the volume of the
tumor at the end of the interval, and \verb"new_seq", a new
tuple containing the volumes in \py{seq} plus the new volume
\py{final}.
\py{Cache.Add} records the age and size of each tumor at the end
of each interval, as explained in the next section.
\index{cache}
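The pieces above can be assembled into a version that runs on its
own. The sketch below inlines the growth step, omits the cache, and
draws growth rates from a made-up exponential distribution
(\py{StandInRdts}), which is only a placeholder for the model fit to
the data from Zhang et al.
\begin{code}
import math
import random

def Volume(diameter):
    """Volume in mL of a sphere with the given diameter in cm."""
    return 4 * math.pi / 3 * (diameter / 2.0) ** 3

def SimulateTumor(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    """Returns a list of (age, volume) pairs for one simulated tumor."""
    age, volume = 0.0, v0
    history = [(age, volume)]
    for rdt in rdt_seq:
        age += interval
        volume *= 2 ** (rdt * interval)
        history.append((age, volume))
        if volume > vmax:
            break
    return history

def StandInRdts(rng, mean=0.5):
    """Yields made-up growth rates; NOT the model fit to the data."""
    while True:
        yield rng.expovariate(1.0 / mean)

rng = random.Random(17)
history = SimulateTumor(StandInRdts(rng))
final_age, final_volume = history[-1]
\end{code}
Because the generator never ends, the loop stops only when the volume
exceeds \py{vmax}; with a real distribution of RDT, which includes
negative values, some simulated tumors shrink during an interval.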
\section{Caching the joint distribution}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
\caption{Joint distribution of age and tumor size.}
\label{fig.kidney8}
\end{figure}
Here's how the cache works.
\begin{code}
class Cache(object):
    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{code}
\py{joint} is a joint Pmf that records the
frequency of each age-size pair, so it approximates the
joint distribution of age and size.
\index{joint distribution}
At the end of each simulated interval, \py{ExtendSequence} calls
\py{Add}:
\begin{code}
# class Cache
    # rdt is accepted to match the call in ExtendSequence;
    # it is not used here
    def Add(self, age, seq, rdt):
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{code}
Again, \py{age} is the age of the tumor, and \py{seq} is the
sequence of volumes so far.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
\caption{Distributions of age, conditioned on size.}
\label{fig.kidney6}
\end{figure}
Before adding the new data to the joint distribution, we use {\tt
Diameter} to convert from volume to diameter in centimeters:
\begin{code}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{code}
And
\py{CmToBucket} to convert from centimeters to a discrete bucket
number:
\begin{code}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{code}
The buckets are equally spaced on a log scale. Using \py{factor=10}
yields a reasonable number of buckets; for example,
1 cm maps to bucket 0 and 10 cm maps to bucket 23.
\index{log scale}
\index{bucket}
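The mapping is easy to check and to invert. In the sketch below,
\py{BucketToCm} is an illustrative inverse of \py{CmToBucket}; it is
included here as an assumption about how one would map a bucket back
to a size, not as a quotation from the code for this chapter.
\begin{code}
import math

def CmToBucket(x, factor=10):
    """Maps a size in cm to a (fractional) bucket on a log scale."""
    return factor * math.log(x)

def BucketToCm(y, factor=10):
    """Inverse of CmToBucket: the size in cm at a given bucket."""
    return math.exp(y / float(factor))

# bucket numbers for 1 cm, 10 cm, and the 15.5 cm tumor
buckets = [round(CmToBucket(cm)) for cm in [1.0, 10.0, 15.5]]
\end{code}
Rounding happens after the conversion, so each integer bucket covers a
range of sizes whose width grows in proportion to the size.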
After running the simulations, we can plot the joint distribution
as a pseudocolor plot, where each cell represents the number of
tumors observed at a given size-age pair.
Figure~\ref{fig.kidney8} shows the joint distribution after 1000
simulations.
\index{pseudocolor plot}
\section{Conditional distributions}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
\caption{Percentiles of tumor age as a function of size.}
\label{fig.kidney7}
\end{figure}
By taking a vertical slice from the joint distribution, we can get the
distribution of sizes for any given age. By taking a horizontal
slice, we can get the distribution of ages conditioned on size.
\index{conditional distribution}
Here's the code that reads the joint distribution and builds
the conditional distribution for a given size.
\index{joint distribution}
\begin{code}
# class Cache
    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{code}
\verb"bucket" is the integer bucket number corresponding to
tumor size. \py{Joint.Conditional} computes the
PMF of age conditioned on \py{bucket}.
The result is the CDF of age conditioned on \py{bucket}.
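To show what taking a slice means, here is a minimal sketch of a
conditional distribution computed from a joint distribution stored in
a plain dictionary. It is a stand-in for \py{Joint.Conditional}, not
its actual implementation, and the counts are made up.
\begin{code}
def Conditional(joint, i, j, val):
    """Distribution of element i, conditioned on element j == val.

    joint: dict that maps from tuples to probabilities or counts;
    assumes at least one tuple has element j equal to val
    """
    cond = {}
    for pair, p in joint.items():
        if pair[j] == val:
            cond[pair[i]] = cond.get(pair[i], 0) + p
    total = sum(cond.values())
    return dict((v, p / float(total)) for v, p in cond.items())

# made-up counts of (age, bucket) pairs
joint = {(5, 20): 1, (10, 20): 3, (10, 23): 2, (15, 23): 6}

# distribution of age for tumors in bucket 23
ages = Conditional(joint, 0, 1, 23)
\end{code}
Only the pairs in bucket 23 contribute, and dividing by their total
turns the counts into probabilities.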
Figure~\ref{fig.kidney6} shows several of these CDFs, for
a range of sizes. To summarize these distributions, we can
compute percentiles as a function of size.
\index{percentile}
\begin{code}
percentiles = [95, 75, 50, 25, 5]
for bucket in cache.GetBuckets():
    cdf = cache.ConditionalCdf(bucket)
    ps = [cdf.Percentile(p) for p in percentiles]
\end{code}
Figure~\ref{fig.kidney7} shows these percentiles for each
size bucket. The data points are computed from the estimated
joint distribution. In the model, size and time are discrete,
which contributes numerical errors, so I also show a least
squares fit for each sequence of percentiles.
\index{least squares fit}
\section{Serial Correlation}
\label{serial}
The results so far are based on a number of modeling decisions;
let's review them and consider which ones are the most
likely sources of error:
\index{modeling error}
\begin{itemize}
\item To convert from linear measure to volume, we assume that
tumors are approximately spherical. This assumption is probably
fine for tumors up to a few centimeters, but not for very
large tumors.
\index{sphere}
\item The distribution of growth rates in the simulations is based on
a continuous model we chose to fit the data reported by Zhang et al,
which is based on 53 patients. The fit is only approximate and, more
importantly, a larger sample would yield a
different distribution.
\index{growth rate}
\item The growth model does not take into account tumor subtype or
grade; this assumption is consistent with the conclusion of Zhang et al:
``Growth rates in renal tumors of different sizes, subtypes and
grades represent a wide range and overlap substantially.''
But with a larger sample, a difference might become apparent.
\index{tumor type}
\item The distribution of growth rate does not depend on the size of
the tumor. This assumption would not be realistic for very
small and very large tumors, whose growth is limited by blood supply.
But tumors observed by Zhang et al ranged from 1 to 12 cm, and they
found no statistically significant relationship between
size and growth rate. So if there is a relationship, it is
likely to be weak, at least in this size range.
\item In the simulations, growth rate during each interval is
independent of previous growth rates. In reality it is plausible
that tumors that have grown quickly in the past are more likely
to grow quickly. In other words, there is probably
a serial correlation in growth rate.
\index{serial correlation}
\end{itemize}
Of these, the first and last seem the most problematic. I'll
investigate serial correlation first, then come back to
spherical geometry.
To simulate correlated growth, I wrote a generator\footnote{If you are
not familiar with Python generators, see
\url{http://wiki.python.org/moin/Generators}.} that yields a
correlated series from a given Cdf. Here's how the algorithm works:
\index{generator}
\begin{enumerate}
\item Generate correlated values from a Gaussian distribution.
This is easy to do because we can compute the distribution
of the next value conditioned on the previous value.
\index{Gaussian distribution}
\item Transform each value to its cumulative probability using
the Gaussian CDF.
\index{cumulative probability}
\item Transform each cumulative probability to the corresponding value
using the given Cdf.
\end{enumerate}
Here's what that looks like in code:
\begin{code}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)
    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{code}
\py{cdf} is the desired Cdf; \py{rho} is the desired correlation.
The values of \py{x} are Gaussian; \py{Transform} converts them
to the desired distribution.
The first value of \py{x} is Gaussian with mean 0 and standard
deviation 1. For subsequent values, the mean and standard deviation
depend on the previous value. Given the previous \py{x}, the mean of the
next value is \py{x * rho}, and the variance is \py{1 - rho**2}.
\index{correlated random value}
\py{Transform} maps from each
Gaussian value, \py{x}, to a value from the given Cdf, \py{y}.
\begin{code}
def Transform(x):
    p = thinkbayes.GaussianCdf(x)
    y = cdf.Value(p)
    return y
\end{code}
\py{GaussianCdf} computes the CDF of the standard Gaussian
distribution at \py{x}, returning a cumulative probability.
\py{Cdf.Value} maps from a cumulative probability to the
corresponding value in \py{cdf}.
Depending on the shape of \py{cdf}, information can
be lost in transformation, so the actual correlation might be
lower than \py{rho}. For example, when I generate
10000 values from the distribution of growth rates with
\py{rho=0.4}, the actual correlation is 0.37.
But since we are guessing at the right correlation anyway,
that's close enough.
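The loss of correlation can be explored with a self-contained sketch
of the same algorithm. It replaces \py{thinkbayes.GaussianCdf} with
\py{math.erf} and uses the inverse CDF of an exponential distribution
as a stand-in for \py{cdf.Value}, so the target distribution here is
an assumption and not the distribution of growth rates.
\begin{code}
import math
import random

def GaussianCdf(x):
    """CDF of the standard Gaussian distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ExponentialValue(p, mean=1.0):
    """Inverse CDF of an exponential distribution (stand-in target)."""
    p = min(p, 1 - 1e-12)      # guard against log(0)
    return -mean * math.log(1 - p)

def CorrelatedValues(rho, n, rng):
    """Returns n correlated Gaussian values and their transforms."""
    xs, ys = [], []
    x = rng.gauss(0, 1)
    sigma = math.sqrt(1 - rho**2)
    for _ in range(n):
        xs.append(x)
        ys.append(ExponentialValue(GaussianCdf(x)))
        x = rng.gauss(x * rho, sigma)
    return xs, ys

def SerialCorr(seq):
    """Pearson correlation between successive values."""
    a, b = seq[:-1], seq[1:]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma)**2 for u in a)
    var_b = sum((v - mb)**2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(1)
xs, ys = CorrelatedValues(0.4, 10000, rng)
corr_x = SerialCorr(xs)    # correlation of the Gaussian values
corr_y = SerialCorr(ys)    # correlation after the transform
\end{code}
Comparing \verb"corr_x" and \verb"corr_y" shows how much correlation
survives the transform for a particular target distribution.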
Remember that \py{MakeSequence} takes an iterator as an argument.
That interface allows it to work with different generators:
\index{generator}
\begin{code}
iterator = UncorrelatedGenerator(cdf)
seq1 = MakeSequence(iterator)
iterator = CorrelatedGenerator(cdf, rho)
seq2 = MakeSequence(iterator)
\end{code}
In this example, \py{seq1} and \py{seq2} are
drawn from the same distribution, but the values in \py{seq1}
are uncorrelated and the values in \py{seq2} are correlated
with a coefficient of approximately \py{rho}.
\index{serial correlation}
Now we can see what effect serial correlation has on the results;
the following table shows percentiles of age for a 6 cm tumor,
using the uncorrelated generator and a correlated generator
with target $\rho = 0.4$.
\index{percentile}
\begin{table}
\input{tables/kidney_table2}
\caption{Percentiles of tumor age conditioned on size.}
\end{table}
Correlation makes the fastest growing tumors faster and the slowest
slower, so the range of ages is wider. The difference is modest for
low percentiles, but for the 95th percentile it is more than 6 years.
To compute these percentiles precisely, we would need a better
estimate of the actual serial correlation.
However, this model is sufficient to answer the question
we started with: given a tumor with a linear dimension of
15.5 cm, what is the probability that it formed more than
8 years ago?
Here's the code:
\begin{code}
# class Cache
    def ProbOlder(self, cm, age):
        # round to match the integer buckets used in Add
        bucket = round(CmToBucket(cm))
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{code}
\py{cm} is the size of the tumor; \py{age} is the age threshold
in years. \py{ProbOlder} converts size to a bucket number,
gets the Cdf of age conditioned on bucket, and computes the
probability that age exceeds the given value.
With no serial correlation, the probability that a
15.5 cm tumor is older than 8 years is 0.999, or almost certain.
With correlation 0.4, faster-growing tumors are more likely, but
the probability is still 0.995. Even with correlation 0.8, the
probability is 0.978.
Another likely source of error is the assumption that tumors are
approximately spherical. For a tumor with linear dimensions 15.5 x 15
cm, this assumption is probably not valid. If, as seems likely, a
tumor this size
is relatively flat, it might have the same volume as a 6 cm sphere.
With this smaller volume and correlation 0.8, the probability of age
greater than 8 is still 95\%.
So even taking into account modeling errors, it is unlikely that such
a large tumor could have formed less than 8 years prior to the date of
diagnosis.
\index{modeling error}
\section{Discussion}
Well, we got through a whole chapter without using Bayes's theorem or
the \py{Suite} class that encapsulates Bayesian updates. What
happened?
One way to think about Bayes's theorem is as an algorithm for
inverting conditional probabilities. Given \p{B|A}, we can compute
\p{A|B}, provided we know \p{A} and \p{B}. Of course this algorithm
is only useful if, for some reason, it is easier to compute \p{B|A}
than \p{A|B}.
In this example, it is. By running simulations, we can estimate the
distribution of size conditioned on age, or \p{size|age}. But it is
harder to get the distribution of age conditioned on size, or
\p{age|size}. So this seems like a perfect opportunity to use Bayes's
theorem.
The reason I didn't is computational efficiency. To estimate
\p{size|age} for any given size, you have to run a lot of simulations.
Along the way, you end up computing \p{size|age} for a lot of sizes.
In fact, you end up computing the entire joint distribution of size
and age, \p{size, age}.
\index{joint distribution}
And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in \py{ConditionalCdf}.
\index{conditional distribution}
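A small numerical check makes the equivalence explicit. The sketch
below uses a made-up joint distribution of age and size, stored in a
plain dictionary, and computes the same conditional probability two
ways: by Bayes's theorem and by slicing the joint distribution.
\begin{code}
def Marginal(joint, i):
    """Marginal distribution of element i of each pair."""
    marginal = {}
    for pair, p in joint.items():
        marginal[pair[i]] = marginal.get(pair[i], 0) + p
    return marginal

# made-up joint distribution of (age, size)
joint = {(5, 'small'): 0.3, (5, 'large'): 0.1,
         (10, 'small'): 0.2, (10, 'large'): 0.4}

p_age = Marginal(joint, 0)
p_size = Marginal(joint, 1)

age, size = 10, 'large'

# Bayes's theorem: p(age|size) = p(age) p(size|age) / p(size)
p_size_given_age = joint[age, size] / p_age[age]
by_bayes = p_age[age] * p_size_given_age / p_size[size]

# slice of the joint distribution, normalized
by_slice = joint[age, size] / p_size[size]
\end{code}
The two results agree because the product of the prior and the
likelihood is the corresponding entry of the joint distribution.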
So we side-stepped Bayes, but he was with us in spirit.
\chapter{A Hierarchical Model}
\label{hierarchical}
\section{The Geiger counter problem}
I got the idea for the following problem from Tom Campbell-Ricketts,
author of the Maximum Entropy blog at
\url{http://maximum-entropy-blog.blogspot.com}. And he got the idea
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
Logic of Science}:
\index{Jaynes, E.~T.}
\index{Campbell-Ricketts, Tom}
\index{Geiger counter problem}
\begin{quote}
Suppose that a radioactive source emits particles toward
a Geiger counter at an average rate of $r$ particles per second,
but the counter only registers a fraction, $f$, of the particles
that hit it. If $f$ is 10\% and
the counter registers 15 particles in a one second
interval, what is the posterior distribution of $n$, the actual
number of particles that hit the counter, and $r$, the average
rate particles are emitted?
\end{quote}
To get started on a problem like this, think about the chain of
causation that starts with the parameters of the system and ends
with the observed data:
\index{causation}
\begin{enumerate}
\item The source emits particles at an average rate, $r$.
\item During any given second, the source emits $n$ particles
toward the counter.
\item Out of those $n$ particles, some number, $k$, get counted.
\end{enumerate}
The probability that an atom decays is the same at any point in time,
so radioactive decay is well modeled by a Poisson process. Given $r$,
the distribution of $n$ is a Poisson distribution with parameter $r$.
\index{radioactive decay}
\index{Poisson process}
And if we assume that the probability of detection for each particle
is independent of the others, the distribution of $k$ is the binomial
distribution with parameters $n$ and $f$.
\index{binomial distribution}
Given the parameters of the system, we can find the distribution of
the data. So we can solve what is called the {\bf forward problem}.
\index{forward problem}
Now we want to go the other way: given the data, we
want the distribution of the parameters. This is called
the {\bf inverse problem}. And if you can solve the forward
problem, you can use Bayesian methods to solve the inverse problem.
\index{inverse problem}
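The forward problem can also be solved by simulation. The following
sketch follows the chain of causation from $r$ to $n$ to $k$ using
only the standard library; the function names are illustrations, and
the value $r = 150$ is chosen only as an example.
\begin{code}
import random

def SamplePoisson(lam, rng):
    """Draws from a Poisson distribution with parameter lam.

    Counts the arrivals of a Poisson process in one unit of time.
    """
    n = 0
    t = rng.expovariate(lam)
    while t < 1.0:
        n += 1
        t += rng.expovariate(lam)
    return n

def SimulateCount(r, f, rng):
    """Runs the forward model once and returns n and k."""
    n = SamplePoisson(r, rng)
    # each particle is detected independently with probability f
    k = sum(1 for _ in range(n) if rng.random() < f)
    return n, k

rng = random.Random(42)
pairs = [SimulateCount(150, 0.1, rng) for _ in range(2000)]
mean_n = sum(n for n, k in pairs) / float(len(pairs))
mean_k = sum(k for n, k in pairs) / float(len(pairs))
\end{code}
Averaged over many simulated seconds, $n$ is close to $r$ and $k$ is
close to $f r$.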
\section{Start simple}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
\caption{Posterior distribution of $n$ for three values of $r$.}
\label{fig.jaynes1}
\end{figure}
Let's start with a simple version of the problem where we know
the value of $r$. We are given the value of $f$, so all we
have to do is estimate $n$.
I define a Suite called \py{Detector} that models the behavior
of the detector and estimates $n$.
\begin{code}
class Detector(thinkbayes.Suite):
    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{code}
If the average emission rate is $r$ particles per second, the
distribution of $n$ is Poisson with parameter $r$.
\py{high} and \py{step} determine the upper bound for $n$
and the step size between hypothetical values.
\index{Poisson distribution}
Now we need a likelihood function:
\index{likelihood}
\begin{code}
# class Detector
    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f
        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{code}
\py{data} is the number of particles detected, and \py{hypo} is
the hypothetical number of particles emitted, $n$.
If there are actually $n$ particles, and the probability of detecting
any one of them is $f$, the probability of detecting $k$ particles is
given by the binomial distribution.
\index{binomial distribution}
That's it for the Detector. We can try it out for a range
of values of $r$:
\begin{code}
f = 0.1
k = 15
for r in [100, 250, 400]:
    suite = Detector(r, f, step=1)
    suite.Update(k)
    print(suite.MaximumLikelihood())
\end{code}
Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
several given values of $r$.
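For readers without \py{thinkbayes} at hand, the same update can be
sketched with dictionaries. The helper functions \py{PoissonPmf},
\py{BinomialPmf}, and \py{PosteriorOfN} are illustrative stand-ins
for the library calls used above, not their actual implementations.
\begin{code}
import math

def PoissonPmf(n, lam):
    """Poisson PMF, computed in log space to avoid overflow."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def BinomialPmf(k, n, p):
    """Binomial PMF; zero when k > n."""
    if k > n:
        return 0.0
    log_coef = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1))
    return math.exp(log_coef + k * math.log(p)
                    + (n - k) * math.log(1 - p))

def PosteriorOfN(r, f, k, high=500):
    """Posterior distribution of n, as a dict, for known r and f."""
    post = {}
    for n in range(high + 1):
        # prior (Poisson) times likelihood (binomial)
        post[n] = PoissonPmf(n, r) * BinomialPmf(k, n, f)
    total = sum(post.values())
    for n in post:
        post[n] /= total
    return post

post = PosteriorOfN(r=100, f=0.1, k=15)
mean_n = sum(n * p for n, p in post.items())
\end{code}
For $r = 100$ the posterior mean is $k + r (1 - f) = 105$: the 15
detected particles plus the expected number of undetected ones.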
\section{Make it hierarchical}
In the previous section, we assume $r$ is known. Now let's
relax that assumption. I define another Suite, called \py{Emitter},
that models the behavior of the emitter and estimates $r$:
\begin{code}
class Emitter(thinkbayes.Suite):
    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{code}
\py{rs} is a sequence of hypothetical values for $r$. \py{detectors}
is a sequence of Detector objects, one for each value of $r$. The
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
that is, a Suite that contains other Suites as values.
\index{meta-Suite}
To update the Emitter, we have to compute the likelihood of the data
under each hypothetical value of $r$. But each value of $r$ is
represented by a Detector that contains a range of values for $n$.
To compute the likelihood of the data for a given Detector, we loop
through the values of $n$ and add up the total probability of $k$.
That's what \py{SuiteLikelihood} does:
\begin{code}
# class Detector
    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{code}
Now we can write the Likelihood function for the Emitter:
\begin{code}
# class Emitter
    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{code}
Each \py{hypo} is a Detector, so we can invoke
\py{SuiteLikelihood} to get the likelihood of the data under
the hypothesis.
After we update the Emitter, we have to update each of the
Detectors, too.
\begin{code}
# class Emitter
    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for detector in self.Values():
            detector.Update(data)
\end{code}
A model like this, with multiple levels of Suites, is called {\bf
hierarchical}. \index{hierarchical model}
\section{A little optimization}
You might recognize \py{SuiteLikelihood}; we saw it
in Section~\ref{suitelike}. At the time, I pointed out that
we didn't really need it, because the total probability
computed by \py{SuiteLikelihood} is exactly the normalizing
constant computed and returned by \py{Update}.
\index{normalizing constant}
So instead of updating the Emitter and then updating the
Detectors, we can do both steps at the same time, using
the result from \py{Detector.Update} as the likelihood
of Emitter.
Here's the streamlined version of \py{Emitter.Likelihood}:
\begin{code}
# class Emitter
    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{code}
And with this version of \py{Likelihood} we can use the
default version of \py{Update}. So this version has fewer
lines of code, and it runs faster because it does not compute
the normalizing constant twice.
\index{optimization}
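The same idea can be sketched without \py{thinkbayes}: updating each
detector yields a normalizing constant, and those constants are the
likelihoods for the values of $r$. In the sketch below the function
names, the grid of values in \py{rs}, and the uniform prior over that
grid are assumptions made for illustration.
\begin{code}
import math

def PoissonPmf(n, lam):
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def BinomialPmf(k, n, p):
    if k > n:
        return 0.0
    log_coef = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1))
    return math.exp(log_coef + k * math.log(p)
                    + (n - k) * math.log(1 - p))

def UpdateDetector(r, f, k, high=500):
    """Returns the posterior of n and the normalizing constant."""
    post = dict((n, PoissonPmf(n, r) * BinomialPmf(k, n, f))
                for n in range(high + 1))
    total = sum(post.values())
    for n in post:
        post[n] /= total
    return post, total

def UpdateEmitter(rs, f, k):
    """Posterior of r, starting from a uniform prior over rs."""
    # the normalizing constant of each detector is its likelihood
    likes = dict((r, UpdateDetector(r, f, k)[1]) for r in rs)
    total = sum(likes.values())
    return dict((r, like / total) for r, like in likes.items())

rs = range(10, 401, 10)      # hypothetical grid of values for r
post_r = UpdateEmitter(rs, f=0.1, k=15)
best_r = max(post_r, key=post_r.get)
\end{code}
On this grid the most probable value of $r$ is 150, which is $k / f$.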
\section{Extracting the posteriors}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
\caption{Posterior distributions of $n$ and $r$.}
\label{fig.jaynes2}
\end{figure}
After we update the Emitter, we can get the posterior distribution
of $r$ by looping through the Detectors and their probabilities:
\begin{code}
# class Emitter
    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{code}
\py{items} is a list of values of $r$ and their probabilities.
The result is the Pmf of $r$.
To get the posterior distribution of $n$, we have to compute
the mixture of the Detectors. We can use
\py{thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
from each distribution to its probability. And that's exactly
what the Emitter is:
\begin{code}
# class Emitter
    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{code}
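As a rough illustration of what a mixture computes, the following
sketch uses plain dictionaries in place of Pmf objects. It is not the
code from \py{thinkbayes.py}; the function \verb"make_mixture" and
the two small distributions are hypothetical, and the sketch assumes
that a mixture is the weighted sum of the component distributions.
It is written for Python 3.

```python
def make_mixture(metapmf):
    """Collapse a meta-Pmf into a single distribution.

    metapmf: list of (pmf, weight) pairs, where each pmf is a dict
    mapping value -> probability and the weights sum to 1.
    """
    mix = {}
    for pmf, weight in metapmf:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0.0) + weight * prob
    return mix


# two hypothetical conditional distributions of n
pmf_a = {1: 0.5, 2: 0.5}
pmf_b = {2: 0.5, 3: 0.5}

mix = make_mixture([(pmf_a, 0.25), (pmf_b, 0.75)])
```

The value 2 appears in both components, so its probability in the
mixture is $0.25 \cdot 0.5 + 0.75 \cdot 0.5 = 0.5$.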
Figure~\ref{fig.jaynes2} shows the results. Not surprisingly, the
most likely value for $n$ is 150. Given $f$ and $n$, the expected
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
$k / f$, which is 150.
And if 150 particles are emitted in one second, the most likely value
of $r$ is 150 particles per second. So the posterior distribution of
$r$ is also centered on 150.
The posterior distributions of $r$ and $n$ are similar;
the only difference is that we are slightly less certain about $n$.
In general, we can be more certain about the long-range emission rate,
$r$, than about the number of particles emitted in any particular second,
$n$.
You can download the code in this chapter from
\url{http://thinkbayes.com/jaynes.py}. For more information see
Section~\ref{download}.
\section{Discussion}
The Geiger counter problem demonstrates the connection between
causation and hierarchical modeling. In the example, the
emission rate $r$ has a causal effect on the number of particles,
$n$, which has a causal effect on the particle count, $k$.
\index{Geiger counter problem}
\index{causation}
The hierarchical model reflects the structure of the
system, with causes at the top and effects at the bottom.
\index{hierarchical model}
\begin{enumerate}
\item At the top level, we start with a range of hypothetical
values for $r$.
\item For each value of $r$, we have a range of values for $n$,
and the prior distribution of $n$ depends on $r$.
\item When we update the model, we go bottom-up. We compute
a posterior distribution of $n$ for each value of $r$, then
compute the posterior distribution of $r$.
\end{enumerate}
So causal information flows down the hierarchy, and inference flows
up.
\section{Exercises}
\begin{exercise}
This exercise is also inspired by an example in Jaynes, {\em
Probability Theory}.
Suppose you buy a mosquito trap that is supposed to reduce the
population of mosquitoes near your house. Each
week, you empty the trap and count the number of mosquitoes
captured. After the first week, you count 30 mosquitoes.
After the second week, you count 20 mosquitoes. Estimate the
percentage change in the number of mosquitoes in your yard.
To answer this question, you have to make some modeling
decisions. Here are some suggestions:
\begin{itemize}
\item Suppose that each week a large number of mosquitoes, $N$, is bred
in a wetland near your home.
\item During the week, some fraction of
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
are caught in the trap.
\item Your solution should take into account your prior belief
about how much $N$ is likely to change from one week to the next.
You can do that by adding a level to the hierarchy to
model the percent change in $N$.
\end{itemize}
\end{exercise}
\chapter{Dealing with Dimensions}
\label{species}
\section{Belly button bacteria}
Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and other parts of the body.
\index{biodiversity}
\index{belly button}
\index{bacteria}
\index{microbiome}
In their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from. Each identified fragment is called a ``read.''
\index{navel}
\index{rDNA}
\index{pyrosequencing}
We can use these data to answer several related questions:
\begin{itemize}
\item Based on the number of species observed, can we estimate
the total number of species in the environment?
\index{species}
\item Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?
\index{prevalence}
\item If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?
\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?
\end{itemize}
These questions make up what is called the {\bf Unseen Species problem}.
\index{Unseen Species problem}
\section{Lions and tigers and bears}
I'll start with a simplified version of the problem where we know that
there are exactly three species. Let's call them lions, tigers and
bears. Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.
\index{lions and tigers and bears}
If we have an equal chance of observing any animal in the preserve,
the number of each species we see is governed by the multinomial
distribution. If the prevalence of lions and tigers and bears is
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to
\index{multinomial distribution}
\begin{code}
p_lion**3 * p_tiger**2 * p_bear**1
\end{code}
An approach that is tempting, but not correct, is to use beta
distributions, as in Section~\ref{beta}, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
distribution of \verb"p_lion" is:
\index{beta distribution}
\begin{code}
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
\end{code}
The maximum likelihood estimate for \verb"p_lion" is the observed
rate, 50\%. Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
are 33\% and 17\%.
\index{maximum likelihood}
But there are two problems:
\begin{enumerate}
\item We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct. The right prior should have a mean of 1/3,
and there should be zero likelihood that any species has a
prevalence of 100\%.
\item The distributions for each species are not independent, because
the prevalences have to add up to 1. To capture this dependence, we
need a joint distribution for the three prevalences.
\index{independence}
\index{joint distribution}
\end{enumerate}
We can use a Dirichlet distribution to solve both of these problems
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}). In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of \verb"p_lion",
\verb"p_tiger" and \verb"p_bear".
\index{beta distribution}
\index{Dirichlet distribution}
The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution. Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.
If there are \py{n} outcomes, the Dirichlet distribution is
described by \py{n} parameters, written $\alpha_1$ through $\alpha_n$.
Here's the definition, from \py{thinkbayes.py}, of a class that
represents a Dirichlet distribution:
\index{numpy}
\begin{code}
class Dirichlet(object):
    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
\end{code}
\py{n} is the number of dimensions; initially the parameters
are all 1. I use a \py{numpy} array to store the parameters
so I can take advantage of array operations.
Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:
\begin{code}
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
\end{code}
\py{i} is the index of the marginal distribution we want.
\py{alpha0} is the sum of the parameters; \py{alpha} is the
parameter for the given species.
\index{marginal distribution}
In the example, the prior marginal distribution for each species
is \py{Beta(1, 2)}. We can compute the prior means like
this:
\begin{code}
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
\end{code}
As expected, the prior mean prevalence for each species is 1/3.
To update the Dirichlet distribution, we add the
observations to the parameters like this:
\begin{code}
    def Update(self, data):
        m = len(data)
        self.params[:m] += data
\end{code}
Here \py{data} is a sequence of counts in the same order as {\tt
params}, so in this example, it should be the number of lions,
tigers and bears.
\py{data} can be shorter than \py{params}; in that
case there are some species that have not been
observed.
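As a sketch of this behavior, the following standalone function adds
the counts to the leading parameters and leaves the rest at their
prior values. It is hypothetical, not part of \py{thinkbayes.py}; it
uses plain lists rather than \py{numpy} arrays and is written for
Python 3.

```python
def dirichlet_update(params, data):
    """Return new parameters after adding counts to the first len(data)."""
    m = len(data)
    if m > len(params):
        raise ValueError('more observed species than parameters')
    updated = [alpha + count for alpha, count in zip(params, data)]
    # parameters for species that were not observed are unchanged
    return updated + params[m:]


# five hypothetical species, only three of them observed
posterior = dirichlet_update([1, 1, 1, 1, 1], [3, 2, 1])
```

With five hypothetical species and three observed, the result is
\py{[4, 3, 2, 1, 1]}.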
Here's code that updates \py{dirichlet} with the observed data and
computes the posterior marginal distributions.
\begin{code}
data = [3, 2, 1]
dirichlet.Update(data)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print i, pmf.Mean()
\end{code}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
\caption{Distribution of prevalences for three species.}
\label{fig.species1}
\end{figure}
Figure~\ref{fig.species1} shows the results. The posterior
mean prevalences are 44\%, 33\%, and 22\%.
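These values can be checked by hand. With all prior parameters equal
to 1, adding the counts $(3, 2, 1)$ gives posterior parameters
$(\alpha_1, \alpha_2, \alpha_3) = (4, 3, 2)$, whose sum is
$\alpha_0 = 9$. The marginal distribution for species $i$ is a beta
distribution with parameters $\alpha_i$ and $\alpha_0 - \alpha_i$,
and its mean is $\alpha_i / \alpha_0$:
\[ \frac{4}{9} \approx 0.44, \quad \frac{3}{9} \approx 0.33, \quad
\frac{2}{9} \approx 0.22 \]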
\section{The hierarchical version}
We have solved a simplified version of the problem: if we
know how many species there are, we can estimate the prevalence
of each.
\index{prevalence}
Now let's get back to the original problem, estimating the total
number of species. To solve this problem I'll define a meta-Suite,
which is a Suite that contains other Suites as hypotheses. In this
case, the top-level Suite contains hypotheses about the number of
species; the bottom level contains hypotheses about prevalences.
\index{hierarchical model}
\index{meta-Suite}
Here's the class definition:
\begin{code}
class Species(thinkbayes.Suite):
    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
\end{code}
\verb"__init__" takes a list of possible values for \py{n} and
makes a list of Dirichlet objects.
Here's the code that creates the top-level suite:
\begin{code}
ns = range(3, 30)
suite = Species(ns)
\end{code}
\py{ns} is the list of possible values for \py{n}. We have seen 3
species, so there have to be at least that many. I chose an upper
bound that seems reasonable, but we will check later that the
probability of exceeding this bound is low. And at least initially
we assume that any value in this range is equally likely.
To update a hierarchical model, you have to update all levels.
Usually you have to update the bottom
level first and work up, but in this case we can
update the top level first:
\begin{code}
# class Species
    def Update(self, data):
        thinkbayes.Suite.Update(self, data)
        for hypo in self.Values():
            hypo.Update(data)
\end{code}
\py{Species.Update} invokes \py{Update} in the parent class,
then loops through the sub-hypotheses and updates them.
Now all we need is a likelihood function:
\begin{code}
# class Species
    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(1000):
            like += dirichlet.Likelihood(data)
        return like
\end{code}
\py{data} is a sequence of
observed counts; \py{hypo} is a Dirichlet object.
\py{Species.Likelihood} calls
\py{Dirichlet.Likelihood} 1000 times and returns the total.
Why call it 1000 times? Because {\tt
Dirichlet.Likelihood} doesn't actually compute the likelihood of the
data under the whole Dirichlet distribution. Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences.
Here's what it looks like:
\begin{code}
# class Dirichlet
    def Likelihood(self, data):
        m = len(data)
        if self.n < m:
            return 0
        x = data
        p = self.Random()
        q = p[:m]**x
        return q.prod()
\end{code}
The length of \py{data} is the number of species observed. If
we see more species than we thought existed, the likelihood is 0.
\index{multinomial distribution}
Otherwise we select a random set of prevalences, \py{p}, and
compute the multinomial PMF, which is
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
$p_i$ is the prevalence of the $i$th species, and $x_i$ is the
observed number. The first term, $c_x$, is the multinomial
coefficient; I leave it out of the computation because it is
a multiplicative factor that depends only
on the data, not the hypothesis, so it gets normalized away
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
\index{multinomial coefficient}
\py{m} is the number of observed species.
We only need the first \py{m} elements of \py{p};
for the others, $x_i$ is 0, so
$p_i^{x_i}$ is 1, and we can leave them out of the product.
\section{Random sampling}
\label{randomdir}
There are two ways to generate a random sample from a Dirichlet
distribution. One is to use the marginal beta distributions, but in
that case you have to select one at a time and scale the rest so they
add up to 1 (see
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
\index{random sample}
A less obvious, but faster, way is to select values from \py{n} gamma
distributions, then normalize by dividing through by the total.
Here's the code:
\index{numpy}
\index{gamma distribution}
\begin{code}
# class Dirichlet
    def Random(self):
        p = numpy.random.gamma(self.params)
        return p / p.sum()
\end{code}
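The sampling approach can be checked against a closed form. The
following sketch is not part of \py{thinkbayes.py}: it uses only the
Python 3 standard library, the helper names are hypothetical, and it
assumes a Dirichlet prior with all parameters equal to 1. In that
case the expected value of $p_1^{x_1} \cdots p_m^{x_m}$ is
\[ \frac{(n-1)! \, x_1! \cdots x_m!}{(n-1+N)!} \]
where $N$ is the total count. The sketch averages over the samples
rather than summing them; the two differ only by a constant factor,
which is normalized away.

```python
import math
import random


def sample_dirichlet(params, rng):
    """Draw one set of prevalences by normalizing gamma variates."""
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    return [g / total for g in gammas]


def sampled_likelihood(n, data, iterations, rng):
    """Average of p_1**x_1 * ... * p_m**x_m over random prevalences."""
    if n < len(data):
        return 0.0                       # more species seen than exist
    total = 0.0
    for _ in range(iterations):
        ps = sample_dirichlet([1] * n, rng)
        like = 1.0
        for p, x in zip(ps, data):       # only the first m prevalences
            like *= p ** x
        total += like
    return total / iterations


def exact_likelihood(n, data):
    """Closed form for a Dirichlet prior with all parameters 1."""
    numer = math.factorial(n - 1)
    for x in data:
        numer *= math.factorial(x)
    return numer / math.factorial(n - 1 + sum(data))


rng = random.Random(17)
data = [3, 2, 1]
estimate = sampled_likelihood(3, data, 20000, rng)
exact = exact_likelihood(3, data)        # equals 1/1680
```

With enough iterations \py{estimate} should be close to \py{exact};
comparing the two is one way to judge how many iterations are needed.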
Now we're ready to look at some results. Here is the code that extracts
the posterior distribution of \py{n}:
\begin{code}
    def DistOfN(self):
        pmf = thinkbayes.Pmf()
        for hypo, prob in self.Items():
            pmf.Set(hypo.n, prob)
        return pmf
\end{code}
\py{DistOfN} iterates
through the top-level hypotheses and accumulates the probability
of each \py{n}.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
\caption{Posterior distribution of \py{n}.}
\label{fig.species2}
\end{figure}
Figure~\ref{fig.species2} shows the result. The most likely value is 4.
Values from 3 to 7 are reasonably likely; after that the probabilities
drop off quickly. The probability that there are 29 species is
low enough to be negligible; if we chose a higher bound,
we would get nearly the same result.
Remember that this result is based on a uniform prior for \py{n}. If
we have background information about the number of species in the
environment, we might choose a different prior. \index{uniform
distribution}
\section{Optimization}
I have to admit that I am proud of this example. The Unseen Species
problem is not easy, and I think this solution is simple and clear,
and takes surprisingly few lines of code (about 50 so far).
The only problem is that it is slow. It's good enough for the example
with only 3 observed species, but not good enough for the belly button
data, with more than 100 species in some samples.
The next few sections present a series of optimizations we need to
make this solution scale. Before we get into the details, here's
a road map.
\index{optimization}
\begin{itemize}
\item The first step is to recognize that if we update the Dirichlet
distributions with the same data, the first \py{m} parameters are
the same for all of them. The only difference is the number of
hypothetical unseen species. So we don't really need \py{n}
Dirichlet objects; we can store the parameters in the top level of
the hierarchy. \py{Species2} implements this optimization.
\item \py{Species2} also uses the same set of random values for all
of the hypotheses. This saves time generating random values, but it
has a second benefit that turns out to be more important: by giving
all hypotheses the same selection from the sample space, we make
the comparison between the hypotheses more fair, so it takes
fewer iterations to converge.
\item Even with these changes there is a major performance problem.
As the number of observed species increases, the array of random
prevalences gets bigger, and the chance of choosing one that is
approximately right becomes small. So the vast majority of
iterations yield small likelihoods that don't contribute much to the
total, and don't discriminate between hypotheses.
The solution is to do the updates one species at a time. {\tt
Species4} is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.
\item Finally, \py{Species5} combines the sub-hypotheses into the top
level and uses \py{numpy} array operations to speed things up.
\index{numpy}
\end{itemize}
If you are not interested in the details, feel free to skip to
Section~\ref{belly} where we look at results from the belly
button data.
\section{Collapsing the hierarchy}
\label{collapsing}
All of the bottom-level Dirichlet distributions are updated
with the same data, so the first \py{m} parameters are the same for
all of them.
We can eliminate them and merge the parameters into
the top-level suite. \py{Species2} implements this optimization:
\index{numpy}
\begin{code}
class Species2(object):
    def __init__(self, ns):
        self.ns = ns
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        self.params = numpy.ones(self.ns[-1], dtype=numpy.int)
\end{code}
\py{ns} is the list of hypothetical values for \py{n};
\py{probs} is the list of corresponding probabilities. And
\py{params} is the sequence of Dirichlet parameters, initially
all 1.
\py{Species2.Update} updates both levels of
the hierarchy: first the probability for each value of \py{n},
then the Dirichlet parameters:
\index{numpy}
\begin{code}
# class Species2
    def Update(self, data):
        like = numpy.zeros(len(self.ns), dtype=numpy.double)
        for i in range(1000):
            like += self.SampleLikelihood(data)
        self.probs *= like
        self.probs /= self.probs.sum()
        m = len(data)
        self.params[:m] += data
\end{code}
\py{SampleLikelihood} returns an array of likelihoods, one for each
value of \py{n}. \py{like} accumulates the total likelihood for
1000 samples. \py{self.probs} is multiplied by the total likelihood,
then normalized. The last two lines, which update the parameters,
are the same as in \py{Dirichlet.Update}.
Now let's look at \py{SampleLikelihood}. There are two
opportunities for optimization here:
\begin{itemize}
\item When the hypothetical number of species, \py{n},
exceeds the observed number, \py{m}, we only need the first \py{m}
terms of the multinomial PMF; the rest are 1.
\item If the number of species is large, the likelihood of the data
might be too small to represent in floating-point (see
Section~\ref{underflow}). So it is safer to compute log-likelihoods.
\index{log-likelihood} \index{underflow}
\end{itemize}
\index{multinomial distribution}
Again, the multinomial PMF is
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
So the log-likelihood is
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
which is fast and easy to compute. Again, $c_x$
is the same for all hypotheses, so we can drop it.
Here's the code:
\index{numpy}
\begin{code}
# class Species2
    def SampleLikelihood(self, data):
        gammas = numpy.random.gamma(self.params)
        m = len(data)
        row = gammas[:m]
        col = numpy.cumsum(gammas)
        log_likes = []
        for n in self.ns:
            ps = row / col[n-1]
            terms = data * numpy.log(ps)
            log_like = terms.sum()
            log_likes.append(log_like)
        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)
        coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
        likes *= coefs
        return likes
\end{code}
\py{gammas} is an array of values from a gamma distribution; its
length is the largest hypothetical value of \py{n}. \py{row} is
just the first \py{m} elements of \py{gammas}; since these are the
only elements that depend on the data, they are the only ones we need.
\index{gamma distribution}
For each value of \py{n} we need to divide \py{row} by the
total of the first \py{n} values from \py{gammas}. \py{cumsum}
computes these cumulative sums and stores them in \py{col}.
\index{cumulative sum}
The loop iterates through the values of \py{n} and accumulates
a list of log-likelihoods.
\index{log-likelihood}
Inside the loop, \py{ps} contains the row of probabilities, normalized
with the appropriate cumulative sum. \py{terms} contains the
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
their sum.
After the loop, we want to convert the log-likelihoods to linear
likelihoods, but first it's a good idea to shift them so the largest
log-likelihood is 0; that way the linear likelihoods are not too
small (see Section~\ref{underflow}).
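The effect of the shift is easy to see in isolation. This sketch uses
only the Python 3 standard library; the function name and the
log-likelihood values are made up for illustration.

```python
import math


def shifted_likes(log_likes):
    """Convert log-likelihoods to likelihoods scaled so the max is 1."""
    biggest = max(log_likes)
    return [math.exp(x - biggest) for x in log_likes]


log_likes = [-1000.0, -1001.0, -1005.0]

# exponentiating directly underflows: every entry becomes 0.0
naive = [math.exp(x) for x in log_likes]

# shifting first preserves the ratios between hypotheses
likes = shifted_likes(log_likes)
```

Subtracting a constant from every log-likelihood multiplies every
likelihood by the same factor, so the shift has no effect after
normalization.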
Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these \py{m}
species, if the total number of species is \py{n}.
\py{BinomialCoef} computes ``n choose m'', which is written
$\binom{n}{m}$.
\index{binomial coefficient}
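The definition of \py{BinomialCoef} is not shown here. Assuming it
returns the ordinary binomial coefficient, the same correction factors
can be computed with \py{math.comb} from the standard library
(Python 3.8 or later). The short range of \py{ns} below is chosen only
for illustration.

```python
import math

m = 3                      # number of observed species
ns = range(3, 8)           # hypothetical totals: 3, 4, 5, 6, 7

# number of ways to choose which m of the n species were observed
coefs = [math.comb(n, m) for n in ns]
```

The result is \py{[1, 4, 10, 20, 35]}, so larger values of \py{n}
get a larger correction.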
As often happens, the optimized version is less readable and more
error-prone than the original. But that's one reason I think it is
a good idea to start with the simple version; we can use it for
regression testing. I plotted results from both versions and confirmed
that they are approximately equal, and that they converge as the
number of iterations increases.
\index{regression testing}
\section{One more problem}
There's more we could do to optimize this code, but there's another
problem we need to fix first. As the number of observed
species increases, this version gets noisier and takes more
iterations to converge on a good answer.
The problem is that if the prevalences we choose from the Dirichlet
distribution, the \py{ps}, are not at least approximately right,
the likelihood of the observed data is close to zero and almost
equally bad for all values of \py{n}. So most iterations don't
provide any useful contribution to the total likelihood. And as the
number of observed species, \py{m}, gets large, the probability of
choosing \py{ps} with non-negligible likelihood gets small. Really
small.
Fortunately, there is a solution. Remember that if you observe
a set of data, you can update the prior distribution with the
entire dataset, or you can break it up into a series of updates
with subsets of the data, and the result is the same either way.
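A small example makes this property concrete. The sketch below does
not use \py{thinkbayes}; it assumes a simple coin-tossing model with a
handful of hypotheses, and the names \py{update} and \py{likelihood}
are illustrative. It is written for Python 3.

```python
def update(suite, likelihood, data):
    """Multiply each probability by the likelihood, then normalize."""
    for hypo in suite:
        suite[hypo] *= likelihood(data, hypo)
    total = sum(suite.values())
    for hypo in suite:
        suite[hypo] /= total


def likelihood(data, hypo):
    """Likelihood of (heads, tails) counts if P(heads) is hypo."""
    heads, tails = data
    return hypo ** heads * (1 - hypo) ** tails


hypos = [0.1, 0.3, 0.5, 0.7, 0.9]

batch = {h: 1.0 / len(hypos) for h in hypos}
update(batch, likelihood, (3, 2))        # all five tosses at once

stepwise = {h: 1.0 / len(hypos) for h in hypos}
update(stepwise, likelihood, (3, 0))     # the three heads first
update(stepwise, likelihood, (0, 2))     # then the two tails
```

Up to floating-point error, \py{batch} and \py{stepwise} contain the
same posterior probabilities.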
For this example, the key is to perform the updates one species at
a time. That way when we generate a random set of \py{ps}, only
one of them affects the computed likelihood, so the chance of choosing
a good one is much better.
Here's a new version that updates one species at a time:
\index{numpy}
\begin{code}
class Species4(Species):
    def Update(self, data):
        m = len(data)
        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
\end{code}
This version inherits \verb"__init__" from \py{Species}, so it
represents the hypotheses as a list of Dirichlet objects (unlike
\py{Species2}).
\py{Update} loops through the observed species and makes an
array, \py{one}, with all zeros and one species count. Then
it calls \py{Update} in the parent class, which computes
the likelihoods and updates the sub-hypotheses.
So in the running example, we do three updates. The first
is something like ``I have seen three lions.'' The second is
``I have seen two tigers and no additional lions.'' And the third
is ``I have seen one bear and no more lions and tigers.''
Here's the new version of \py{Likelihood}:
\begin{code}
# class Species4
    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        for i in range(self.iterations):
            like += dirichlet.Likelihood(data)
        # correct for the number of unseen species the new one
        # could have been
        m = len(data)
        num_unseen = dirichlet.n - m + 1
        like *= num_unseen
        return like
\end{code}
This is almost the same as \py{Species.Likelihood}. The difference
is the factor, \verb"num_unseen". This correction is necessary
because each time we see a species for the first time, we have to
consider that there were some number of other unseen species that
we might have seen. For larger values of \py{n} there are more
unseen species that we could have seen, which increases the likelihood
of the data.
This is a subtle point and I have to admit that I did not get it right
the first time. But again I was able to validate this version
by comparing it to the previous versions.
\index{regression testing}
\section{We're not done yet}
\newcommand{\BigO}[1]{\mathcal{O}(#1)}
Performing the updates one species at a time solves one problem, but
it creates another. Each update takes time proportional to $k m$,
where $k$ is the number of hypotheses and $m$ is the number of observed
species. So if we do $m$ updates, the total run time is
proportional to $k m^2$.
But we can speed things up using the same trick we used in
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
collapse the two levels of the hierarchy into a single object. So
here's yet another version of \py{Species}:
\begin{code}
class Species5(Species2):
    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
\end{code}
This version inherits \verb"__init__" from \py{Species2}, so
it uses \py{ns} and \py{probs} to represent the distribution
of \py{n}, and \py{params} to represent the parameters of
the Dirichlet distribution.
\py{Update} is similar to what we saw in the previous section.
It loops through the observed species and calls \py{UpdateOne}:
\index{numpy}
\begin{code}
# class Species5
    def UpdateOne(self, i, count):
        likes = numpy.zeros(len(self.ns), dtype=numpy.double)
        for _ in range(self.iterations):
            likes += self.SampleLikelihood(i, count)
        unseen_species = [n-i+1 for n in self.ns]
        likes *= unseen_species
        self.probs *= likes
        self.probs /= self.probs.sum()
\end{code}
This function is similar to \py{Species2.Update}, with two changes:
\begin{itemize}
\item The interface is different. Instead of the whole dataset, we
get \py{i}, the index of the observed species, and \py{count},
how many of that species we've seen.
\item We have to apply a correction factor for the number of unseen
species, as in \py{Species4.Likelihood}. The difference here is
that we update all of the likelihoods at once with array
multiplication.
\end{itemize}
Finally, here's \py{SampleLikelihood}:
\index{numpy}
\begin{code}
# class Species5
    def SampleLikelihood(self, i, count):
        gammas = numpy.random.gamma(self.params)
        sums = numpy.cumsum(gammas)[self.ns[0]-1:]
        ps = gammas[i-1] / sums
        log_likes = numpy.log(ps) * count
        log_likes -= numpy.max(log_likes)
        likes = numpy.exp(log_likes)
        return likes
\end{code}
This is similar to \py{Species2.SampleLikelihood}; the
difference is that each update only includes a single species,
so we don't need a loop.
The runtime of this function is proportional to the number
of hypotheses, $k$. It runs $m$ times, so the run time of
the update is proportional to $k m$.
And the number of iterations we
need to get an accurate result is usually small.
\section{The belly button data}
\label{belly}
That's enough about lions and tigers and bears.
Let's get back to belly buttons. To get a sense of what the
data look like, consider subject B1242,
whose sample of 400 reads yielded 61 species with the following
counts:
\begin{code}
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
\end{code}
There are a few dominant species that make up a large
fraction of the whole, but many species that yielded only
a single read. The number of these ``singletons'' suggests
that there are likely to be at least a few unseen species.
\index{species}
In the example with lions and tigers, we assume that each
animal in the preserve is equally likely to be observed.
Similarly, for the belly button data, we assume that each
bacterium is equally likely to yield a read.
In reality, each step in the data-collection
process might introduce biases. Some species might
be more likely to be picked up by a swab, or to yield identifiable
amplicons. So when we talk about the prevalence of each species,
we should remember this source of error.
\index{sample bias}
I should also acknowledge that I am using the term ``species''
loosely. First, bacterial species are not well defined. Second,
some reads identify a particular species, others only identify
a genus. To be more precise, I should say ``operational
taxonomic unit'', or OTU.
\index{operational taxonomic unit}
\index{OTU}
Now let's process some of the belly button data. I define
a class called \py{Subject} to represent information about
each subject in the study:
\begin{code}
class Subject(object):
    def __init__(self, code):
        self.code = code
        self.species = []
\end{code}
Each subject has a string code, like ``B1242'', and a list of
(count, species name) pairs, sorted in increasing order by count.
\py{Subject} provides several methods to make it
easy to access these counts and species names. You can see the details
in \url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
\caption{Distribution of \py{n} for subject B1242.}
\label{species-ndist}
\end{figure}
\py{Subject} provides a method named \py{Process} that creates and
updates a \py{Species5} suite,
which represents the distributions of \py{n} and the prevalences.
\index{prevalence}
And \py{Species2} provides \py{DistOfN}, which returns the posterior
distribution of \py{n}.
\begin{code}
# class Species2
    def DistOfN(self):
        items = zip(self.ns, self.probs)
        pmf = thinkbayes.MakePmfFromItems(items)
        return pmf
\end{code}
Figure~\ref{species-ndist} shows the distribution of \py{n} for
subject B1242. The probability that there are exactly 61 species, and
no unseen species, is nearly zero. The most likely value is 72, with
90\% credible interval 66 to 79. At the high end, it is unlikely that
there are as many as 87 species.
Next we compute the posterior distribution of prevalence for
each species. \py{Species2} provides \py{DistOfPrevalence}:
\begin{code}
# class Species2
    def DistOfPrevalence(self, index):
        metapmf = thinkbayes.Pmf()
        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            metapmf.Set(pmf, prob)
        mix = thinkbayes.MakeMixture(metapmf)
        return metapmf, mix
\end{code}
\py{index} indicates which species we want. For each
\py{n}, we have a different posterior distribution
of prevalence.
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
\caption{Distribution of prevalences for subject B1242.}
\label{species-prev}
\end{figure}
The loop iterates through the possible values of \py{n}
and their probabilities. For each value of \py{n} it gets
a Beta object representing the marginal distribution for the
indicated species. Remember that Beta objects contain the
parameters \py{alpha} and \py{beta}; they don't have
values and probabilities like a Pmf, but they provide \py{MakePmf},
which generates a discrete approximation to the continuous
beta distribution.
\index{Beta object}
\py{metapmf} is a meta-Pmf that contains the distributions
of prevalence, conditioned on \py{n}. \py{MakeMixture}
combines the meta-Pmf into \py{mix}, which combines the
conditional distributions into a single distribution
of prevalence.
\index{meta-Pmf}
\index{mixture}
\index{MakeMixture}
Figure~\ref{species-prev} shows results for the five
species with the most reads. The most prevalent species accounts for
23\% of the 400 reads, but since there are almost certainly unseen
species, the most likely estimate for its prevalence is 20\%,
with 90\% credible interval between 17\% and 23\%.
\section{Predictive distributions}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
\caption{Simulated rarefaction curves for subject B1242.}
\label{species-rare}
\end{figure}
I introduced the Unseen Species problem in the form of four related
questions. We have answered the first two by computing the posterior
distribution for \py{n} and the prevalence of each species.
\index{predictive distribution}
The other two questions are:
\begin{itemize}
\item If we are planning to collect additional reads, can we predict
how many new species we are likely to discover?
\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?
\end{itemize}
To answer predictive questions like these, we can use the posterior
distributions to simulate possible future events and compute
predictive distributions for the number of species, and the fraction
of the total, that we are likely to see.
The kernel of these simulations looks like this:
\index{simulation}
\begin{enumerate}
\item Choose \py{n} from its posterior distribution.
\item Choose a prevalence for each species, including possible unseen
species, using the Dirichlet distribution.
\index{Dirichlet distribution}
\item Generate a random sequence of future observations.
\item Compute the number of new species, \verb"num_new", as a function
of the number of additional reads, \py{k}.
\item Repeat the previous steps and accumulate the joint distribution
of \verb"num_new" and \py{k}.
\index{joint distribution}
\end{enumerate}
And here's the code. \py{RunSimulation} runs a single simulation:
\begin{code}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)
        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)
            num_new = len(seen) - m
            curve.append((k+1, num_new))
        return curve
\end{code}
\verb"num_reads" is the number of additional reads to simulate.
\py{m} is the number of seen species, and \py{seen} is a set of
strings with a unique name for each species.
\py{n} is a random value from the posterior distribution, and
\py{observations} is a random sequence of species names.
Each time through the loop, we add the new observation to
\py{seen} and record the number of reads and the number of
new species so far.
The result of \py{RunSimulation} is a {\bf rarefaction curve},
represented as a list of pairs with the number of reads and
the number of new species.
\index{rarefaction curve}
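To see the bookkeeping in isolation, here is a small sketch that builds a rarefaction curve from a fixed sequence of observations instead of a random one. The function \verb"rarefaction_curve" and the one-letter species names are hypothetical, chosen only for illustration.

```python
def rarefaction_curve(seen, observations):
    """Return a list of (k, num_new) pairs (illustrative sketch).

    seen: collection of species names observed so far
    observations: sequence of species names from additional reads
    """
    seen = set(seen)      # copy, so the caller's collection is unchanged
    m = len(seen)         # number of species seen before the new reads
    curve = []
    for k, obs in enumerate(observations, start=1):
        seen.add(obs)
        curve.append((k, len(seen) - m))
    return curve


curve = rarefaction_curve({'a', 'b'}, ['a', 'c', 'c', 'd', 'b'])
```

Here the second read discovers species \py{c} and the fourth discovers \py{d}, so the curve is flat except at those two reads.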
Before we see the results, let's look at \py{GetSeenSpecies} and
\py{GenerateObservations}.
\begin{code}
# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
\end{code}
\py{GetNames} returns the list of species names that appear in
the data files, but for many subjects these names are not unique.
So I use \py{SpeciesGenerator} to extend each name with a serial
number:
\index{generator}
\begin{code}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1
    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{code}
Given a name like \py{Corynebacterium}, \py{SpeciesGenerator} yields
\py{Corynebacterium-1}. When the list of names is exhausted, it
yields names like \py{unseen-62}.
Here is \py{GenerateObservations}:
\begin{code}
# class Subject

    def GenerateObservations(self, num_reads):
        n, prevalences = self.suite.SamplePosterior()
        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)
        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_reads)
        return n, observations
\end{code}
Again, \verb"num_reads" is the number of additional reads
to generate. \py{n} and \py{prevalences} are samples from
the posterior distribution.
\py{cdf} is a Cdf object that maps species names, including the
unseen, to cumulative probabilities. Using a Cdf makes it efficient
to generate a random sequence of species names.
\index{Cdf}
\index{cumulative probability}
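The reason a Cdf makes sampling efficient is that each random draw reduces to a binary search over the cumulative probabilities. The following sketch shows the idea using only the standard library; the functions \verb"make_cdf" and \py{sample}, the species names, and the prevalences are all made up for illustration and are not part of \py{thinkbayes}.

```python
import bisect
import itertools
import random


def make_cdf(d):
    """d maps name -> probability; returns (names, cumulative probs)."""
    names = sorted(d)
    total = sum(d.values())
    cumulative = list(itertools.accumulate(d[name] / total for name in names))
    cumulative[-1] = 1.0   # guard against floating-point shortfall
    return names, cumulative


def sample(names, cumulative, num, rng=random):
    """Draw num names by inverting the CDF with binary search."""
    observations = []
    for _ in range(num):
        # rng.random() is in [0, 1), so the index is always in range.
        index = bisect.bisect_right(cumulative, rng.random())
        observations.append(names[index])
    return observations


rng = random.Random(17)
prevalences = {'Corynebacterium-0': 0.7, 'Staphylococcus-1': 0.2,
               'unseen-2': 0.1}
names, cumulative = make_cdf(prevalences)
observations = sample(names, cumulative, 1000, rng)
```

Building the cumulative list takes time proportional to the number of species, but after that each draw takes only logarithmic time.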
Finally, here is \py{Species2.SamplePosterior}:
\begin{code}
# class Species2

    def SamplePosterior(self):
        pmf = self.DistOfN()
        n = pmf.Random()
        prevalences = self.SamplePrevalences(n)
        return n, prevalences
\end{code}
And \py{SamplePrevalences}, which generates a sample of
prevalences conditioned on \py{n}:
\index{numpy}
\index{random sample}
\begin{code}
# class Species2

    def SamplePrevalences(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
\end{code}
We saw this algorithm for generating random values from a Dirichlet
distribution in Section~\ref{randomdir}.
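For readers who want to try the algorithm without NumPy, here is a sketch of the same idea using \py{random.gammavariate} from the standard library: draw one gamma variate per parameter, then normalize so the values add up to 1. The function name \verb"sample_dirichlet" and the parameter values are assumptions made for the example.

```python
import random


def sample_dirichlet(params, rng=random):
    """Draw one sample from a Dirichlet distribution (illustrative sketch).

    params: sequence of positive Dirichlet parameters
    """
    # One gamma variate per parameter, each with scale 1.
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    # Normalizing makes the values a valid set of prevalences.
    return [g / total for g in gammas]


rng = random.Random(1)
prevalences = sample_dirichlet([3.0, 2.0, 1.0], rng)
```

The result is a list of prevalences, one per species, that sums to 1 up to floating-point error.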
Figure~\ref{species-rare} shows 100 simulated rarefaction curves
for subject B1242. The curves are ``jittered;''
that is, I shifted each curve by a random offset so they
would not all overlap. By inspection we can estimate that after
400 more reads we are likely to find 2--6 new species.
\section{Joint posterior}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
\caption{Distributions of the number of new species conditioned on
the number of additional reads.}
\label{species-cond}
\end{figure}
We can use these simulations to estimate the
joint distribution of \verb"num_new" and \py{k}, and from that
we can get the distribution of \verb"num_new" conditioned on any
value of \py{k}.
\index{joint distribution}
\begin{code}
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
\end{code}
\py{MakeJointPredictive} makes a Joint object, which is a
Pmf whose values are tuples.
\index{Joint object}
\py{curves} is a list of rarefaction curves created by
\py{RunSimulation}. Each curve contains a list of pairs of
\py{k} and \verb"num_new".
\index{rarefaction curve}
The resulting joint distribution is a map from each pair to
its probability of occurring. Given the joint distribution, we
can use \py{Joint.Conditional} to
get the distribution of \verb"num_new" conditioned on \py{k}
(see Section~\ref{conditional}).
\index{conditional distribution}
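As a rough sketch of what conditioning on \py{k} involves, the following example counts the pairs with a \py{Counter}, normalizes to get a joint distribution, and then selects and renormalizes the pairs with a given \py{k}. The functions \verb"make_joint" and \py{conditional} and the four tiny curves are hypothetical; they stand in for the Joint object and are not its implementation.

```python
from collections import Counter


def make_joint(curves):
    """Map each (k, num_new) pair to its relative frequency."""
    counts = Counter(pair for curve in curves for pair in curve)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def conditional(joint, k):
    """Distribution of num_new among the pairs with the given k."""
    cond = {num_new: p for (kk, num_new), p in joint.items() if kk == k}
    total = sum(cond.values())
    if total == 0:
        raise ValueError("no simulated curve contains k=%d" % k)
    return {num_new: p / total for num_new, p in cond.items()}


curves = [
    [(1, 0), (2, 1), (3, 1)],
    [(1, 1), (2, 1), (3, 2)],
    [(1, 0), (2, 0), (3, 1)],
    [(1, 0), (2, 1), (3, 1)],
]
joint = make_joint(curves)
cond = conditional(joint, 3)
```

In this toy data, three of the four curves have found one new species after three reads and one curve has found two, so the conditional distribution assigns them probabilities 0.75 and 0.25.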
\py{MakeConditionals} takes the curves and a list of \py{ks}
and computes the conditional distribution of \verb"num_new"
for each \py{k}. The result is a list of Cdf objects.
\begin{code}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)
    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)
    return cdfs
\end{code}
Figure~\ref{species-cond} shows the results. After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5. After 800 reads, we expect to see 3 to 12 new
species.
\section{Coverage}
\begin{figure}
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}
The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}
To answer this question, we need a version of \py{RunSimulation}
that computes the fraction of observed species rather than the
number of new species.
\begin{code}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)
        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)
            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))
        return curve
\end{code}
Next we loop through each curve and make a dictionary, \py{d},
that maps from the number of additional reads, \py{k}, to
a list of \py{fracs}; that is, a list of values for the
coverage achieved after \py{k} reads.
\begin{code}
def MakeFracCdfs(self, curves):
    d = {}
    for curve in curves:
        for k, frac in curve:
            d.setdefault(k, []).append(frac)
    cdfs = {}
    for k, fracs in d.items():
        cdf = thinkbayes.MakeCdfFromList(fracs)
        cdfs[k] = cdf
    return cdfs
\end{code}
Then for each value of \py{k} we make a Cdf of \py{fracs}; this Cdf
represents the distribution of coverage after \py{k} reads.
Remember that the CDF tells you the probability of falling below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it. Figure~\ref{species-frac} shows
complementary CDFs for a range of values of \py{k}.
\index{complementary CDF}
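For a list of simulated coverage values, the complementary CDF at a threshold is just the fraction of values that exceed it. Here is a minimal sketch; the function \verb"prob_exceeds" and the five coverage values are made up for illustration.

```python
def prob_exceeds(fracs, threshold):
    """Fraction of simulated coverage values strictly above threshold."""
    return sum(1 for frac in fracs if frac > threshold) / len(fracs)


# Hypothetical coverage values from five simulations at one value of k.
fracs = [0.80, 0.85, 0.90, 0.95, 1.00]
p = prob_exceeds(fracs, 0.90)
```

Two of the five values are greater than 0.90, so the estimated probability of exceeding 90\% coverage is 0.4.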
To read this figure, select the level of coverage you want to achieve
along the $x$-axis. As an example, choose 90\%.
\index{coverage}
Now you can read up the chart to find the probability of achieving
90\% coverage after \py{k} reads. For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage. With 1000 reads, you
have a 90\% chance of getting 90\% coverage.
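The same reading can be automated. As a sketch, suppose we have a dictionary that maps each \py{k} to a list of simulated coverage values; then we can search for the smallest \py{k} at which the chance of reaching the desired coverage is high enough. The function \verb"reads_needed", its parameters, and the data below are hypothetical and are not taken from \py{species.py}.

```python
def reads_needed(fracs_by_k, coverage, confidence):
    """Smallest k whose chance of reaching `coverage` is >= `confidence`.

    fracs_by_k: dict that maps k -> list of simulated coverage values
    Returns None if no simulated k is sufficient.
    """
    for k in sorted(fracs_by_k):
        fracs = fracs_by_k[k]
        # Fraction of simulations that reach at least the target coverage.
        p = sum(1 for frac in fracs if frac >= coverage) / len(fracs)
        if p >= confidence:
            return k
    return None


# Made-up coverage values from four simulations at each k.
fracs_by_k = {
    100: [0.70, 0.80, 0.90, 0.85],
    200: [0.90, 0.95, 0.85, 0.92],
    400: [0.95, 0.97, 0.93, 0.91],
}
k_75 = reads_needed(fracs_by_k, coverage=0.9, confidence=0.75)
k_90 = reads_needed(fracs_by_k, coverage=0.9, confidence=0.9)
```

With these made-up values, 200 reads are enough for a 75\% chance of 90\% coverage, but a 90\% chance requires 400 reads.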
With that, we have answered the four questions that make up the unseen
species problem. To validate the algorithms in this chapter with
real data, I had to deal with a few more details. But
this chapter is already too long, so I won't discuss them here.
You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.
You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.
\section{Discussion}
The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution. So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier. I'm very happy about that.
My goal for this book is to present three related ideas:
\begin{itemize}
\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
the idea of using probability distributions to represent uncertain
beliefs, using data to update those distributions, and using the
results to make predictions and inform decisions.
\item {\bf A computational approach}: The premise of this book is that
it is easier to understand Bayesian analysis using computation
rather than math, and easier to implement Bayesian methods with
reusable building blocks that can be rearranged to solve real-world
problems quickly.
\item {\bf Iterative modeling}: Most real-world problems involve
modeling decisions and trade-offs between realism and complexity.
It is often impossible to know ahead of time what factors should be
included in the model and which can be abstracted away. The best
approach is to iterate, starting with simple models and adding
complexity gradually, using each model to validate the others.
\end{itemize}
These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.
If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work. I hope you find
them useful; let me know how it goes!
\printindex
\end{document}