% LaTeX source for ``Think Bayes: Bayesian Statistics Made Simple''
% Second edition
% Copyright 2020  Allen B. Downey.

% License: Creative Commons
% Attribution-NonCommercial-ShareAlike 4.0 International
% http://creativecommons.org/licenses/by-nc-sa/4.0/
%

\documentclass[12pt]{book}

\title{Think Bayes}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Bayes}
\newcommand{\thesubtitle}{Bayesian Statistics Made Simple}
\newcommand{\theauthor}{Allen B. Downey}
\newcommand{\theversion}{Version 2.1.0}

%%%% Both LATEX and PLASTEX

\usepackage{booktabs}

\usepackage{graphicx}
\usepackage{setspace}

\usepackage{amsmath}
\usepackage{amsthm}

% format end of chapter exercises
\newtheoremstyle{exercise}
  {12pt}        % space above
  {12pt}        % space below
  {}            % body font
  {}            % indent amount
  {\bfseries}   % head font
  {}            % punctuation
  {12pt}        % head space
  {}            % custom head
\theoremstyle{exercise}
\newtheorem{exercise}{Exercise}[chapter]

\newif\ifplastex
\plastexfalse

%%%% PLASTEX ONLY
\ifplastex

\makeindex

\usepackage{localdef}

\usepackage{url}
\renewcommand{\href}[2]{\url{#1}}

\makeatletter
\newcount\anchorcnt
\newcommand*{\Anchor}[1]{%
  \@bsphack%
    \Hy@GlobalStepCount\anchorcnt%
    \edef\@currentHref{anchor.\the\anchorcnt}%
    \Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}%
    \M@gettitle{}\label{#1}%
    \@esphack%
}
\makeatother

% code listing environments:
% we don't need these for plastex because they get replaced
% by preprocess.py
%\newenvironment{code}{\begin{verbatim}}{\end{verbatim}}
%\newenvironment{stdout}{\begin{verbatim}}{\end{verbatim}}

% inline syntax formatting
%\newcommand{\py}{\verb}%}
%\newcommand{\py}{\texttt}%}
\newcommand{\py}[1]{{\tt #1}}%{
\newcommand{\textcolor}[1]{\relax}

%%%% LATEX/HTML ONLY
\else

%BEGIN LATEX
\usepackage{comment}
\excludecomment{htmlonly}
\includecomment{latexonly}
%END LATEX

\input{latexonly.tex}

\fi

%%%% END OF PREAMBLE
\begin{document}

\frontmatter

%%%% PLASTEX ONLY
\ifplastex

\maketitle

%%%% LATEX/HTML ONLY
\else

\begin{latexonly}

%--half title-------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

\theversion

\vfill
\end{flushright}

%--verso------------------------------------------------------
\newpage
\thispagestyle{empty}

\quad

%--title page-------------------------------------------------
\newpage
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

\theversion

\vspace{1in}

{\Large \theauthor}

\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

\vfill
\end{flushright}

%--copyright--------------------------------------------------
\newpage
\thispagestyle{empty}

Copyright \copyright ~2020 \theauthor.

\vspace{0.2in}

\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham, MA 02492
\end{flushleft}

Permission is granted to copy, distribute, and/or modify this work under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which is available at \url{https://creativecommons.org/licenses/by-nc-sa/4.0/}.


The \LaTeX\ source for this book is available from
\url{http://greenteapress.com/thinkbayes2}.

%--table of contents------------------------------------------

\cleardoublepage
\setcounter{tocdepth}{1}
\tableofcontents

\end{latexonly}

%--HTML title page--------------------------------------------

\begin{htmlonly}

\vspace{1em}

{\Large \thetitle: \thesubtitle}

{\large \theauthor}

\theversion

\vspace{1em}

Copyright \copyright ~2020 \theauthor.

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License, which is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

%-------------------------------------------------------------

%%%% END OF THE PART WE SKIP FOR PLASTEX
\fi

\chapter{Preface}
\label{preface}

\section{My theory, which is mine}

The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.

Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics.  As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.
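
As a rough illustration of that correspondence (the symbols here are shorthand for this one example, not notation used in the rest of the book), suppose a quantity $x$ has probability density $f(x)$.  A math book would compute its mean with an integral.  A discrete approximation replaces the density with a grid of values $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$ that add up to 1, and the integral with a sum:
%
\[ \int x\,f(x)\,dx \quad\approx\quad \sum_{i=1}^{n} x_i\,p_i \]
%
In code, the sum on the right is a loop over the grid.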

I think this presentation is easier to understand, at least for people with
programming skills.  It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.

Also, it provides a smooth development path from simple examples to
real-world problems.  Chapter~\ref{estimation} is a good example.  It
starts with a simple example involving dice, one of the staples of
basic probability.  From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's {\it
  Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.

\section{Modeling and approximation}

Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling.  Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away.  \index{modeling}

For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game.  I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game.  That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}

In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States).  I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard.  I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.

I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).

Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors.  But for
real-world problems, numerical errors are almost always
smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.

On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.

So I recommend a general process with these steps:

\begin{enumerate}

\item While you are exploring a problem, start with simple models and
  implement them in code that is clear, readable, and demonstrably
  correct.  Focus your attention on good modeling decisions, not
  optimization.

\item Once you have a simple model working, identify the
  biggest sources of error.  You might need to increase the number of
  values in a discrete approximation, or increase the number of
  iterations in a Monte Carlo simulation, or add details to the model.

\item If the performance of your solution is good enough for your
  application, you might not have to do any optimization.  But if you
  do, there are two approaches to consider.  You can review your code
  and look for optimizations; for example, if you cache previously
  computed results you might be able to avoid redundant computation.
  Or you can look for analytic methods that yield computational
  shortcuts.

\end{enumerate}

One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.

Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}
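
Here is a minimal sketch of that workflow, using a toy problem chosen only for illustration: the mean of a uniform distribution over the integers from 1 to $n$.  The function names are hypothetical and do not come from the code that accompanies this book.  The first function is the clear, linear-time reference implementation; the second uses the analytic result $(n+1)/2$; the loop at the end is the regression test.

\begin{code}
import math

def mean_by_summation(n):
    """Reference implementation (hypothetical name): mean of a uniform
    distribution on 1..n, as a probability-weighted sum.  Linear time."""
    p = 1 / n                      # each value is equally likely
    total = 0
    for x in range(1, n + 1):
        total += x * p
    return total

def mean_by_formula(n):
    """Optimized version: the analytic result (n + 1) / 2.  Constant time."""
    return (n + 1) / 2

# Regression test: the shortcut should agree with the reference,
# at least approximately (floating-point sums are not exact).
for n in [1, 6, 100, 1000]:
    assert math.isclose(mean_by_summation(n), mean_by_formula(n))
\end{code}

The comparison uses \py{math.isclose} rather than \py{==} because the two versions can differ by floating-point rounding even when both are correct.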

\section{Working with the code}
\label{codeinfo}

There are several ways you can work with the code in this book:

\begin{itemize}

\item If you don't have a programming environment where you can run Jupyter notebooks, and you don't want to create one, you can run the notebooks on Colab, which is an online service provided by Google.  Colab lets you run Jupyter notebooks in a browser without installing anything.

\item If you have Python and Jupyter installed, you can download the code and run it on your computer.

\end{itemize}

To run the notebooks on Colab, you can follow the links at the end of each chapter, or you can start from \url{}, which has links to all of the notebooks.

If you already have Python and Jupyter, you can download the code from
my Git repository, at \url{https://github.com/AllenDowney/ThinkBayes2}.  Git is a version control system that allows you to keep track of the files that make up a project.
A collection of files under Git's control is
called a ``repository''.
GitHub is a hosting service that provides storage for Git repositories and a convenient web interface.

\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to download the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button.  If you don't already
have a GitHub account, you'll need to create one.  After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book.  Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}

\item Or you could clone
my repository.  You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.  Or you can download the Zip file from \url{}.

\end{itemize}

If you don't have Python and Jupyter installed already, I recommend you install Anaconda, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install.  By default it installs files in your home directory, so you don't need administrator privileges.  You can download Anaconda from \url{https://www.anaconda.com/products/individual}.
\index{Anaconda}

If you install Anaconda, you will have most of the packages you need to run the code in this book.
To make sure you have everything you need (and the right versions), the best option is to create a Conda environment.  And the best way to do that is to use the command line.
If you are not familiar with the command line, you might want to run the notebooks on Colab.

\begin{enumerate}

\item After downloading my repository, you should have a directory named \py{ThinkBayes2}.  Use \py{cd} to move into that directory.

\item Use \py{ls} to confirm that you have a file named \py{environment.yml}.  It lists the packages you need.

\item Run the following command to create an environment:

\begin{verbatim}
conda env create -f environment.yml
\end{verbatim}

\item Run the following command to activate the environment you just created:

\begin{verbatim}
conda activate ThinkBayes2
\end{verbatim}

\item To test your environment and make sure it has everything we need, run the following command:

\begin{verbatim}
python test_env.py
\end{verbatim}

\end{enumerate}

If you don't want to create an environment just for this book, you can install what you need using Conda.
The following commands should get everything you need:

\begin{verbatim}
conda install python jupyter pandas scipy matplotlib
pip install empiricaldist
\end{verbatim}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item Jupyter to run the notebooks, \url{https://jupyter.org/};
\index{Jupyter}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation, \url{http://www.scipy.org/};
\index{SciPy}

\item Pandas for working with data, \url{https://pandas.pydata.org/};
\index{Pandas}

\item matplotlib for visualization, \url{http://matplotlib.org/};
\index{matplotlib}

\item empiricaldist for representing distributions, \url{}.
\index{empiricaldist}
%TODO: add this URL

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments.  If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}
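
If you install the packages by hand, a quick way to see what is still missing is to ask Python whether it can find each one.  The following is a minimal sketch of that kind of check.  It is an illustration only, not the contents of \verb"test_env.py", and the name \verb"find_missing" is made up for this example.  It looks up the import names of the libraries listed above and leaves out Jupyter, which is normally started from the command line.

\begin{code}
import importlib.util

# Import names of the libraries listed above (Jupyter is left out).
REQUIRED = ['numpy', 'scipy', 'pandas', 'matplotlib', 'empiricaldist']

def find_missing(names):
    """Return the names that Python cannot find in this environment.

    find_spec looks a module up without importing it, so the check
    is fast and has no side effects.
    """
    return [name for name in names
            if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print('Missing packages:', ', '.join(missing))
else:
    print('All required packages were found.')
\end{code}

This only tells you whether a package is present, not whether it is the right version; for that, the Conda environment described above is the safer route.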

\section{Code style}

Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}

Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this".  In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".

I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places.  Once I got used to Google style, I found that I liked
it.  And at this point, it would be too much trouble to change.

Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others.  I don't have a strong
preference.  I had to choose one, and this is the one I chose.

And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.

\section{Prerequisites}

There are several excellent modules for doing Bayesian statistics in
Python, including \py{pymc} and OpenBUGS.  I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal.  If you know Python and a little bit about
probability, you are ready to start this book.

Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code.  Chapter~\ref{compstat} introduces \py{Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF).  Then Chapter~\ref{estimation} introduces \py{Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.

In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along.  If you are not familiar
with these distributions, you can read about them on Wikipedia.  You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).

\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}.  If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search.  Page and
section numbers are fine, too, but not as easy to work with.
Thanks!

\small

\begin{itemize}

\item First, I have to acknowledge David MacKay's excellent book,
  {\it Information Theory, Inference, and Learning Algorithms}, which is
  where I first came to understand Bayesian methods.  With his
  permission, I use several problems from
  his book as examples.

\item This book also benefited from my interactions with Sanjoy
  Mahajan, especially in fall 2012, when I audited his class on
  Bayesian Inference at Olin College.

\item I wrote parts of this book during project nights with the Boston
  Python User Group, so I would like to thank them for their
  company and pizza.

\item Olivier Yiptong sent several helpful suggestions.

\item Yuriy Pasichnyk found several errors.

\item Kristopher Overholt sent a long list of corrections and suggestions.

\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.

\item Markus Dobler pointed out that drawing cookies from a bowl
with replacement is an unrealistic scenario.

\item In spring 2013, students in my class, Computational Bayesian
  Statistics, made many helpful corrections and suggestions: Kai
  Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun
  Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,
  Brendan Ritter, and Evan Simpson.

\item Greg Marra and Matt Aasted helped me clarify the discussion of
  {\it The Price is Right} problem.

\item Marcus Ogren pointed out that the original statement of the
  locomotive problem was ambiguous.

\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the
  book and found many opportunities for improvement.

\item Linda Pescatore found a typo and made some helpful suggestions.

\item Tomasz Miasko sent many excellent corrections and suggestions.

% ENDCONTRIB

\end{itemize}

Other people who spotted typos and small errors include
Tom Pollard,
Paul A. Giannaros,
Jonathan Edwards,
George Purkins,
Robert Marcus,
Ram Limbu,
James Lawry,
Ben Kahle,
Jeffrey Law, and
Alvaro Sanchez.

\normalsize

\newpage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\newpage

\end{latexonly}

% START THE BOOK
\mainmatter

\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}

\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}

\chapter{Bayes's Theorem}
\label{intro}

\section{Conditional probability}

The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability.  So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}

A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction.  The value
1 represents certainty that a fact is true, or that a prediction
will come true.  The value 0 represents certainty
that the fact is false.
\index{degree of belief}

Intermediate values represent degrees of certainty.  The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not.
For example, the probability that a tossed coin lands ``heads'' is close to 50\%.
\index{coin toss}

A conditional probability is a probability based on some relevant information.  For example, suppose I toss two coins.
The probability that both coins land heads is 25\%.

But suppose I toss two coins and, without showing you the result, tell you that at least one of the coins is heads.
What is the probability that both are heads?
The answer is 1/3.

Here's how I got that: when I toss the coins, there are four equally likely outcomes: heads-heads, heads-tails, tails-heads, and tails-tails.
When I tell you that at least one coin is heads, that eliminates one outcome, tails-tails.

The remaining outcomes are heads-heads, heads-tails, and tails-heads, and they are still equally likely.
So the probability of heads-heads is 1/3.

That argument is correct, but if you don't find it entirely convincing, we'll come back to this problem and solve it more carefully using Bayes's theorem.

In this example, we computed the conditional probability of two heads, given the information that at least one coin is heads.

The usual notation for conditional probability is $\p{A|B}$, which
is the probability of $A$ given that $B$ is true.  In this
example, $A$ represents the two heads, and $B$ is the condition that at least one coin is heads.
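
If you would rather check the coin argument by brute force, here is a short sketch that enumerates the outcomes and counts them.  It assumes, as the argument above does, that the four outcomes are equally likely; the variable names are chosen just for this example.

\begin{code}
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of tossing two coins.
outcomes = list(product('HT', repeat=2))

# B: at least one coin is heads.
b = [o for o in outcomes if 'H' in o]

# A and B: both coins are heads.
a_and_b = [o for o in b if o == ('H', 'H')]

# With equally likely outcomes, p(A|B) is a ratio of counts.
p_a_given_b = Fraction(len(a_and_b), len(b))
print(p_a_given_b)   # 1/3
\end{code}

Conditioning on $B$ amounts to throwing away the outcomes where $B$ is false and counting within what is left.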

\section{Conjoint probability}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true.  I'll use the notation $\p{A \AND B}$ to mean the
probability that $A$ and $B$ are both true.

\index{conjoint probability}
\index{probability!conjoint}

If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
For example, if I toss two coins, and $A$ means the first coin lands
heads, and $B$ means the second coin lands heads, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.

But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the first outcome does
not change the probability of the second.  Or, more formally,
$\p{B|A} = \p{B}$.
\index{independence}
\index{dependence}

Here is a different example where the outcomes are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow.  If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.

In general, the probability of a conjunction is
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
for any $A$ and $B$.  So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.
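
To put a number on it, suppose that $\p{B|A} = 0.6$; that is, the chance of rain tomorrow rises from 0.5 to 0.6 if it rains today.  This value is made up for illustration, not estimated from weather data.  Then
%
\[ \p{A \AND B} = \p{A}~\p{B|A} = (0.5)~(0.6) = 0.3 \]
%
which is higher than the 0.25 we would get by treating the two days as independent.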

\section{The cookie problem}
\label{cookie}

\index{Bayes's theorem}
\index{cookie problem}

We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
  \url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
  there.}

\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.
The cookie is vanilla.
What is the probability that it came from Bowl 1?
\end{quote}

This is a conditional probability; we want $\p{\T{Bowl 1} |
  \T{vanilla}}$, but it is not obvious how to compute it.  If I asked a
different question---the probability of a vanilla cookie given Bowl
1---it would be easy:
%
\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]
%
Sadly, $\p{A|B}$ is {\em not} the same as $\p{B|A}$, but there
is a way to get from one to the other: Bayes's theorem.

\section{Bayes's theorem}

\index{Bayes's theorem!derivation}
\index{conjunction}

Here's how we derive Bayes's theorem.
We'll start with the probability of a conjunction:
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
Since we have not said anything about what $A$ and $B$ mean, they
are interchangeable.
Interchanging them yields
%
\[ \p{B \AND A} = \p{B}~\p{A|B} \]
%
Also, conjunction is commutative; that is
%
\[ \p{A \AND B} = \p{B \AND A} \]
%
That's all we need.  Pulling those pieces together, we get
%
\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]
%
Which means there are two ways to compute the conjunction.
If you have $\p{A}$, you multiply by the conditional
probability $\p{B|A}$.
Or you can do it the other way around; if you
know $\p{B}$, you multiply by $\p{A|B}$.

Finally we divide through by $\p{B}$:
%
\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]
%
And that's Bayes's theorem!  It might not look like much, but
it turns out to be surprisingly powerful.

For example, we can use it to solve the cookie problem.  I'll write
$B_1$ for the hypothesis that the cookie came from Bowl 1
and $V$ for the vanilla cookie.  Plugging in Bayes's theorem
we get
%
\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]
%
The term on the left is what we want: the probability of Bowl 1, given
that we chose a vanilla cookie.  The terms on the right are:

\begin{itemize}

\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned by what kind of cookie we got.  Since the problem says we chose a bowl at random, we can assume $\p{B_1} = 1/2$.

\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.

\item $\p{V}$: This is the probability of drawing a vanilla cookie from
either bowl.  Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie.  Between the two bowls there are
50 vanilla and 30 chocolate cookies, so $\p{V} = 5/8$.

\end{itemize}

Putting it together, we have
%
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
%
which reduces to 3/5.  So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
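
The same arithmetic can be checked in Python.  This is only a sketch: it uses \py{Fraction} from the standard library so the result stays an exact fraction, and the variable names are chosen here to mirror the terms above.

\begin{code}
from fractions import Fraction

p_b1 = Fraction(1, 2)           # p(B_1): the bowl was chosen at random
p_v_given_b1 = Fraction(3, 4)   # p(V|B_1): 30 of the 40 cookies in Bowl 1
p_v = Fraction(5, 8)            # p(V): 50 of the 80 cookies overall

# Bayes's theorem: p(B_1|V) = p(B_1) p(V|B_1) / p(V)
p_b1_given_v = p_b1 * p_v_given_b1 / p_v
print(p_b1_given_v)   # 3/5
\end{code}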

\index{evidence}

This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from $\p{B|A}$ to $\p{A|B}$.  This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.

\section{The diachronic interpretation}

There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.

\index{diachronic interpretation}

This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}.  ``Diachronic'' means that something
is happening over time; in this case, the probabilities of the hypotheses change over time as we see new data.

Rewriting Bayes's theorem with $H$ and $D$ yields:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
In this interpretation, each term has a name:

\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}

\begin{itemize}

\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.

\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.

\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.

\item \p{D} is the {\bf total probability of the data}, under any hypothesis.

\end{itemize}

Sometimes we can compute the prior based on background information.  For example, the cookie problem specifies that we choose a bowl at random with equal probability.

In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.

\index{subjective prior}

The likelihood is usually the easiest part to compute.  In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.

Computing the total probability of the data can be tricky.  It is supposed to be the probability of seeing the data under any hypothesis at all, but in the most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses
that are:

\index{mutually exclusive}
\index{collectively exhaustive}

\begin{description}

\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and

\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.

\end{description}

In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.

\index{total probability}

In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
%
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier by mentally combining the two
bowls.
905

906
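
The arithmetic above can be checked with a short sketch in plain Python.  It uses exact fractions from the standard library; the variable names are chosen here for illustration and are not part of the book's code.

```python
from fractions import Fraction

# Priors and likelihoods from the cookie problem.
prior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
likelihood = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}

# Law of total probability: sum prior * likelihood over the hypotheses.
prob_data = sum(prior[h] * likelihood[h] for h in prior)

# Bayes's theorem for Bowl 1.
posterior_bowl1 = prior['Bowl 1'] * likelihood['Bowl 1'] / prob_data

print(prob_data)        # 5/8
print(posterior_bowl1)  # 3/5
```

Working with \py{Fraction} rather than floating-point numbers keeps the results exact, which makes them easy to compare with the hand calculation.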

\section{Bayes Tables}

In the cookie problem we can compute the probability of the data directly, but that's not always the case.  In fact, computing the total probability of the data is often the hardest part of the problem.

Fortunately, there is another way to solve problems like this that makes it easier: the Bayes table.

You can write a Bayes table on paper or use a spreadsheet, but for this example I'll use a Pandas DataFrame.

First I'll make an empty DataFrame with one row for each hypothesis:

\begin{code}
import pandas as pd
table = pd.DataFrame(index=['Bowl 1', 'Bowl 2'])
\end{code}

Then I'll add columns for the prior probabilities and likelihoods.

\begin{code}
table['prior'] = 1/2, 1/2
table['likelihood'] = 3/4, 1/2
\end{code}

This table shows the results so far:

\input{tables/table01-01}

If we multiply the priors by the likelihoods, the results are {\bf unnormalized posteriors}; they are proportional to the posterior probabilities, but they don't add up to 1.

We can normalize them by computing the total probability of the data and dividing through.

\begin{code}
table['unnorm'] = table['prior'] * table['likelihood']
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
\end{code}

The following table shows the result:

\input{tables/table01-02}

The posterior probability for Bowl 1 is 0.6, which is what we got using Bayes's Theorem.  As a bonus, we also get the posterior probability for Bowl 2, which is 0.4.

\section{The Dice Problem}
\label{dice}

Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?

In this example, there are three hypotheses with equal prior probabilities.
The data is my report that the outcome is a 1.
Under the hypothesis that I chose the 6-sided die, the probability of the data is 1/6.
If I chose the 8-sided die, the probability is 1/8, and if I chose the 12-sided die, it's 1/12.

Plugging the priors and likelihoods into a Bayes table, I get these results:

\input{tables/table01-03}

The posterior probability that I chose the 6-sided die is $4/9$.

As this example demonstrates, the table method works with more than two hypotheses.
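
The steps of a Bayes table can also be wrapped in a small function.  The following is a minimal sketch in plain Python, assuming parallel lists of priors and likelihoods; the name \py{bayes\_table} is hypothetical and does not come from Pandas or from this book's code.

```python
from fractions import Fraction

def bayes_table(priors, likelihoods):
    """Return (posteriors, prob_data) for parallel lists of priors and likelihoods."""
    # Unnormalized posteriors: prior times likelihood for each hypothesis.
    unnorm = [p * like for p, like in zip(priors, likelihoods)]
    # Total probability of the data is the normalizing constant.
    prob_data = sum(unnorm)
    posteriors = [u / prob_data for u in unnorm]
    return posteriors, prob_data

# Dice problem: three equally likely dice, and the outcome is a 1.
sides = [6, 8, 12]
priors = [Fraction(1, 3)] * 3
likelihoods = [Fraction(1, n) for n in sides]

posteriors, prob_data = bayes_table(priors, likelihoods)
for n, post in zip(sides, posteriors):
    print(n, post)
```

With exact fractions, the posterior for the 6-sided die comes out as $4/9$, in agreement with the table.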

\section{The Monty Hall Problem}

\index{Monty Hall problem}

Monty Hall was the original host of the game show {\em Let's Make a
Deal}.
The Monty Hall problem is based on one of the regular
games on the show.
If you are a contestant, here's how the game works:

\begin{itemize}

\item Monty shows you three closed doors numbered 1, 2, and 3.
He tells you that there is a prize behind each door.

\item One prize is valuable (traditionally a car), the other two are less valuable (traditionally goats).

\item The object of the game is to guess which door has the car.
If you guess right, you get to keep the car.

\end{itemize}

Suppose you pick Door 1.
Before opening the door you chose, Monty opens Door 3 and reveals a
goat.
Then Monty offers you the option to stick with your original
choice or switch to the remaining unopened door.

To maximize your chance of winning the car, should you stick with Door 1 or switch to Door 2?

To answer this question, we have to make some assumptions about the behavior of the host:

\begin{enumerate}

\item Monty always opens a door and offers you the option to switch.

\item He never opens the door you picked or the door with the car.

\item If you choose the door with the car, he chooses one of the other doors at random.

\end{enumerate}

Under these assumptions, you are better off switching.
If you stick, you win $1/3$ of the time.
If you switch, you win $2/3$ of the time.

If you have not encountered this problem before, you might find the answer surprising.
You would not be alone; many people have the strong intuition that it doesn't matter if you stick or switch.
There are two doors left, they reason, so the chance that the car
is behind Door 1 is 50\%.
But that is wrong.

To see why, it might help to use a Bayes table.
We start with three hypotheses: the car might be behind Door 1, 2, or 3.
According to the statement of the problem, the prior probability for each door is 1/3.

The data is that Monty opened Door 3 and revealed a goat.
So let's consider the probability of the data under each hypothesis:

\begin{itemize}

\item If the car were behind Door 3, Monty would not have opened it, so the probability of the data under this hypothesis is 0.

\item If the car were behind Door 2, Monty would have to open Door 3, so the probability of the data under this hypothesis is 1.

\item If the car were behind Door 1, Monty would choose Door 2 or 3 at random; the probability he would open Door 3 is $1/2$.

\end{itemize}

Once we figure out prior probabilities and likelihoods, the Bayes table does the rest.  Here is the result:

\input{tables/table01-04}

After Monty opens Door 3, the posterior probability of Door 1 is $1/3$; the posterior probability of Door 2 is $2/3$.
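
If the table is not convincing, a simulation might help.  The following sketch plays the game many times under the three assumptions about the host, keeps only the games where you pick Door 1 and Monty opens Door 3, and estimates how often the car is behind Door 2.  The function name, the number of trials, and the random seed are arbitrary choices made for this illustration.

```python
import random

def simulate_monty(trials=100_000, seed=1):
    """Estimate P(car behind Door 2 | you picked Door 1, Monty opened Door 3)."""
    rng = random.Random(seed)
    opened_3 = 0
    car_behind_2 = 0
    for _ in range(trials):
        car = rng.choice([1, 2, 3])
        if car == 1:
            # Assumption 3: Monty picks one of the other doors at random.
            opened = rng.choice([2, 3])
        else:
            # Assumption 2: he avoids your door (1) and the door with the car.
            opened = 2 if car == 3 else 3
        if opened == 3:
            # This game matches the data, so it counts toward the estimate.
            opened_3 += 1
            car_behind_2 += (car == 2)
    return car_behind_2 / opened_3

estimate = simulate_monty()
print(round(estimate, 2))
```

The estimate should land close to $2/3$, although the exact value depends on the seed and the number of trials.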

\index{divide-and-conquer}

As this example shows, our intuition for probability is not always reliable.
Bayes's Theorem provides a divide-and-conquer strategy that can help:

\begin{enumerate}

\item First, write down the hypotheses and the data.

\item Next, figure out the prior probabilities.

\item Finally, compute the likelihood of the data under each hypothesis.

\end{enumerate}

The Bayes table does the rest.

\section{Summary}

In this chapter...

In the next chapter

But first you might want to work on these exercises.

\section{Exercises}

The code for this chapter is in \py{chap01.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap01.ipynb}.

The notebook provides space where you can work on the following problems.

\begin{exercise}

Suppose you have two coins in a box.
One is a normal coin with heads on one side and tails on the other, and one is a trick coin with heads on both sides.

You choose a coin at random and see that one of the sides is heads.
What is the probability that you chose the trick coin?

\end{exercise}


\begin{exercise}

Suppose you meet someone and learn that they have two children.
You ask if either child is a girl and they say yes.
What is the probability that both children are girls?

Hint: Start with four equally likely hypotheses.

\end{exercise}


\begin{exercise}

There are many variations of the Monty Hall problem (see \url{https://en.wikipedia.org/wiki/Monty_Hall_problem}).

For example, suppose that Monty always chooses Door 2 if he can and
only chooses Door 3 if he has to (because the car is behind Door 2).

If you choose Door 1 and Monty opens Door 2, what is the probability the car is behind Door 3?

If you choose Door 1 and Monty opens Door 3, what is the probability the car is behind Door 2?

\end{exercise}

\newcommand{\MM}{M\&M}

\begin{exercise}

\MM's are small candy-coated chocolates that come in a variety of
colors.  Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}

In 1995, they introduced blue \MM's.  Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan.  Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.

Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996.  He won't tell me which is
which, but he gives me one \MM~from each bag.  One is yellow and
one is green.  What is the probability that the yellow one came
from the 1994 bag?

\end{exercise}

\chapter{Computational Statistics}
\label{compstat}

\section{Distributions}
\label{distributions}

In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}

For example, if you toss a coin, there are two possible outcomes with approximately equal probabilities.

If you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}

To represent distributions, we'll use a library called \py{empiricaldist}.
An ``empirical'' distribution is based on data, as opposed to a theoretical distribution.

This library provides a class called \py{Pmf}, which represents
a {\bf probability mass function}.

\index{probability mass function}
\index{Pmf class}

\py{empiricaldist} is available from the Python Package Index (PyPI).
You can download it from \url{https://pypi.org/project/empiricaldist/} or install it with \py{pip}.
For more details, see Section~\ref{codeinfo}.

To use \py{Pmf} you can import it like this:

\begin{code}
from empiricaldist import Pmf
\end{code}

The following example makes a \py{Pmf} that represents the outcome of a coin toss.

\begin{code}
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
\end{code}

The two outcomes have the same probability, $1/2$.

This example makes a \py{Pmf} that represents the distribution
of outcomes of a six-sided die:

\begin{code}
die = Pmf()
for x in [1,2,3,4,5,6]:
    die[x] = 1
\end{code}

\py{Pmf()} creates an empty \py{Pmf} with no values.
The \py{for} loop adds the values $1$ through $6$, each with ``probability'' $1$.

In this \py{Pmf}, the probabilities don't add up to 1, so they are not really probabilities.
We can use \py{normalize} to make them add up to 1.

\begin{code}
die.normalize()
\end{code}

Another way to make a \py{Pmf} is to provide a sequence of values.

\begin{code}
die = Pmf.from_seq([1,2,3,4,5,6])
\end{code}

In this example, every value appears once, so they all have the same probability.
More generally, values can appear more than once, as in this example:

\begin{code}
letters = Pmf.from_seq(list('Mississippi'))
\end{code}

The following table shows the results.

\input{tables/table02-01}

The \py{qs} are the values or ``quantities'' in the distribution; the \py{ps} are the corresponding probabilities.  In the word ``Mississippi'', about 36\% of the letters are ``s''.

The \py{Pmf} class inherits from a Pandas \py{Series}, so anything you can do with a \py{Series}, you can also do with a \py{Pmf}.

For example, you can use the bracket operator to look up a value and get the corresponding probability.

\begin{code}
letters['s']
\end{code}

However, if you ask for the probability of a value that's not in the distribution, you get a \py{KeyError}.

You can also call a \py{Pmf} as if it were a function, with a value in parentheses.

\begin{code}
letters('s')
\end{code}

If the value is in the distribution, the results are the same.
But if the value is not in the distribution, the result is $0$, not an error.
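
To see what a distribution built from a sequence contains, here is a rough analogy in plain Python.  It is not how \py{empiricaldist} is implemented; it is only a sketch that uses a dictionary in place of a \py{Pmf}, with a hypothetical helper called \py{make\_pmf}.

```python
from collections import Counter

def make_pmf(seq):
    """Map each distinct value in seq to its relative frequency."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {q: count / total for q, count in counts.items()}

letters = make_pmf('Mississippi')

print(letters['s'])         # bracket lookup; raises KeyError for a missing key
print(letters.get('t', 0))  # lookup with a default of 0 for a missing key
```

In this analogy, the bracket operator plays the role of \py{letters['s']} and \py{get} with a default of 0 plays the role of calling the \py{Pmf} like a function.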

As these examples show, the values in a \py{Pmf} can be integers or strings.
In general, they can be any type that can be stored in the index of a Pandas Series.

If you are familiar with Pandas, that will help you work with \py{Pmf} objects.
But I will explain what you need to know as we go along.

\section{The Cookie Problem}

In this section I'll use a \py{Pmf} to solve the cookie problem from Section~\ref{cookie}.
Here's the statement of the problem again:

\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.  The cookie is vanilla.  What is
the probability that it came from Bowl 1?
\end{quote}

Here's a \py{Pmf} that represents the two hypotheses and their prior probabilities:
\index{cookie problem}

\begin{code}
prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
\end{code}

This distribution, which contains the prior probability for each hypothesis,
is called (wait for it) the {\bf prior distribution}.
\index{prior distribution}

To update the distribution based on new data (the vanilla cookie),
we multiply the priors by the likelihoods.  The likelihood
of drawing a vanilla cookie from Bowl 1 is 3/4.  The likelihood
for Bowl 2 is 1/2.

\begin{code}
likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
\end{code}

The result is the unnormalized posteriors.
We can use \py{normalize} to compute the posterior probabilities:

\begin{code}
posterior.normalize()
\end{code}

The return value from \py{normalize} is the total probability of the data, which is $5/8$.

Finally, we can get the posterior probability for Bowl 1:

\begin{code}
posterior('Bowl 1')
\end{code}

And the answer is 0.6.
This distribution, which contains the posterior probability for each hypothesis, is called (wait now) the {\bf posterior distribution}.
\index{posterior distribution}

One benefit of using \py{Pmf} objects is that it is easy to do successive updates with more data.
For example, suppose you put the first cookie back (so the contents of the bowls don't change) and draw again from the same bowl.
If the second cookie is also vanilla, we can do a second update like this:

\begin{code}
posterior *= likelihood_vanilla
posterior.normalize()
\end{code}

Now the posterior probability for Bowl 1 is almost 70\%.
But suppose we do the same thing again and get a chocolate cookie.
Here's the update.

\begin{code}
likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
\end{code}

Now the posterior probability for Bowl 1 is about 53\%.
After two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.
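
The three updates above can be reproduced with exact fractions.  This is a sketch in plain Python that uses dictionaries instead of \py{Pmf} objects; the helper \py{update} is hypothetical and, unlike the in-place updates above, returns a new dictionary each time.

```python
from fractions import Fraction

def update(dist, likelihood):
    """Multiply each probability by its likelihood, then renormalize."""
    unnorm = {h: p * likelihood[h] for h, p in dist.items()}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}

vanilla = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}
chocolate = {'Bowl 1': Fraction(1, 4), 'Bowl 2': Fraction(1, 2)}

# Start from the uniform prior and record P(Bowl 1) after each cookie.
dist = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
history = []
for likelihood in [vanilla, vanilla, chocolate]:
    dist = update(dist, likelihood)
    history.append(dist['Bowl 1'])

print(history)
```

The recorded values are $3/5$, $9/13$, and $9/17$, or about 0.6, 0.69, and 0.53, which match the results in the text.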

\section{More Bowls}
\label{morebowls}

Next let's solve a cookie problem with 101 bowls:

\begin{itemize}

\item Bowl 0 contains no vanilla cookies,

\item Bowl 1 contains 1\% vanilla cookies,

\item Bowl 2 contains 2\% vanilla cookies,

\end{itemize}

and so on, up to

\begin{itemize}

\item Bowl 99 contains 99\% vanilla cookies, and

\item Bowl 100 contains all vanilla cookies.

\end{itemize}

As in the previous version, there are only two kinds of cookies, vanilla and chocolate.  So Bowl 0 is all chocolate cookies, Bowl 1 is 99\% chocolate, and so on.

\begin{figure}
% chap02soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig02-01.pdf}}
\caption{Prior and posterior distributions for the 101 Bowls problem.}
\label{fig02-01}
\end{figure}

Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla.  What is the probability that the cookie came from Bowl \py{x}, for each value of \py{x}?

To solve this problem, I'll use \py{np.arange} to represent 101 hypotheses, numbered from 0 to 100.

\begin{code}
import numpy as np
hypos = np.arange(101)
\end{code}

The result is a NumPy array, which we can use to make the prior distribution:

\begin{code}
prior = Pmf(1, hypos)
prior.normalize()
\end{code}

As this example shows, we can initialize a \py{Pmf} with two parameters.
The first parameter is the prior probability; the second parameter is a sequence of values.
Because the probabilities are all the same, we only have to provide one of them.
It gets ``broadcast'' across the hypotheses.

Since all hypotheses have the same prior probability, this distribution is {\bf uniform}.

The likelihood of the data is the fraction of vanilla cookies in each bowl, which we can calculate using \py{hypos}:

\begin{code}
likelihood_vanilla = hypos/100
\end{code}

Now we can compute the posterior distribution in the usual way:

\begin{code}
posterior1 = prior * likelihood_vanilla
posterior1.normalize()
\end{code}

Figure~\ref{fig02-01} (top) shows the prior distribution and the posterior distribution after one vanilla cookie.
Bowl 0 has been eliminated, because it contains no vanilla cookies, and Bowl 100 is the most likely.
The posterior distribution is a line because the likelihoods are proportional to the bowl numbers.

Now suppose we put the cookie back, draw again from the same bowl, and get another vanilla cookie.
Here's the update after the second cookie:

\begin{code}
posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()
\end{code}

Figure~\ref{fig02-01} (middle) shows the result.
Because the likelihood function is a line, the posterior after two cookies is a parabola.

At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.

But suppose we draw again and get a chocolate cookie.
Here's the update:

\begin{code}
likelihood_chocolate = 1 - hypos/100
posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()
\end{code}

Figure~\ref{fig02-01} (bottom) shows the result.
Now Bowl 100 has been eliminated because it contains no chocolate cookies.
But the high-numbered bowls are still more likely than the low-numbered bowls, because we have seen more vanilla cookies than chocolate.

In fact, the peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.

The quantity with the highest posterior probability is called the {\bf MAP}, which stands for ``maximum a posteriori probability'', where ``a posteriori'' is unnecessary Latin for ``posterior''.

To compute the MAP, we can use the \py{Series} method \py{idxmax}:

\begin{code}
posterior3.idxmax()
\end{code}

Or \py{Pmf} provides a more memorable name for the same thing:

\begin{code}
posterior3.max_prob()
\end{code}
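
As a cross-check that does not depend on \py{Pmf}, the following sketch recomputes the posterior after two vanilla cookies and one chocolate cookie using plain Python lists, then finds the bowl with the highest posterior probability.  The variable names are chosen here for illustration.

```python
# Hypotheses: Bowl x holds x% vanilla cookies, for x = 0..100.
hypos = range(101)

# Unnormalized posterior after vanilla, vanilla, chocolate with a uniform prior.
# The constant prior drops out when we normalize.
unnorm = [(x / 100) ** 2 * (1 - x / 100) for x in hypos]

total = sum(unnorm)
posterior = [u / total for u in unnorm]

# The MAP is the hypothesis with the largest posterior probability.
map_bowl = max(hypos, key=lambda x: posterior[x])
print(map_bowl)  # 67
```

Because the three updates are just multiplications, their order does not matter; the unnormalized posterior is proportional to $x^2 (1-x)$, where $x$ is the fraction of vanilla cookies.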

As you might suspect, this example isn't really about bowls; it's about estimating proportions.
Imagine that you have one bowl of cookies.
You don't know what fraction of cookies are vanilla, but you think it is equally likely to be any fraction from 0 to 1.
If you draw three cookies and two are vanilla, what proportion of cookies in the bowl do you think are vanilla?
The posterior distribution we just computed is the answer to that question.

We'll come back to estimating proportions in the next chapter.
But first let's use a \py{Pmf} to solve the dice problem.

\section{The Dice Problem}

In Section~\ref{dice} we solved the dice problem using a Bayes table.
Here's the statement of the problem again:

\begin{quote}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?
\end{quote}

Let's solve it again using a \py{Pmf}.
I'll use integers to represent the hypotheses:

\begin{code}
hypos = [6, 8, 12]
\end{code}

And I can make the prior distribution like this:

\begin{code}
prior = Pmf(1/3, hypos)
\end{code}

As in the previous example, the prior probability gets broadcast across the hypotheses.

Now we can compute the likelihood of the data:

\begin{code}
likelihood1 = 1/6, 1/8, 1/12
\end{code}

And use it to compute the posterior distribution.

\begin{code}
posterior = prior * likelihood1
posterior.normalize()
\end{code}

Here's the result:

\input{tables/table02-02}

The posterior probability for the 6-sided die is $4/9$.

Now suppose I roll the same die again and get a $7$.
We can do a second update like this:

\begin{code}
likelihood2 = 0, 1/8, 1/12
posterior *= likelihood2
posterior.normalize()
\end{code}

The likelihood for the 6-sided die is $0$ because it is not possible to get a 7 on a 6-sided die.
The other two likelihoods are the same as in the previous update.
And here's the result:

\input{tables/table02-03}

After rolling a 1 and a 7, the posterior probability of the 8-sided die is about 69\%.

\section{Updating Dice}
\label{dice2}

The following function is a more general version of the update in the previous section:

\begin{code}
def update_dice(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
\end{code}

The first parameter is a \py{Pmf} that represents the possible dice and their probabilities.
The second parameter is the outcome of rolling a die.

The first line selects \py{qs} from the \py{Pmf}, which is the index of the \py{Series}; in this example, it represents the hypotheses.

Since the hypotheses are integers, we can use them to compute the likelihoods.
In general, if there are \py{n} sides on the die, the probability of any possible outcome is \py{1/n}.

However, we have to check for impossible outcomes!
If the outcome exceeds the hypothetical number of sides on the die, the probability of that outcome is $0$.

\py{impossible} is a Boolean NumPy array that is \py{True} for each die that could not have produced the outcome.
I use it as an index into \py{likelihood} to set the corresponding probabilities to $0$.

Finally, I multiply \py{pmf} by the likelihoods and normalize.

Here's how we can use this function to compute the updates in the previous section:

\begin{code}
pmf = prior.copy()
update_dice(pmf, 1)
update_dice(pmf, 7)
\end{code}

I start with a fresh copy of the prior distribution and use \py{update_dice} to do the updates.
The result is the same.
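
For comparison, here is a sketch of the same logic in plain Python, using a dictionary and exact fractions in place of a \py{Pmf}.  The function \py{update\_dice\_dict} is hypothetical; unlike \py{update\_dice}, it returns a new distribution instead of modifying its argument.

```python
from fractions import Fraction

def update_dice_dict(dist, outcome):
    """Return a new distribution over dice after observing one roll."""
    unnorm = {}
    for sides, prob in dist.items():
        # An n-sided die shows each face 1..n with probability 1/n;
        # outcomes larger than n are impossible.
        likelihood = Fraction(1, sides) if outcome <= sides else Fraction(0)
        unnorm[sides] = prob * likelihood
    total = sum(unnorm.values())
    return {sides: u / total for sides, u in unnorm.items()}

# Uniform prior over the three dice, then the rolls 1 and 7.
dist = {n: Fraction(1, 3) for n in [6, 8, 12]}
for outcome in [1, 7]:
    dist = update_dice_dict(dist, outcome)

print(dist)
```

After both rolls, the 8-sided die has probability $9/13$, or about 69\%.  If an outcome is impossible under every hypothesis, \py{total} is zero and the division fails; this sketch does not handle that case.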

\section{Summary}

This chapter introduces the \py{empiricaldist} module, which provides \py{Pmf}, which we use to represent a set of hypotheses and their probabilities.

We use a \py{Pmf} to solve the cookie problem and the dice problem, which we saw in the previous chapter.
With a \py{Pmf} it is easy to perform sequential updates as we see multiple pieces of data.

We also solved a more general version of the cookie problem, with 101 bowls rather than two.
Then we computed the MAP, which is the quantity with the highest posterior probability.

In the next chapter ...

But first you might want to work on the exercises.

\section{Exercises}
\label{elvis}

The code for this chapter is in \py{chap02.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap02.ipynb}.

The notebook provides space where you can work on the following problems.


\begin{exercise}
%TODO: medical test (or maybe chapter 1)
\end{exercise}


\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7.
What is the probability that I chose the 8-sided die?
\end{exercise}


\begin{exercise}
In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die.
But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided.
I choose a die, roll it, and get a 7.  What is the probability that I chose an 8-sided die?
\end{exercise}


\begin{exercise}
Suppose I have two sock drawers.
One contains equal numbers of black and white socks.
The other contains equal numbers of red, green, and blue socks.
Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair.
What is the probability that the socks are white?

For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.
\end{exercise}


\begin{exercise}
Here's a problem from {\it Bayesian Data Analysis}, which is available from \url{http://www.stat.columbia.edu/~gelman/book}:

\begin{quote}
Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?
\end{quote}

Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
\end{exercise}

\chapter{Estimation}
\label{more}

%TODO: Intro


\section{The Euro Problem}
\label{euro}

\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay poses this problem:

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

  \begin{quote}
        When spun on edge 250 times, a Belgian one-euro coin came
        up heads 140 times and tails 110.  `It looks very suspicious
        to me,' said Barry Blight, a statistics lecturer at the London
        School of Economics.  `If the coin were unbiased, the chance of
        getting a result as extreme as that would be less than 7\%.'
        \end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

To answer that question, we'll proceed in two steps.
First we'll use the binomial distribution to see where that 7\% came from; then we'll use Bayes's Theorem to estimate the probability that this coin comes up heads.

\section{The Binomial Distribution}
\label{binomial}

Suppose we have a coin that we know is fair; if we spin it once, the possible outcomes are heads and tails with equal probability.
I'll denote these outcomes \py{H} and \py{T}.

If you spin it twice, there are four outcomes with equal probability: \py{HH}, \py{HT}, \py{TH}, and \py{TT}.

If we add up the total number of heads, there are three possible outcomes: 0, 1, or 2.  The probability of 0 and 2 is 25\%, and the probability of 1 is 50\%.

More generally, suppose the probability of heads is \py{p} and we spin the coin \py{n} times.  What is the probability that we get a total of \py{k} heads?

The answer is given by the binomial distribution:
%
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
%
where $\binom{n}{k}$ is the {\bf binomial coefficient}, usually pronounced ``n choose k'' (see \url{https://en.wikipedia.org/wiki/Binomial_coefficient}).
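
The formula translates directly into a few lines of Python.  The following sketch uses \py{math.comb} from the standard library (available in Python 3.8 and later) for the binomial coefficient; the function name \py{binomial\_pmf} is chosen here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n spins when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two spins of a fair coin: probabilities of 0, 1, and 2 heads.
n, p = 2, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(probs)  # [0.25, 0.5, 0.25]
```

The result agrees with the probabilities we found by counting the outcomes \py{HH}, \py{HT}, \py{TH}, and \py{TT}.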

We can compute this expression ourselves, but we can also use the SciPy function \py{binom.pmf}:

\begin{code}
import numpy as np
from scipy.stats import binom

n = 2
p = 0.5
ks = np.arange(n+1)
a = binom.pmf(ks, n, p)
\end{code}

The return value is a NumPy array.
If we put it in a \py{Pmf}, the result is the distribution of \py{k} for the given values of \py{n} and \py{p}.

\begin{code}
pmf_k = Pmf(a, ks)
\end{code}

Here's what it looks like:

\input{tables/table02-01}

We can do the same calculation with \py{n=250}; Figure~\ref{fig03-01} shows the result.

\begin{figure}
% chap03soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig03-01.pdf}}
\caption{Binomial distribution with \py{n=250} and \py{p=0.5}}
\label{fig03-01}
\end{figure}

The most likely outcome is 125, which is \py{n*p}.
But the probability of getting exactly 125 heads is only about 5\%.
The probability of getting 140 heads, as in the Euro problem, is lower, around 0.8\%, but it is still possible even if the coin is fair.

1708
In the article MacKay quotes, the statistician says, ``If the coin were unbiased the chance of getting a result as extreme as that would be less than 7\%''.
1709

1710
We can use the binomial distribution to check his math.  The following function takes a PMF and computes the total probability of values greater than or equal to \py{threshold}.
1711

1712
\begin{code}
1713
def ge_dist(pmf, threshold):
1714
    ge = (pmf.index >= threshold)
1715
    total = pmf[ge].sum()
1716
    return total
1717
\end{code}
1718

1719
We can call it like this:

\begin{code}
ge_dist(pmf_k, 140)
\end{code}

\py{Pmf} also provides a method that computes the same thing:

\begin{code}
pmf_k.ge_dist(140)
\end{code}

Either way, the probability is about 3.3\% that we get 140 heads or more.
But that's only about half of the 7\% the statistician quotes.
1733

1734
The reason is that the statistician includes all values ``as extreme as'' 140, which includes values less than or equal to 110, because 140 exceeds the expected value by 15 and 110 falls short by 15.
1735

1736
The probability of values less than or equal to 110 is also 3.3\%,
1737
so the total probability of values ``as extreme'' as 140 is 6.6\%.
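
The two tails can be checked with a short calculation.
The following sketch uses only the standard library and a hypothetical helper, \py{binomial_pmf}, that evaluates the binomial formula directly; it assumes a fair coin and \py{n=250}.

\begin{code}
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n spins when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 250, 0.5

# 140 heads or more
upper = sum(binomial_pmf(k, n, p) for k in range(140, n + 1))
# 110 heads or fewer
lower = sum(binomial_pmf(k, n, p) for k in range(0, 111))

two_sided = upper + lower
print(upper, lower, two_sided)
\end{code}

Because the distribution is symmetric when \py{p=0.5}, the two tails are equal, and their sum is the figure the statistician describes as ``less than 7\%''.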
1738

1739
The point of this calculation is that these extreme values are unlikely if the coin is fair.
1740
And that's why the statistician concludes that the results are ``very suspicious''.
1741

1742
That's interesting, but it doesn't answer MacKay's question.  So let's move on to the next step, estimating the proportion of heads.
1743

1744

1745
\section{Estimating Proportions}
1746
\label{estprop}
1747

1748
Any given coin has some probability of landing heads up when spun
1749
on edge; I'll call this probability \py{x}.
1750

1751
It seems reasonable to believe that \py{x} depends
1752
on physical characteristics of the coin, like the distribution
1753
of weight.
1754

1755
If a coin is perfectly balanced, we expect \py{x} to be close to 50\%, but
1756
for a lopsided coin, \py{x} might be substantially different.  We can use
1757
Bayes's theorem and the observed data to estimate \py{x}.
1758

1759
For simplicity, I'll start with a uniform prior, which assumes that all values of \py{x} are equally likely.
1760
That might not be a reasonable assumption, so we'll come back and consider other priors later.
1761

1762
Here's the uniform prior:

\begin{code}
hypos = np.arange(0, 101)
prior = Pmf(1, hypos)
\end{code}

And here are the likelihoods:

\begin{code}
likelihood = {
    'H': hypos/100,
    'T': 1 - hypos/100
}
\end{code}
1777

1778
I put the likelihoods for heads and tails in a dictionary to make it easier to do the update.
1779

1780
To represent the data, I'll use a string where each element is \py{H} or \py{T}:

\begin{code}
dataset = 'H' * 140 + 'T' * 110
\end{code}
1785

1786
The following function does the update.
1787

1788
\begin{code}
def update_euro(pmf, dataset):
    for data in dataset:
        pmf *= likelihood[data]

    pmf.normalize()
\end{code}

The first argument is a \py{Pmf} that represents the prior.
The second argument is a sequence of outcomes, each \py{H} or \py{T}; here it is a string.
Each time through the loop, we multiply \py{pmf} by the likelihood of one outcome, heads or tails.

Notice that \py{normalize} is outside the loop, so the posterior distribution only gets normalized once, at the end.
1801
That's more efficient than normalizing it after each spin (although we'll see later that it can also cause problems with floating-point arithmetic).
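
To see the kind of floating-point problem that can arise, here is a minimal sketch in plain NumPy.
It assumes a made-up dataset ten times larger than the Euro data (1400 heads and 1100 tails), and it uses log-likelihoods as one common workaround; this is an illustration, not necessarily the approach taken later in the book.

\begin{code}
import numpy as np

x = np.arange(0, 101) / 100        # hypothetical values of x
heads, tails = 1400, 1100          # made-up data, 10 times the Euro dataset

# product of 2500 per-spin likelihoods: every entry underflows to zero
naive = x**heads * (1 - x)**tails

# same computation in log space; log(0) is -inf, which is harmless here
with np.errstate(divide='ignore'):
    log_like = heads * np.log(x) + tails * np.log(1 - x)

# subtract the maximum before exponentiating so the largest term is 1
posterior = np.exp(log_like - log_like.max())
posterior /= posterior.sum()

print(naive.sum(), x[posterior.argmax()])
\end{code}

The unnormalized product sums to zero, so it cannot be normalized, while the log-space version still yields a usable posterior.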
1802
%TODO:  add forward reference
1803

1804
Here's how we do the update:
1805

1806
\begin{code}
posterior = prior.copy()
update_euro(posterior, dataset)
\end{code}
1810

1811
Figure~\ref{fig03-02} shows the posterior distribution of \py{x}.
1812

1813
\begin{figure}
1814
% chap03soln.ipynb
1815
\centerline{\includegraphics[width=4in]{figs/fig03-02.pdf}}
1816
\caption{Posterior distribution of \py{x} after 140 heads in 250 spins.}
1817
\label{fig03-02}
1818
\end{figure}
1819

1820
Now, it's easy to get this distribution mixed up with the previous one, but remember:

\begin{itemize}

\item Figure~\ref{fig03-01} shows the distribution of \py{k}, which is the number of heads we get with \py{n=250} and \py{p=0.5}.

\item Figure~\ref{fig03-02} shows the posterior distribution of \py{x}, which is the proportion of heads for the coin we observed.

\end{itemize}

The posterior distribution represents our beliefs about \py{x} after seeing the data.
It indicates that values less than 40\% and greater than 80\% are unlikely; values between 50\% and 60\% are the most likely.

In fact, the most likely value for \py{x} is 56\%, which is the proportion of heads in the dataset, \py{140/250}.
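
This can be checked without \py{Pmf}.
The sketch below repeats the update with plain NumPy arrays, assuming the same grid of 101 hypotheses and the same uniform prior; the variable names are chosen for this illustration.

\begin{code}
import numpy as np

hypos = np.arange(0, 101)          # hypothetical values of x, in percent
x = hypos / 100

posterior = np.ones(len(hypos))    # uniform prior (unnormalized)
for outcome in 'H' * 140 + 'T' * 110:
    # multiply by the likelihood of one spin under every hypothesis
    posterior *= x if outcome == 'H' else 1 - x
posterior /= posterior.sum()       # normalize once, at the end

most_likely = hypos[posterior.argmax()]
print(most_likely, 100 * 140 / 250)
\end{code}

With a uniform prior, the peak of the posterior coincides with the observed proportion because the likelihood $x^{140} (1-x)^{110}$ is maximized at $x = 140/250$.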
1834

1835

1836
\section{Triangle Prior}
1837
\label{triangle}
1838

1839
So far we've been using a uniform prior, but that might not be a reasonable choice based on what we know about coins.
1840
I can believe that if a coin is lopsided, \py{x} might deviate substantially from 50\%, but it seems unlikely that the Belgian Euro coin is so imbalanced that \py{x} is 10\% or 90\%.
1841

1842
It might be more reasonable to choose a prior that gives
1843
higher probability to values of \py{x} near 50\% and lower probability
1844
to extreme values.
1845

1846
\index{triangle distribution}
1847

1848
As an example, let's try a triangle-shaped prior.
Here's the code that constructs it:

\begin{code}
ramp_up = np.arange(50)
ramp_down = np.arange(50, -1, -1)
a = np.append(ramp_up, ramp_down)

triangle = Pmf(a, hypos, name='triangle')
triangle.normalize()
\end{code}
1859

1860
\py{arange} returns a NumPy array, so we can use \py{np.append} to append \py{ramp_down} to the end of \py{ramp_up}.
1861
Then we use \py{a} and \py{hypos} to make a \py{Pmf}.
1862

1863
Figure~\ref{fig03-03} shows the result, along with the uniform distribution.
1864

1865
\begin{figure}
1866
% chap03soln.ipynb
1867
\centerline{\includegraphics[width=4in]{figs/fig03-03.pdf}}
1868
\caption{Uniform and triangle-shaped prior distributions.}
1869
\label{fig03-03}
1870
\end{figure}
1871

1872
Now we can update both priors with the same data:

\begin{code}
uniform = Pmf(1, hypos, name='uniform')
update_euro(uniform, dataset)
update_euro(triangle, dataset)
\end{code}
1878

1879
Figure~\ref{fig03-04} shows the posterior distributions.
1880

1881
\begin{figure}
1882
% chap03soln.ipynb
1883
\centerline{\includegraphics[width=4in]{figs/fig03-04.pdf}}
1884
\caption{Posterior distributions based on uniform and triangle priors.}
1885
\label{fig03-04}
1886
\end{figure}
1887

1888
The differences between the posterior distributions are barely visible, and so small they would hardly matter in practice.
1889

1890
And that's good news.
1891
To see why, imagine two people who disagree angrily about which prior is better, uniform or triangle.
1892
Each of them has reasons for their preference, but neither of them can persuade the other to change their mind.
1893

1894
But suppose they agree to use the data to update their beliefs.
1895
When they compare their posterior distributions, they find that there is almost nothing left to argue about.
1896

1897
This is an example of {\bf swamping the priors}: with enough
1898
data, people who start with different priors will tend to
1899
converge on the same posterior distribution.
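
To put a number on ``almost nothing left to argue about'', here is a minimal sketch that rebuilds both posteriors with plain NumPy arrays and compares them.
The helper \py{posterior_from} and the variable names are assumptions made for this illustration.

\begin{code}
import numpy as np

hypos = np.arange(0, 101)
x = hypos / 100

# the two priors from this chapter, as plain arrays
uniform = np.ones(101)
triangle = np.append(np.arange(50), np.arange(50, -1, -1)).astype(float)

# likelihood of 140 heads and 110 tails under each hypothesis
likelihood = x**140 * (1 - x)**110

def posterior_from(prior):
    """Multiply a prior by the likelihood and normalize."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

post_uniform = posterior_from(uniform)
post_triangle = posterior_from(triangle)

# largest gap between the two posteriors, and their means in percent
max_gap = np.abs(post_uniform - post_triangle).max()
mean_uniform = (hypos * post_uniform).sum()
mean_triangle = (hypos * post_triangle).sum()
print(max_gap, mean_uniform, mean_triangle)
\end{code}

The two posterior means should differ by well under one percentage point, and both posteriors peak at the same value.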
1900

1901
\index{swamping the priors}
1902
\index{convergence}
1903

1904

1905
\section{Binomial Likelihood}
1906
\label{binomlike}
1907

1908
So far we've been computing the updates one spin at a time, so for the Euro problem we have to do 250 updates.
1909

1910
A more efficient alternative is to compute the likelihood of the entire dataset at once.
1911
For each hypothetical value of \py{x}, we have to compute the probability of getting 140 heads out of 250 spins.
1912

1913
Well, we know how to do that; this is the question the binomial distribution answers.
1914
If the probability of heads is $p$, the probability of $k$ heads in $n$ spins is:
%
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
%
And we can use SciPy to compute it.
The following function takes a \py{Pmf} that represents a prior distribution and a tuple of integers, \py{k} and \py{n}:

\begin{code}
from scipy.stats import binom

def update_binomial(pmf, data):
    k, n = data
    # the quantities are percentages (0 to 100); binom.pmf needs probabilities
    xs = pmf.qs / 100
    likelihood = binom.pmf(k, n, xs)
    pmf *= likelihood
    pmf.normalize()
\end{code}

It extracts the hypothetical values of \py{x} from the \py{Pmf}, converts them from percentages to probabilities, and passes them to \py{binom.pmf}, which computes the binomial PMF for the given values of \py{k} and \py{n}, and all values of \py{x}.
1933

1934
Here's how we use it:
1935

1936
\begin{code}
uniform2 = Pmf(1, hypos)
data = 140, 250
update_binomial(uniform2, data)
\end{code}
1941

1942
The result is the same as in Section~\ref{estprop} except for a small floating-point round-off.
1943
But it's much more efficient.
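
One way to see why the two approaches agree: the product of the 250 per-spin likelihoods is $x^{140} (1-x)^{110}$, and the binomial likelihood is the same expression multiplied by $\binom{250}{140}$, which does not depend on \py{x} and so cancels when we normalize.
A minimal NumPy sketch, with names chosen for this illustration:

\begin{code}
from math import comb
import numpy as np

k, n = 140, 250
x = np.arange(0, 101) / 100

# product of the per-spin likelihoods, written as a single expression
kernel = x**k * (1 - x)**(n - k)
# binomial likelihood: the same kernel scaled by a constant
binomial = float(comb(n, k)) * kernel

post_spins = kernel / kernel.sum()
post_binom = binomial / binomial.sum()
print(np.allclose(post_spins, post_binom))
\end{code}

Any factor that is the same for every hypothesis can be dropped from the likelihood without changing the posterior.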
1944

1945

1946
\section{Bayesian Statistics}
1947

1948
You might have noticed similarities between the Euro problem and the 101 bowls problem in Section~\ref{morebowls}.
1949
The prior distributions are the same, the likelihoods are the same, and with the same data the results would be the same.
1950

1951
But there are two differences.
1952

1953
The first is the choice of the prior.
1954
In the 101 bowls problem, the uniform prior is implied by the statement of the problem, which says that we choose one of the bowls at random with equal probability.
1955

1956
In the Euro problem, the choice of the prior is subjective; that is, reasonable people could disagree, maybe because they have different information about coins or because they interpret the same information differently.
1957

1958
Because the priors are subjective, the posteriors are subjective, too.
1959
And some people find that problematic.
1960

1961
The other difference is the nature of what we are estimating.
1962
In the 101 bowls problem, we choose the bowl randomly, so it is uncontroversial to compute the probability of choosing each bowl.
1963
In the Euro problem, the proportion of heads is a physical property of a given coin.
1964
Under some interpretations of probability, that's a problem because physical properties are not considered random.
1965

1966
As an example, consider the age of the universe.
1967
Currently, our best estimate is 13.80 billion years, but it might be off by 0.02 billion years in either direction (see \url{https://en.wikipedia.org/wiki/Age_of_the_universe}).
1968

1969
Now suppose we would like to know the probability that the age of the universe is actually greater than 13.81 billion years.
1970
Under some interpretations of probability, we would not be able to answer that question.
1971
We would be required to say something like, ``The age of the universe is not a random quantity, so it has no probability of exceeding a particular value.''
1972

1973
Under the Bayesian interpretation of probability, it is meaningful and useful to treat physical quantities as if they were random and compute probabilities about them.
1974

1975
In the Euro problem, the prior distribution represents what we believe about coins in general and the posterior distribution represents what we believe about a particular coin after seeing the data.
1976
So we can use the posterior distribution to compute probabilities about the coin and its proportion of heads.
1977

1978
The subjectivity of the prior and the interpretation of the posterior are key differences between Bayes's Theorem and Bayesian statistics.
1979

1980
Bayes's Theorem is a mathematical law of probability; no reasonable person objects to it.
1981
But Bayesian statistics is surprisingly controversial.
1982
Historically, many people have been bothered by its subjectivity and its use of probability for things that are not random.
1983

1984
If you are interested in this history, I recommend Sharon Bertsch McGrayne's book, {\it The Theory That Would Not Die} (\url{https://yalebooks.yale.edu/book/9780300188226/theory-would-not-die}).
1985

1986
\index{McGrayne, Sharon Bertsch}
1987
\index{The Theory That Would Not Die}
1988

1989
%TODO: Italicize the index entry
1990

1991
\section{Summary}
1992

1993
In this chapter I posed David MacKay's Euro problem and we started to solve it.
1994
Given the data, we computed the posterior distribution for \py{x}, the probability a Euro coin comes up heads.
1995

1996
We tried two different priors, updated them with the same data, and found that the posteriors were nearly the same.
1997
This is good news, because it suggests that if two people start with different beliefs and see the same data, their beliefs tend to converge.
1998

1999
This chapter introduces the binomial distribution, which we used to compute the posterior distribution more efficiently.
2000
And I discussed the difference between applying Bayes's Theorem, as in the 101 bowls problem, and computing Bayesian statistics, as in the Euro problem.
2001

2002
\index{convergence}
2003

2004
However, we still haven't answered MacKay's question: ``Do these data give evidence that the coin is biased rather than fair?''
2005
I'm going to leave this question hanging a little longer; we'll come back to it in Chapter~\ref{hypotest}.
2006

2007
In the next chapter, I want to get back to the dice problem.
2008

2009
\section{Exercises}
2010

2011
The code for this chapter is in \py{chap03.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
2012
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap03.ipynb}.
2013

2014
The notebook provides space where you can work on the following problems.
2015

2016

2017
\begin{exercise}
2018
In Major League Baseball, most players have a batting average between 200 and 330, which means that the probability of getting a hit is between 0.2 and 0.33.
2019

2020
Suppose a new player appearing in his first game gets 3 hits out of 3 attempts.  What is the posterior distribution for his probability of getting a hit?
2021
\end{exercise}
2022

2023

2024
\begin{exercise}
2025
Whenever you survey people about sensitive issues, you have to deal with ``social desirability bias'', which is the tendency of people to shade their answers to show themselves in the most positive light (see \url{https://en.wikipedia.org/wiki/Social_desirability_bias}).
2026

2027
One of the ways to improve the accuracy of the results is ``randomized response'' (see \url{https://en.wikipedia.org/wiki/Randomized_response}).
2028

2029
As an example, suppose you ask 100 people to flip a coin and:
2030

2031
\begin{itemize}
2032

2033
\item If they get heads, they report YES.
2034

2035
\item If they get tails, they honestly answer the question ``Do you cheat on your taxes?''
2036

2037
\end{itemize}
2038

2039
And suppose you get 80 YESes and 20 NOs.  Based on this data, what is the posterior distribution for the fraction of people who cheat on their taxes?  What is the most likely value in the posterior distribution?
2040
\end{exercise}
2041

2042

2043
\begin{exercise}
2044
Suppose that instead of observing coin spins directly, you measure the outcome using an instrument that is not always correct.  Specifically, suppose the probability is \py{y=0.2} that an actual heads is reported
2045
as tails, or actual tails reported as heads.
2046

2047
If we spin a coin 250 times and the instrument reports 140 heads, what is the posterior distribution of \py{x}?
2048

2049
What happens as you vary the value of \py{y}?
2050
\end{exercise}
2051

2052

2053
\begin{exercise}
2054
In preparation for an alien invasion, the Earth Defense League (EDL) has been working on new missiles to shoot down space invaders.  Of course, some missile designs are better than others; let's assume that each design has some probability of hitting an alien ship, \py{x}.
2055

2056
Based on previous tests, the distribution of \py{x} in the population of designs is approximately uniform between 0.1 and 0.4.
2057

2058
Now suppose the new ultra-secret Alien Blaster 9000 is being tested.  In a press conference, an EDL general reports that the new design has been tested twice, taking two shots during each test.  The results of the test are confidential, so the general won't say how many targets were hit, but they report: ``The same number of targets were hit in the two tests, so we have reason to think this new design is consistent.''
2059

2060
Is this data good or bad; that is, does it increase or decrease your estimate of \py{x} for the Alien Blaster 9000?
2061

2062
Hint: If the probability of hitting each target is $x$, the probability of hitting one target in both tests is $[2x(1-x)]^2$.
2063
\end{exercise}
2064

2065

2066
\chapter{More Estimation}
2067
\label{estimation}
2068

2069
\section{The train problem}
2070

2071
\index{train problem}
2072
\index{Mosteller, Frederick}
2073
\index{German tank problem}
2074

2075
I found the train problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
  Probability with Solutions} (\url{https://store.doverpublications.com/0486653552.html}):
2078

2079
\begin{quote}
2080
``A railroad numbers its locomotives in order $1..N$.  One day you see a
2081
locomotive with the number 60.  Estimate how many locomotives the
2082
railroad has.''
2083
\end{quote}
2084

2085
Based on this observation, we know the railroad has 60 or more
2086
locomotives.  But how many more?  To apply Bayesian reasoning, we
2087
can break this problem into two steps:
2088

2089
\begin{enumerate}
2090

2091
\item What did we know about $N$ before we saw the data?
2092

2093
\item For any given value of $N$, what is the likelihood of
2094
seeing the data (a locomotive with number 60)?
2095

2096
\end{enumerate}
2097

2098
The answer to the first question is the prior.  The answer to the
2099
second is the likelihood.
2100

2101
\begin{figure}
2102
% train.py
2103
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
2104
\caption{Posterior distribution for the locomotive problem, based
2105
on a uniform prior.}
2106
\label{fig.train1}
2107
\end{figure}
2108

2109
We don't have much basis to choose a prior, so we'll start with
2110
something simple and then consider alternatives.
2111
Let's assume that $N$ is equally likely to be any value from 1 to 1000.
2112

2113
\begin{code}
hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)
\end{code}
2117

2118
Now let's figure out the likelihood of the data.
2119
In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60?
2120
If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.
2121

2122
Here's the function that does the update:
2123

2124
\begin{code}
def update_train(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
\end{code}
2133

2134
The first parameter is a \py{Pmf} that represents the possible values of $N$ and their probabilities.
2135
The second parameter is the number of the train we observed.
2136

2137
This function might look familiar; it is the same as the update function for the dice problem in Section~\ref{dice2}.
2138

2139
\index{dice problem}
2140

2141
Here's the update:
2142

2143
\begin{code}
data = 60
posterior = prior.copy()
update_train(posterior, data)
\end{code}
2148

2149
Figure~\ref{fig04-01} shows the results.
2150

2151
\begin{figure}
2152
% chap04soln.ipynb
2153
\centerline{\includegraphics[width=4in]{figs/fig04-01.pdf}}
2154
\caption{Posterior distribution of the number of trains, $N$, after seeing train number 60.}
2155
\label{fig04-01}
2156
\end{figure}
2157

2158
Not surprisingly, all values of $N$ below 60 have been eliminated.
2159

2160
The most likely value, if you had to guess, is 60.
2161
That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number?
2162
Nevertheless, if you want to maximize the chance of getting
2163
the answer exactly right, you should guess 60.
2164

2165
But maybe that's not the right goal.
2166
An alternative is to compute the mean of the posterior distribution.
2167
Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:
%
\[ \mathrm{mean} = \sum_i p_i q_i \]
%
which we can compute like this:

\begin{code}
np.sum(posterior.ps * posterior.qs)
\end{code}

Or we can use the method provided by \py{Pmf}:

\begin{code}
posterior.mean()
\end{code}
2182

2183
The mean of the posterior is 333, so that might be a good guess if you want to minimize error.
2184
If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the mean squared error over the long run (see \url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
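
Both estimates can be reproduced with plain NumPy arrays.
This is a sketch under the same assumptions as above (a uniform prior from 1 to 1000 and a likelihood of $1/N$); the variable names are chosen for this illustration.

\begin{code}
import numpy as np

hypos = np.arange(1, 1001)             # possible values of N
prior = np.ones(len(hypos))            # uniform prior (unnormalized)

data = 60
# 1/N if the observed number is possible, 0 otherwise
likelihood = np.where(hypos >= data, 1 / hypos, 0)

posterior = prior * likelihood
posterior /= posterior.sum()

map_estimate = hypos[posterior.argmax()]
mean_estimate = (posterior * hypos).sum()
print(map_estimate, mean_estimate)
\end{code}

The gap between the two numbers reflects the long right tail of the posterior: the single most likely value is at the left edge, but most of the probability lies well above it.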
2185

2186
\index{mean squared error}
2187

2188

2189
\section{What about that prior?}
2190

2191
The prior I chose in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.
2192

2193
\index{prior distribution}
2194

2195
We might wonder whether the posterior distribution is sensitive to the prior.
2196
With so little data---only one observation---it is:
2197

2198
\begin{itemize}
2199

2200
\item With a uniform prior from 1 to 500, the posterior mean is 207.
2201

2202
\item With an upper bound of 1000, it's 333.
2203

2204
\item With an upper bound of 2000, it's 552.
2205

2206
\end{itemize}
2207

2208
So that's bad.
2209
When the posterior is sensitive to the prior, there are two ways to proceed:
2210

2211
\begin{itemize}
2212

2213
\item Get more data.
2214

2215
\item Get more background information and choose a better prior.
2216

2217
\end{itemize}
2218

2219
With more data, posterior distributions based on different
2220
priors tend to converge.
2221
For example, suppose that in addition
2222
to train 60 we also see trains 30 and 90.
2223
We can update the distribution like this:

\begin{code}
posterior = prior.copy()
for data in [30, 60, 90]:
    update_train(posterior, data)
\end{code}
2229

2230
With these data, the means of the posteriors are

\begin{tabular}{r r}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
500 & 152 \\
1000 & 164 \\
2000 & 171 \\
\bottomrule
\end{tabular}
2242

2243
The differences are smaller, but apparently three trains is not enough for the posteriors to converge.
2244

2245

2246
\section{Another prior}
2247

2248
\begin{figure}
2249
% train.py
2250
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
2251
\caption{Posterior distribution based on a power law prior,
2252
compared to a uniform prior.}
2253
\label{fig.train4}
2254
\end{figure}
2255

2256
If more data are not available, another option is to improve the
2257
priors by gathering more background information.
2258
It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.
2259

2260
With some effort, we could probably find a list of companies that
2261
operate locomotives in the area of observation.
2262
Or we could interview an expert in rail shipping to gather information about the typical size of companies.
2263

2264
But even without getting into the specifics of railroad economics, we
2265
can make some educated guesses.
2266
In most fields, there are many small
2267
companies, fewer medium-sized companies, and only one or two very
2268
large companies.
2269
In fact, the distribution of company sizes tends to
2270
follow a power law, as Robert Axtell reports in {\it Science} (see
2271
\url{https://sci-hub.tw/10.1126/science.1062081}).
2272

2273
% \url{http://www.sciencemag.org/content/293/5536/1818.full.pdf}
2274

2275
\index{power law}
2276
\index{Axtell, Robert}
2277

2278
This law suggests that if there are 1000 companies with fewer than
2279
10 locomotives, there might be 100 companies with 100 locomotives,
2280
10 companies with 1000, and possibly one company with 10,000 locomotives.
2281

2282
Mathematically, a power law means that the number of companies
with a given size is inversely proportional to a power of the size, or
%
\[ \PMF(N) \sim \left( \frac{1}{N} \right)^{\alpha}   \]
%
where $\PMF(N)$ is the probability mass function of $N$ and $\alpha$ is
a parameter that is often near 1.
2289

2290
We can construct a power law prior like this:
2291

2292
\begin{code}
alpha = 1.0
hypos = np.arange(1, 1001)
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()
\end{code}
2299

2300
Again, the upper bound is arbitrary, but with a power law prior, the posterior is less sensitive to this choice.
2301

2302
\begin{figure}
2303
% chap04soln.ipynb
2304
\centerline{\includegraphics[width=4in]{figs/fig04-02.pdf}}
2305
\caption{Posterior distributions for the uniform and power law priors
2306
after seeing train 60.}
2307
\label{fig04-02}
2308
\end{figure}
2309

2310
Figure~\ref{fig04-02} shows the new posterior based on the power law prior, compared to the posterior based on the uniform prior, both after seeing train number 60.
2311

2312
To check the sensitivity to the upper bound, suppose we again observe trains 30, 60, and 90.
With the power law prior, the means of the posteriors are

\begin{tabular}{rr}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
  500 & 131 \\
  1000 & 133 \\
  2000 & 134 \\
\bottomrule
\end{tabular}
2325

2326
Now the differences are much smaller.  In fact,
2327
with an arbitrarily large upper bound, the mean converges on 134.
2328

2329
So the power law prior is more realistic, because it is based on
2330
general information about the size of companies, and it behaves better in practice.
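
The comparison between the two priors can be sketched in a few lines of NumPy.
The function \py{posterior_mean} is a hypothetical helper written for this illustration; it assumes a prior proportional to $N^{-\alpha}$, so \py{alpha=0} gives the uniform prior and \py{alpha=1} gives the power law prior used above.

\begin{code}
import numpy as np

def posterior_mean(upper, alpha, dataset):
    """Posterior mean of N for a prior proportional to N**(-alpha)."""
    hypos = np.arange(1, upper + 1)
    posterior = hypos ** (-float(alpha))     # alpha = 0 gives a uniform prior
    for data in dataset:
        # each train is equally likely to be seen, so the likelihood is 1/N
        posterior = posterior * np.where(hypos >= data, 1 / hypos, 0)
    posterior /= posterior.sum()
    return (posterior * hypos).sum()

dataset = [30, 60, 90]
for upper in [500, 1000, 2000]:
    uniform_mean = posterior_mean(upper, 0, dataset)
    power_mean = posterior_mean(upper, 1, dataset)
    print(upper, round(uniform_mean), round(power_mean))
\end{code}

Each row of output shows the upper bound followed by the posterior means under the uniform and power law priors, which makes the difference in sensitivity easy to see side by side.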
2331

2332

2333
\section{Credible intervals}
2334
\label{credible}
2335

2336
So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean.
2337
These are both {\bf point estimates}, that is, single values that estimate the quantity we are interested in.
2338

2339
Another way to summarize a posterior distribution is with percentiles.
2340
If you have taken a standardized test, you might be familiar with percentiles.
2341
For example, if your score is the 90th percentile, that means you did as well as or better than 90\% of the people who took the test.
2342

2343
If we are given a value, \py{x}, we can compute its {\bf percentile rank} by finding all values less than or equal to \py{x} and adding up their probabilities.
2344
\py{Pmf} provides a method that does this computation.
2345
So, for example, we can compute the probability that the company has fewer than 100 trains (\py{lt_dist} uses a strict inequality).
Here \py{power} holds the power law distribution after it has been updated with trains 30, 60, and 90:

\begin{code}
power.lt_dist(100)
\end{code}

With a power law prior and a dataset of three trains, the result is about 27\%.
So 100 trains is at about the 27th percentile.
2353

2354
Going the other way, suppose we want to compute a particular percentile; for example, the median of a distribution is the 50th percentile.
We can compute it by adding up probabilities until the total reaches or exceeds 0.5.
Here's a function that does it:

\begin{code}
def quantile(pmf, prob):
    total = 0
    for q, p in pmf.items():
        total += p
        if total >= prob:
            return q
    return np.nan
\end{code}
2367

2368
\py{pmf} represents a normalized distribution.
2369
\py{prob} is the probability of the percentile we want to compute.
2370

2371
The loop uses \py{items}, which iterates over the quantities and probabilities in the distribution.
2372
Inside the loop we add up the probabilities of the quantities in order.
2373
When the total equals or exceeds \py{prob}, we return the corresponding quantity.
2374

2375
This function is called \py{quantile} because it computes a quantile rather than a percentile.
2376
The difference is the way we specify \py{prob}.
2377
If \py{prob} is a percentage between 0 and 100, we call the corresponding quantity a percentile.
2378
If \py{prob} is a probability between 0 and 1, we call the corresponding quantity a {\bf quantile}.
2379

2380
Here's how we can use this function to compute the median of the posterior distribution:
2381

2382
\begin{code}
quantile(power, 0.5)
\end{code}
2385

2386
The result, 113 trains, is the median of the posterior distribution.
2387

2388
\py{Pmf} provides a method called \py{quantile} that does the same thing.
We can call it like this to compute the 5th and 95th percentiles:

\begin{code}
power.quantile([0.05, 0.95])
\end{code}
2394

2395
The result is the interval from 91 to 242 trains, which implies:
2396

2397
\begin{itemize}
2398

2399
\item The probability is 5\% that the number of trains is less than or equal to 91.
2400

2401
\item The probability is 5\% that the number of trains is greater than 242.
2402

2403
\end{itemize}
2404

2405
Therefore the probability is 90\% that the number of trains falls between 91 and 242 (excluding 91 and including 242).
2406
For this reason, this interval is called a 90\% {\bf credible interval}.
2407

2408
\py{Pmf} also provides \py{credible_interval}, which computes an interval that contains the given probability.
2409

2410
\begin{code}
power.credible_interval(0.9)
\end{code}
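
The loop in \py{quantile} can also be written with a cumulative sum.
The following sketch rebuilds the posterior with plain NumPy arrays (power law prior with an upper bound of 1000, updated with trains 30, 60, and 90) and looks up quantiles with \py{np.searchsorted}; the function name \py{quantile_cumsum} is an assumption made for this illustration, not a method of \py{Pmf}.

\begin{code}
import numpy as np

# rebuild the posterior: power law prior on 1..1000, then trains 30, 60, 90
hypos = np.arange(1, 1001)
posterior = 1 / hypos
for data in [30, 60, 90]:
    posterior = posterior * np.where(hypos >= data, 1 / hypos, 0)
posterior /= posterior.sum()

def quantile_cumsum(qs, ps, probs):
    """Smallest quantity whose cumulative probability is >= each prob."""
    cdf = np.cumsum(ps)
    # side='left' finds the first index where cdf >= prob
    index = np.searchsorted(cdf, probs, side='left')
    return qs[index]

median = quantile_cumsum(hypos, posterior, 0.5)
low, high = quantile_cumsum(hypos, posterior, [0.05, 0.95])
print(median, low, high)
\end{code}

The values \py{low} and \py{high} are the endpoints of a 90\% credible interval, and \py{median} is the 50th percentile.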
2413

2414

2415

2416

2417
\section{The German tank problem}
2418

2419
During World War II, the Economic Warfare Division of the American
2420
Embassy in London used statistical analysis to estimate German
2421
production of tanks and other equipment.\footnote{Ruggles and Brodie,
2422
  ``An Empirical Approach to Economic Intelligence in World War II,''
2423
  {\em Journal of the American Statistical Association}, Vol. 42,
2424
  No. 237 (March 1947).}
2425

2426
The Western Allies had captured log books, inventories, and repair
2427
records that included chassis and engine serial numbers for individual
2428
tanks.
2429

2430
Analysis of these records indicated that serial numbers were allocated
2431
by manufacturer and tank type in blocks of 100 numbers, that numbers
2432
in each block were used sequentially, and that not all numbers in each
2433
block were used.  So the problem of estimating German tank production
2434
could be reduced, within each block of 100 numbers, to a form of the
2435
locomotive problem.
2436

2437
Based on this insight, American and British analysts produced
2438
estimates substantially lower than estimates from other forms
2439
of intelligence.  And after the war, records indicated that they were
2440
substantially more accurate.
2441

2442
They performed similar analyses for tires, trucks, rockets, and other
2443
equipment, yielding accurate and actionable economic intelligence.
2444

2445
The German tank problem is historically interesting; it is also a nice
2446
example of real-world application of statistical estimation.  So far
2447
many of the examples in this book have been toy problems, but it will
2448
not be long before we start solving real problems.  I think it is an
2449
advantage of Bayesian analysis, especially with the computational
2450
approach we are taking, that it provides such a short path from a
2451
basic introduction to the research frontier.
2452

\section{Informative priors}

Among Bayesians, there are two approaches to choosing prior
distributions.  Some recommend choosing the prior that best represents
background information about the problem; in that case the prior
is said to be {\bf informative}.  The problem with using an informative
prior is that people might use different background information (or
interpret it differently).  So informative priors often seem subjective.
\index{informative prior}

The alternative is a so-called {\bf uninformative prior}, which is
intended to be as unrestricted as possible, in order to let the data
speak for themselves.  In some cases you can identify a unique prior
that has some desirable property, like representing minimal prior
information about the estimated quantity.
\index{uninformative prior}

Uninformative priors are appealing because they seem more
objective.  But I am generally in favor of using informative priors.
Why?  First, Bayesian analysis is always based on
modeling decisions.  Choosing the prior is one of those decisions, but
it is not the only one, and it might not even be the most subjective.
So even if an uninformative prior is more objective, the entire analysis
is still subjective.

\index{modeling}
\index{subjectivity}
\index{objectivity}

Also, for most practical problems, you are likely to be in one of two
regimes: either you have a lot of data or not very much.  If you have
a lot of data, the choice of the prior doesn't matter very much;
informative and uninformative priors yield almost the same results.
We'll see an example like this in the next chapter.

But if, as in the locomotive problem, you don't have much data,
using relevant background information (like the power law distribution)
makes a big difference.
\index{locomotive problem}

And if, as in the German tank problem, you have to make life-and-death
decisions based on your results, you should probably use all of the
information at your disposal, rather than maintaining the illusion of
objectivity by pretending to know less than you do.
\index{German tank problem}

\section{Exercises}

The code for this chapter is in \py{chap04.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap04.ipynb}.

The notebook provides space where you can work on the following problems.

\begin{exercise}
Suppose you are giving a talk in a large lecture hall and you want to estimate the number of people in the audience.  There are too many to count, so you ask how many people were born on May 11 and two people raise their hands.  You ask how many were born on May 23 and one person raises their hand.  Finally, you ask how many were born on August 1, and no one raises their hand.

How many people are in the audience?  What is the 90\% credible interval for your estimate?  Hint: Remember the binomial distribution.
\end{exercise}

\begin{exercise}
I often see rabbits in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.

Suppose I deploy a motion-sensing camera trap that takes a picture of the first rabbit it sees each day.  After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.

How many rabbits visit my garden?

To answer this question, we have to think about the prior distribution and the likelihood of the data:

\begin{itemize}

\item I have sometimes seen four rabbits at the same time, so I know there are at least that many.  I would be surprised if there were more than 10.  So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.

\item To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day.  Let's also assume it is guaranteed that the camera trap gets a picture every day.

\end{itemize}

\end{exercise}

\begin{exercise}
Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each.  One day, you visit a prison and choose a prisoner at random.  What is the probability that they are serving a 3-year sentence?  What is the average remaining sentence of the prisoners you observe?
\end{exercise}

\begin{exercise}
If I choose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child?

This article from the Pew Research Center provides some relevant data: \url{https://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers}.  You will have to make some simplifying assumptions.
\end{exercise}

\begin{exercise}
The Doomsday argument is ``a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far.''  See \url{https://en.wikipedia.org/wiki/Doomsday_argument}.

Suppose there are only two kinds of civilizations that can happen in the universe. The ``short-lived'' kind go extinct after only 200 billion individuals are born. The ``long-lived'' kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely.  Which kind of civilization do you think we live in?

The Doomsday argument says we can use the total number of humans born so far as evidence.
According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.

Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion.
If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between.
So what is the probability that you would be number 108 billion?

Given this data and dubious prior, what is the probability that our civilization will be short-lived?

\end{exercise}

%\begin{exercise}
%To write a likelihood function for the locomotive problem, we had
%to answer this question:  ``If the railroad has $N$ locomotives, what
%is the probability that we see number 60?''
%
%The answer depends on what sampling process we use when we observe the
%locomotive.  In this chapter, I resolved the ambiguity by specifying
%that there is only one train-operating company (or only one that we
%care about).
%
%But suppose instead that there are many companies with different
%numbers of trains.  And suppose that you are equally likely to see any
%train operated by any company.
%In that case, the likelihood function is different because you
%are more likely to see a train operated by a large company.
%
%As an exercise, implement the likelihood function for this variation
%of the locomotive problem, and compare the results.
%
%# Solution
%
%# Suppose Company A has N trains and all other companies have M.
%# The chance that we would observe one of Company A's trains is
%# $N/(N+M)$.
%
%# Given that we observe one of Company A's trains, the chance that we
%# observe number 60 is $1/N$ for $N \ge 60$.
%
%# The product of these probabilities is $1/(N+M)$, which is the
%# probability of observing any given train.
%
%# If N<<M, this converges to a constant, which means that all values
%# of $N$ have the same likelihood, so we learn nothing about how many
%# trains Company A has.
%
%# If N>>M, this converges to $1/N$, which is what we saw in the
%# previous solution.
%
%# More generally, if M is unknown, we would need a prior distribution
%# for M, then we can do a two-dimensional update, and then extract the posterior
%# distribution for N.
%
%# We'll see how to do that soon.
%\end{exercise}

\chapter{Odds and Addends}

This chapter presents a new way to represent a degree of certainty, called ``odds'', and a new form of Bayes's Theorem, called Bayes's Rule.
Bayes's Rule is convenient if you want to do a Bayesian update on paper or in your head.
It also sheds light on the important idea of ``evidence'' and how we can quantify the strength of evidence.

The second part of the chapter is about ``addends'', that is, quantities being added, and how we can compute their distributions.
We'll define functions that compute the distribution of a sum, difference, or result of another operation.
And then we'll use those distributions as part of a Bayesian update.

As an exercise, you'll have a chance to solve the Congress problem:

\begin{quote}
There are 538 members of the United States Congress.
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance.  How many members of Congress are honest?
\end{quote}

\section{Odds}

One way to represent a degree of certainty is a probability in the form of a number between 0 and 1, but that's not the only way.
If you have ever bet on a football game or a horse race, you might have encountered another representation of certainty, called {\bf odds}.

\index{odds}

You might have heard expressions like ``the odds are
three to one,'' but you might not know what that means.
The {\bf odds in favor} of an event are the ratio of the probability
it will occur to the probability that it will not.

So if I think my team has a 75\% chance of winning, I would
say that the odds in their favor are three to one, because
the chance of winning is three times the chance of losing.

You can write odds in decimal form, but it is also common to
write them as a ratio of integers.  So ``three to one'' is
written $3:1$.

When probabilities are low, it is more common to report the
{\bf odds against} rather than the odds in favor.  For
example, if I think my horse has a 10\% chance of winning,
I would say that the odds against are $9:1$.

Probabilities and odds are different representations of the
same information.  Given a probability, you can compute the
odds like this:

\begin{code}
def odds(p):
    return p / (1-p)
\end{code}

Given the odds in favor, in decimal form, you can convert to probability like this:

\begin{code}
def prob(o):
    return o / (o+1)
\end{code}

If you represent odds with a numerator and denominator, you
can convert to probability like this:

\begin{code}
def prob2(yes, no):
    return yes / (yes + no)
\end{code}

When I work with odds in my head, I find it helpful to picture
people at the track.  If 20\% of them think my horse will win,
then 80\% of them don't, so the odds in favor are $20:80$ or
$1:4$.

If the odds are $5:1$ against my horse, then five out of six
people think she will lose, so the probability of winning
is $1/6$.

\index{horse racing}
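
As a quick check on these conversions, the following sketch repeats the three functions and applies them to the examples in this section.
It uses \py{Fraction} from the Python standard library, which is my addition for illustration, so the results are exact; for the same reason \py{prob2} is rewritten here to return a \py{Fraction}.

\begin{code}
from fractions import Fraction

def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def prob(o):
    """Convert odds in favor (a single number) to a probability."""
    return o / (o + 1)

def prob2(yes, no):
    """Convert odds written as yes:no to a probability."""
    return Fraction(yes, yes + no)

# a 75% chance of winning corresponds to odds of 3:1 in favor
team_odds = odds(Fraction(3, 4))

# a 10% chance of winning: the odds against are the reciprocal
horse_odds_against = 1 / odds(Fraction(1, 10))

# 5:1 against is 1:5 in favor
horse_prob = prob2(1, 5)

# converting to odds and back recovers the probability
round_trip = prob(odds(Fraction(1, 5)))
\end{code}

Under these assumptions, \py{team_odds} is 3, \py{horse_odds_against} is 9, \py{horse_prob} is $1/6$, and \py{round_trip} is $1/5$.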

\section{Bayes's Rule}

\index{Bayes's Rule}

In Chapter~\ref{intro} I wrote Bayes's Theorem in the {\bf probability
form}:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
If we have two hypotheses, $A$ and $B$,
we can write the ratio of posterior probabilities like this:
%
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}
                                        {\p{B}~\p{D|B}} \]
%
Notice that the total probability of the data, \p{D}, drops out of
this equation.

Writing \odds{A} for odds in favor of $A$, we use the definition of odds to write:
%
\[ \odds{A} = \frac{\p{A}}{1-\p{A}} \]
%
If $A$ and $B$ are mutually exclusive and collectively exhaustive,
that means $\p{B} = 1 - \p{A}$, so we can write
%
\[ \odds{A} = \frac{\p{A}}{\p{B}}  \]
%
By the same process, we can write the posterior odds like this:
%
\[ \odds{A|D} = \frac{\p{A|D}}{\p{B|D}}  \]
%
Putting it all together, we have:
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
This is Bayes's Rule, which says that the posterior odds are the prior odds times the likelihood ratio.

This form of Bayes's Theorem is convenient for computing a Bayesian update on paper or in your head.
For example, let's go back to the cookie problem:
\index{cookie problem}

\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.
The cookie is vanilla.
What is the probability that it came from Bowl 1?
\end{quote}

The prior probability is 50\%, so the prior odds are $1$.
The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
So the posterior odds are $3/2$, which corresponds to probability
$3/5$.
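
The same update can be written as a few lines of Python.
This is a minimal sketch, not code from the book's notebook: the function name \py{bayes_rule} is hypothetical, and \py{Fraction} is used only to keep the arithmetic exact.

\begin{code}
from fractions import Fraction

def bayes_rule(prior_odds, likelihood_ratio):
    """Posterior odds are prior odds times the likelihood ratio."""
    return prior_odds * likelihood_ratio

prior_odds = Fraction(1, 1)        # two bowls, equally likely
like_bowl1 = Fraction(30, 40)      # P(vanilla | Bowl 1)
like_bowl2 = Fraction(20, 40)      # P(vanilla | Bowl 2)

posterior_odds = bayes_rule(prior_odds, like_bowl1 / like_bowl2)

# convert odds in favor back to a probability
posterior_prob = posterior_odds / (posterior_odds + 1)
\end{code}

Here \py{posterior_odds} is $3/2$ and \py{posterior_prob} is $3/5$, matching the result worked out by hand.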

\section{Oliver's blood}
\label{oliver}

\index{Oliver's blood problem}
\index{MacKay, David}

I'll use Bayes's Rule to solve another problem from MacKay's {\it Information Theory, Inference, and Learning Algorithms}:

\begin{quote}
Two people have left traces of their own blood at the scene of
a crime.  A suspect, Oliver, is tested and found to have type
`O' blood.  The blood groups of the two traces are found to
be of type `O' (a common type in the local population, having frequency
60\%) and of type `AB' (a rare type, with frequency 1\%).
Do these data [the traces found at the scene] give evidence
in favor of the proposition that Oliver was one of the people
[who left blood at the scene]?
\end{quote}

To answer this question, we need to think about what it means
for data to give evidence in favor of (or against) a hypothesis.
Intuitively, we might say that data favor a hypothesis if the
hypothesis is more likely in light of the data than it was before.

\index{evidence}

In the cookie problem, the prior odds are $1$, or probability 50\%.
The posterior odds are $3/2$, or probability 60\%.
So the vanilla cookie is evidence in favor of Bowl 1.

Bayes's Rule provides a way to make this intuition more precise.  Again
%
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
%
Dividing through by \odds{A}, we get:
%
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
%
The term on the left is the ratio of the posterior and prior odds.
The term on the right is the likelihood ratio, also called the {\bf Bayes
factor}.

\index{likelihood ratio}
\index{Bayes factor}

If the Bayes factor is greater than 1, that means that the
data were more likely under $A$ than under $B$.
And that means that the odds are greater, in light of the data, than they were before.

If the Bayes factor is less than 1, that means the data were
less likely under $A$ than under $B$, so the odds in
favor of $A$ go down.

Finally, if the Bayes factor is exactly 1, the data are equally
likely under either hypothesis, so the odds do not change.

Let's apply that to the problem at hand.  If Oliver is
one of the people who left blood at the crime scene, he
accounts for the `O' sample; in that case, the probability of the data
is the probability that a random member of the population
has type `AB' blood, which is 1\%.

If Oliver did not leave blood at the scene, we have two
samples to account for.  If we choose two random people from
the population, what is the chance of finding one with type `O'
and one with type `AB'?  Well, there are two ways it might happen:
the first person might have type `O' and the second
`AB', or the other way around.  So the total probability is
$2 (0.6) (0.01) = 1.2\%$.

The likelihood of the data is slightly higher if Oliver is
{\it not} one of the people who left blood at the scene, so
the blood data are actually evidence against Oliver's guilt.

\index{evidence}

This example is a little contrived, but it demonstrates
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.

If this result still bothers you, this way of thinking might help: the data consist of a common event, type `O' blood, and a rare event, type `AB' blood.
If Oliver accounts for the common event, that leaves the rare
event unexplained.  If Oliver doesn't account for the
`O' blood, we have two chances to find someone in the
population with `AB' blood.  And that factor of two makes
the difference.
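
To put a number on this result, we can write the two likelihoods as a Bayes factor, with $A$ the hypothesis that Oliver left blood at the scene and $B$ the hypothesis that he did not:
%
\[ \frac{\p{D|A}}{\p{D|B}} = \frac{0.01}{2 (0.6) (0.01)} = \frac{1}{1.2} = \frac{5}{6} \]
%
Whatever the prior odds are, the posterior odds are smaller by a factor of $5/6$.
For example, taking prior odds of $1$ purely for illustration (the problem does not specify a prior), the posterior odds would be $5/6$, which corresponds to probability $5/11$, or about 45\%.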

\section{Addends}
\label{addends}

Suppose you roll two dice and add them up.  What is the distribution of the sum?
I'll use the following function to create a \py{Pmf} that represents the outcome of a die:

\begin{code}
def make_die(sides):
    outcomes = np.arange(1, sides+1)
    die = Pmf(1/sides, outcomes)
    return die
\end{code}

On a six-sided die, there are six possible outcomes, 1 through 6, all equally likely.

\begin{code}
die = make_die(6)
\end{code}

If we roll two dice and add them up, there are 11 possible outcomes, 2 through 12, but they are not equally likely.
To compute the distribution of the sum, we can enumerate the possible outcomes.
The following loop enumerates the quantities and probabilities from a \py{Pmf}:

\begin{code}
for q, p in die.items():
    print(q, p)
\end{code}

\py{items} iterates through the quantities and probabilities in the \py{Pmf}.
So this loop enumerates all pairs of quantities and their probabilities:

\begin{code}
for q1, p1 in pmf1.items():
    for q2, p2 in pmf2.items():
        q = q1 + q2
        p = p1 * p2
\end{code}

Each time through the loop \py{q} gets the sum of the pair of quantities, and \py{p} gets the probability of the pair.
Because the same sum might appear more than once, we have to add up the total probability for each sum.
And that's how this function works:

\begin{code}
def add_dist(pmf1, pmf2):
    res = Pmf()
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            q = q1 + q2
            p = p1 * p2
            res[q] = res(q) + p
    return res
\end{code}

The parameters are \py{Pmf} objects representing distributions.
The first line creates an empty \py{Pmf}.
Each time through the loop, we compute \py{q} and \py{p} and then increment the probability associated with \py{q}.

Notice a subtle element of this line:

\begin{code}
            res[q] = res(q) + p
\end{code}

I use parentheses on the right side of the assignment, which returns 0 if \py{q} does not appear yet in \py{res}.
I use brackets on the left side of the assignment to create or update an element in \py{res}; using parentheses on the left side would not work.
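
The same accumulate-with-default pattern can be sketched with a plain Python dictionary, where \py{get} with a default of 0 plays the role of the parentheses.
This is only an illustration under the assumption that a dictionary stands in for a \py{Pmf}; the names \py{make_die_dict} and \py{add_dist_dict} are hypothetical and not part of any library.

\begin{code}
from fractions import Fraction

def make_die_dict(sides):
    """Map each outcome 1..sides to probability 1/sides."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def add_dist_dict(d1, d2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in d1.items():
        for q2, p2 in d2.items():
            q = q1 + q2
            # get returns 0 the first time we see a particular sum
            res[q] = res.get(q, 0) + p1 * p2
    return res

die = make_die_dict(6)
twice = add_dist_dict(die, die)
\end{code}

In this sketch, \py{twice} has the 11 outcomes from 2 through 12, and the probability of a 7 is $6/36$, because six of the 36 equally likely pairs add up to 7.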

\py{Pmf} provides a method that does the same thing.
You can call it as a method, like this:

\begin{code}
twice = die.add_dist(die)
\end{code}

Or as a function, like this:

\begin{code}
twice = Pmf.add_dist(die, die)
\end{code}

If we have a sequence of \py{Pmf} objects that represent dice, we can compute the distribution of the sum like this:

\begin{code}
def add_dist_seq(seq):
    total = seq[0]
    for other in seq[1:]:
        total = total.add_dist(other)
    return total
\end{code}

So we can compute the sum of three dice like this:

\begin{code}
dice = [die] * 3
thrice = add_dist_seq(dice)
\end{code}

Figure~\ref{fig05-01} shows what these three distributions look like:

\begin{itemize}

\item The distribution of a single die is uniform from 1 to 6.

\item The sum of two dice has a triangle distribution between 2 and 12.

\item The sum of three dice has a bell-shaped distribution between 3 and 18.

\end{itemize}

\begin{figure}
% chap05soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig05-01.pdf}}
\caption{Distribution of outcomes for one six-sided die, two dice, and three dice.}
\label{fig05-01}
\end{figure}

As an aside, this example demonstrates the Central Limit Theorem, which says that the distribution of a sum converges on a bell-shaped normal distribution, at least under some conditions.
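
One way to check the three-dice distribution without \py{Pmf} is to fold a dictionary-based sum over a list of dice.
The sketch below is an assumption-laden stand-in for \py{add_dist_seq}: the helper names are hypothetical, and it relies on the standard fact that means and variances add when independent quantities are summed.

\begin{code}
from fractions import Fraction
from functools import reduce

def make_die_dict(sides):
    """Map each outcome 1..sides to probability 1/sides."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def add_dist_dict(d1, d2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in d1.items():
        for q2, p2 in d2.items():
            res[q1 + q2] = res.get(q1 + q2, 0) + p1 * p2
    return res

def mean(dist):
    """Expected value of a distribution stored as a dictionary."""
    return sum(q * p for q, p in dist.items())

def var(dist):
    """Variance of a distribution stored as a dictionary."""
    m = mean(dist)
    return sum(p * (q - m) ** 2 for q, p in dist.items())

die = make_die_dict(6)

# fold add_dist_dict over three copies of the die
thrice = reduce(add_dist_dict, [die] * 3)
\end{code}

With these definitions, the mean of \py{thrice} is three times the mean of one die ($3 \times 7/2$), and its variance is three times the variance of one die ($3 \times 35/12$).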

\section{Gluten}

In 2015 I read a paper that tested whether people diagnosed with gluten sensitivity (but not celiac disease) were able to distinguish gluten flour from non-gluten flour in a blind challenge (\url{https://onlinelibrary.wiley.com/doi/full/10.1111/apt.13372}).

Out of 35 subjects, 12 correctly identified the gluten flour based on resumption of symptoms while they were eating it.  Another 17 wrongly identified the gluten-free flour based on their symptoms, and 6 were unable to distinguish.

The authors conclude, ``Double-blind gluten challenge induces symptom recurrence in just one-third of patients.''

This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.
So here's the question: based on this data, how many of the subjects are sensitive to gluten?

We can use Bayes's Theorem to answer this question, but first we have to make some modeling decisions.
I'll assume:

\begin{itemize}

\item People who are sensitive to gluten have a 95\% chance of correctly identifying gluten flour under the challenge conditions, and

\item People who are not sensitive have a 40\% chance of identifying the gluten flour by chance (and a 60\% chance of either choosing the other flour or failing to distinguish).

\end{itemize}

These particular values are arbitrary, but the results are not sensitive to these choices.

I will solve this problem in two steps.  First, assuming that we know how many subjects are sensitive, I will compute the distribution of the data.  Then, using the likelihood of the data, I will compute the posterior distribution of the number of sensitive patients.

The first is the {\bf forward problem}; the second is the {\bf inverse problem}.

\section{Forward problem}

Suppose we know that 10 of the 35 subjects are sensitive to gluten.  That means that 25 are not:

\begin{code}
n = 35
n_sensitive = 10
n_insensitive = n - n_sensitive
\end{code}

Each sensitive subject has a 95\% chance of identifying the gluten flour, so the number of correct identifications follows a binomial distribution with \py{p=0.95}:

\begin{code}
dist_sensitive = make_binomial(n_sensitive, 0.95)
\end{code}

And similarly for the insensitive subjects:

\begin{code}
dist_insensitive = make_binomial(n_insensitive, 0.4)
\end{code}

\py{make_binomial} returns a \py{Pmf} that represents the distribution of correct identifications.
So we can use \py{add_dist} to compute the total number of correct identifications in both groups:

\begin{code}
dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
\end{code}

Figure~\ref{fig05-02} shows the distribution of correct identifications among sensitive and insensitive subjects, and the total.

\begin{figure}
% chap02soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig05-02.pdf}}
\caption{Distribution of correct identifications among sensitive and insensitive subjects, and the total.}
\label{fig05-02}
\end{figure}

Of the 10 sensitive subjects, we expect most of them to identify the gluten flour correctly.
Of the 25 insensitive subjects, we expect about 10 to identify the gluten flour by chance.
So we expect about 20 correct identifications in total.

This is the answer to the forward problem: given the number of sensitive subjects, we can compute the distribution of the data.
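
The forward problem can also be reproduced with only the Python standard library, which makes it easy to check the expected total.
In this sketch, \py{binomial_dict} and \py{add_dist_dict} are hypothetical stand-ins for \py{make_binomial} and \py{Pmf.add_dist}, with dictionaries in place of \py{Pmf} objects; \py{math.comb} requires Python 3.8 or later.

\begin{code}
from math import comb

def binomial_dict(n, p):
    """Map k = 0..n to the binomial probability of k successes."""
    return {k: comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n + 1)}

def add_dist_dict(d1, d2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in d1.items():
        for q2, p2 in d2.items():
            res[q1 + q2] = res.get(q1 + q2, 0) + p1 * p2
    return res

n = 35
n_sensitive = 10
n_insensitive = n - n_sensitive

dist_sensitive = binomial_dict(n_sensitive, 0.95)
dist_insensitive = binomial_dict(n_insensitive, 0.4)
dist_total = add_dist_dict(dist_sensitive, dist_insensitive)

# expected number of correct identifications in both groups
expected_total = sum(k * p for k, p in dist_total.items())
\end{code}

The expected total is $10 \times 0.95 + 25 \times 0.4 = 19.5$, consistent with the estimate of about 20 correct identifications.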

\section{Inverse Problem}

Now let's solve the inverse problem: given the data, we'll compute the posterior distribution of the number of sensitive subjects.

Here's how.  I'll loop through the possible values of \py{n_sensitive} and compute the distribution of the data for each:

\begin{code}
table = pd.DataFrame()
for n_sensitive in range(1, n):
    n_insensitive = n - n_sensitive

    dist_sensitive = make_binomial(n_sensitive, 0.95)
    dist_insensitive = make_binomial(n_insensitive, 0.4)
    dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
    table[n_sensitive] = dist_total
\end{code}

I store each distribution as a column in a Pandas DataFrame.
When \py{n_sensitive} is 0 or \py{n}, the distribution of the data is a simple binomial, not the sum of two binomials:

\begin{code}
table[0] = make_binomial(n, 0.4)
table[n] = make_binomial(n, 0.95)
\end{code}

Figure~\ref{fig05-03} shows several columns from this table, corresponding to several hypothetical values of \py{n_sensitive}:

\begin{figure}
% chap05soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig05-03.pdf}}
\caption{Distribution of the number of correct identifications for different values of \py{n_sensitive}.}
\label{fig05-03}
\end{figure}

Now we can use this table to compute the likelihood of the data:

\begin{code}
likelihood = table.loc[12]
\end{code}

\py{loc} selects a row from the table.
The row with index 12 contains the probability of 12 correct identifications for each hypothetical value of \py{n_sensitive}.
And that's exactly the likelihood we need to do a Bayesian update.

I'll use a uniform prior, which implies that I would be equally surprised by any value of \py{n_sensitive}:

\begin{code}
hypos = np.arange(n+1)
prior = Pmf(1, hypos)
\end{code}

And here's the update:

\begin{code}
posterior = prior * likelihood
posterior.normalize()
\end{code}

Figure~\ref{fig05-04} shows posterior distributions of \py{n_sensitive} based on the actual data, 12 correct identifications, and another hypothetical outcome, 20 correct identifications.

\begin{figure}
% chap05soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig05-04.pdf}}
\caption{Posterior distributions of \py{n_sensitive}.}
\label{fig05-04}
\end{figure}

With 12 correct identifications, the most likely conclusion is that none of the subjects are sensitive to gluten.
If there had been 20 correct identifications, the most likely conclusion would be that 11--12 of the subjects were sensitive.
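
For readers who want to check the inverse problem without Pandas or \py{Pmf}, here is a minimal standard-library sketch of the same update.
The helper names are hypothetical, dictionaries stand in for \py{Pmf} objects, and the uniform prior is applied implicitly by normalizing the likelihoods.

\begin{code}
from math import comb

def binomial_dict(n, p):
    """Map k = 0..n to the binomial probability of k successes."""
    return {k: comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n + 1)}

def add_dist_dict(d1, d2):
    """Distribution of the sum of two independent quantities."""
    res = {}
    for q1, p1 in d1.items():
        for q2, p2 in d2.items():
            res[q1 + q2] = res.get(q1 + q2, 0) + p1 * p2
    return res

def likelihood(k, n_sensitive, n=35):
    """Probability of k correct identifications given n_sensitive."""
    dist = add_dist_dict(binomial_dict(n_sensitive, 0.95),
                         binomial_dict(n - n_sensitive, 0.4))
    return dist.get(k, 0)

def posterior(k, n=35):
    """Posterior over n_sensitive = 0..n with a uniform prior."""
    unnorm = {h: likelihood(k, h, n) for h in range(n + 1)}
    total = sum(unnorm.values())
    return {h: x / total for h, x in unnorm.items()}

post12 = posterior(12)
post20 = posterior(20)

# hypotheses with the highest posterior probability
most_likely_12 = max(post12, key=post12.get)
most_likely_20 = max(post20, key=post20.get)
\end{code}

Because a binomial with zero trials puts all of its probability on 0, this version does not need a special case for \py{n_sensitive} equal to 0 or \py{n}.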

\section{Summary}

This chapter presents two topics that are almost unrelated except that they make the title of the chapter catchy.

The first part of the chapter is about Bayes's Rule, evidence, and how we can quantify the strength of evidence using a likelihood ratio or Bayes factor.

The second part is about functions that compute the distribution of a sum, product, or the result of another binary operation.
We can use these functions to solve forward and inverse problems; that is, given the parameters of a system, we can compute the distribution of the data or, given the data, we can compute the distribution of the parameters.

In the following exercises, you'll have a chance to practice what you learned.

\section{Exercises}

The code for this chapter is in \py{chap05.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap05.ipynb}.

The notebook provides space where you can work on the following problems.

\begin{exercise}
Let's use Bayes's Rule to solve the Elvis problem from Section~\ref{elvis}:

\begin{quote}
Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?
\end{quote}

In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
The question contains two pieces of information we can use to update this prior.
First, Elvis's twin was also male, which is more likely if they were identical twins, with a likelihood ratio of 2.
Also, Elvis's twin died at birth, which is more likely if they were identical twins, with a likelihood ratio of 1.25.

If you are curious about where those numbers come from, I wrote a blog post about it at \url{https://www.allendowney.com/blog/2020/01/28/the-elvis-problem-revisited}.
\end{exercise}

\begin{exercise}
The following is an interview question that appeared on glassdoor.com, attributed to Facebook (\url{https://www.glassdoor.com/Interview/You-re-about-to-get-on-a-plane-to-Seattle-You-want-to-know-if-you-should-bring-an-umbrella-You-call-3-random-friends-of-y-QTN_519262.htm}):

\begin{quote}
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that ``Yes'' it is raining. What is the probability that it's actually raining in Seattle?
\end{quote}

Use Bayes's Rule to solve this problem.  As a prior you can assume that it rains in Seattle about 10\% of the time.
\end{exercise}

\begin{exercise}
According to the CDC, people who smoke are about 25 times more likely to develop lung cancer than nonsmokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/effects_cig_smoking/}).

Also according to the CDC, about 14\% of adults in the U.S. are smokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm}).

If you learn that someone has lung cancer, what is the probability they are a smoker?
\end{exercise}

\begin{exercise}
In {\it Dungeons~\&~Dragons}, the amount of damage a goblin can withstand is the sum of two six-sided dice. The amount of damage you inflict with a short sword is determined by rolling one six-sided die.
A goblin is defeated if the total damage you inflict is greater than or equal to the amount it can withstand.

Suppose you are fighting a goblin and you have already inflicted 3 points of damage. What is your probability of defeating the goblin with your next successful attack?

Hint: You can use \py{Pmf.add_dist} to add a constant amount, like 3, to a \py{Pmf}.
\end{exercise}

\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it twice, multiply the outcomes, and report that the product is 12.
What is the probability that I chose the 8-sided die?

Hint: \py{Pmf} provides a function called \py{mul_dist} that takes two \py{Pmf} objects and returns a \py{Pmf} that represents the distribution of the product.
\end{exercise}

3176

3177
\begin{exercise}
3178
{\it Betrayal at House on the Hill} is a strategy game in which characters with different attributes explore a haunted house.  Depending on their attributes, the characters roll different numbers of dice.  For example, if attempting a task that depends on knowledge, Professor Longfellow rolls 5 dice, Madame Zostra rolls 4, and Ox Bellows rolls 3.  Each die yields 0, 1, or 2 with equal probability.
3179

3180
If a randomly chosen character attempts a task three times and rolls a total of 3 on the first attempt, 4 on the second, and 5 on the third, which character do you think it was?
3181
\end{exercise}
3182

3183

3184
\begin{exercise}
3185
There are 538 members of the United States Congress.
3186
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
3187
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance.  How many members of Congress are honest?
3188
\end{exercise}
3189

3190

3191

3192
\chapter{Minima, Maxima, and Mixtures}
3193

3194
In the previous chapter we computed distributions of sums, differences, products, and quotients.
3195

3196
In this chapter, we'll compute distributions of minima and maxima and use them to solve inference problems.
3197
Then we'll look at distributions that are mixtures of other distributions, which will turn out to be particularly useful for making predictions.
3198

3199
But we'll start with a powerful tool for working with distributions, the cumulative distribution function.
3200

3201
\section{Cumulative distribution functions}
3202

3203
So far we have been using probability mass functions to represent distributions.
3204
A useful alternative is the {\bf cumulative distribution function}, or CDF.
3205

3206
As an example, I'll use the posterior distribution from the Euro problem, which we computed in Section~\ref{binomlike}.
3207

3208
\begin{code}
hypos = np.linspace(0, 1, 101)
pmf = Pmf(1, hypos)
data = 140, 250
update_binomial(pmf, data)
\end{code}
3214

3215
The CDF is the cumulative sum of the PMF, so we can compute it like this:
3216

3217
\begin{code}
cumulative = pmf.cumsum()
\end{code}
3220

3221
The result is a Pandas Series, so we can use the bracket operator to select an element:
3222

3223
\begin{code}
cumulative[0.61]
\end{code}
3226

3227
The result is about 0.96, which means that the total probability of all quantities less than or equal to 0.61 is 96\%.
3228

3229
To go the other way --- to look up a probability and get the corresponding quantile --- we can use interpolation:
3230

3231
\begin{code}
from scipy.interpolate import interp1d

ps = cumulative.values
qs = cumulative.index

interp = interp1d(ps, qs)
interp(0.96)
\end{code}
3240

3241
The result is about 0.61, so that confirms that the 96th percentile of this distribution is 0.61.
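As a cross-check, here is a minimal sketch of the same two lookups that uses only the Python standard library in place of Pandas and SciPy.
It assumes the uniform prior and binomial likelihood of the Euro problem; the names \py{cumulative_probs} and \py{quantile} are illustrative and are not part of \py{empiricaldist}.
Rather than interpolating, this version returns the smallest quantity whose cumulative probability reaches the target.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypotheses: possible values of the coin's probability of heads.
hypos = [i / 100 for i in range(101)]

# Uniform prior times binomial likelihood for 140 heads and 110 tails.
# The binomial coefficient is the same for every hypothesis, so it
# cancels when we normalize.
unnorm = [p**140 * (1 - p)**110 for p in hypos]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# The CDF is the running sum of the PMF.
cumulative_probs = list(accumulate(posterior))

def quantile(prob):
    """Smallest quantity whose cumulative probability is at least prob."""
    return hypos[bisect_left(cumulative_probs, prob)]

print(round(cumulative_probs[61], 2))   # cumulative probability at 0.61
print(quantile(0.96))                   # quantity at the 96th percentile
```

Because this sketch does not interpolate, it always returns a quantity from the grid, so it can differ slightly from the interpolated result.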
3242

3243
\py{empiricaldist} provides a class called \py{Cdf} that represents a cumulative distribution function.
3244
Given a \py{Pmf}, you can compute a \py{Cdf} like this:
3245

3246
\begin{code}
cdf = pmf.make_cdf()
\end{code}
3249

3250
\py{make_cdf} uses \py{np.cumsum} to compute the cumulative sum of the probabilities.
3251
Figure~\ref{fig06-01} shows the PMF and CDF of this distribution.
3252
The range of the CDF is always from 0 to 1, in contrast with the PMF, where the maximum can be any probability.
3253

3254
\begin{figure}
3255
% chap06soln.ipynb
3256
\centerline{\includegraphics[width=4in]{figs/fig06-01.pdf}}
3257
\caption{Posterior distribution from the Euro problem represented as a PMF and CDF.}
3258
\label{fig06-01}
3259
\end{figure}
3260

3261
You can use brackets to select an element from a \py{Cdf}:
3262

3263
\begin{code}
cdf[0.61]
\end{code}
3266

3267
But if you look up a value that's not in the distribution, you get a \py{KeyError}.
3268
You can also call a \py{Cdf} as a function, using parentheses.
3269
If the argument does not appear in the \py{Cdf}, it interpolates between quantities.
3270

3271
\begin{code}
cdf(0.615)
\end{code}
3274

3275
Going the other way, you can use \py{quantile} to look up a cumulative probability and get the corresponding quantity:
3276

3277
\begin{code}
cdf.quantile(0.96)
\end{code}
3280

3281
\py{Cdf} also provides \py{credible_interval}, which computes a credible interval that contains the given probability:
3282

3283
\begin{code}
cdf.credible_interval(0.9)
\end{code}
3286

3287
CDFs and PMFs are equivalent in the sense that they contain the
3288
same information about the distribution, and you can always convert
3289
from one to the other.
3290
Given a \py{Cdf}, you can get the equivalent \py{Pmf} like this:
3291

3292
\begin{code}
pmf = cdf.make_pmf()
\end{code}
3295

3296
\py{make_pmf} uses \py{np.diff} to compute differences between consecutive cumulative probabilities.
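To make the round trip concrete, here is a small sketch in plain Python that converts a PMF to a CDF with a running sum and back again with consecutive differences.
It uses a fair six-sided die and exact fractions as an assumed example; it mirrors the idea behind \py{make_cdf} and \py{make_pmf} but is not how \py{empiricaldist} is implemented.

```python
from fractions import Fraction
from itertools import accumulate

# PMF of a fair six-sided die, stored as exact fractions.
qs = [1, 2, 3, 4, 5, 6]
ps = [Fraction(1, 6)] * 6

# PMF -> CDF: running sum of the probabilities.
cdf_ps = list(accumulate(ps))

# CDF -> PMF: the first cumulative probability, followed by the
# differences between consecutive cumulative probabilities.
pmf_ps = [cdf_ps[0]] + [b - a for a, b in zip(cdf_ps, cdf_ps[1:])]

print(cdf_ps[-1])        # total probability
print(pmf_ps == ps)      # whether the round trip recovers the PMF
```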
3297

3298
One reason \py{Cdf} objects are useful is that they compute quantiles efficiently.
3299
Another is that they make it easy to compute the distribution of a maximum or minimum, as we'll see in the next section.
3300

3301

3302
\section{Best Three of Four}
3303

3304
In {\it Dungeons~\&~Dragons}, each character has six attributes: strength, intelligence, wisdom, dexterity, constitution, and charisma.
3305

3306
To generate a new character, players roll four 6-sided dice for each attribute and add up the best three.
3307
For example, if I roll for strength and get 1, 2, 3, 4 on the dice, my character's strength would be 9.
3308

3309
As an exercise, let's figure out the distribution of these attributes.
3310
Then, for each character, we'll figure out the distribution of their best attribute.
3311

3312
In Section~\ref{addends}, we computed the distribution of the sum of three dice like this:
3313

3314
\begin{code}
die = make_die(6)
dice = [die] * 3
pmf_3d6 = add_dist_seq(dice)
\end{code}
3319

3320
The definitions of \py{make_die} and \py{add_dist_seq} are in that section.
3321

3322
But if we roll four dice and add up the best three, computing the distribution of the sum is a bit more complicated.
3323
I'll estimate the distribution by simulating 10,000 rolls.
3324

3325
First I'll create an array of random values from 1 to 6, with 10,000 rows and 4 columns:
3326

3327
\begin{code}
n = 10000
a = np.random.randint(1, 7, size=(n, 4))
\end{code}
3331

3332
To find the best three outcomes in each row, I'll sort along \py{axis=1}, which means across the columns.
3333

3334
\begin{code}
a.sort(axis=1)
\end{code}
3337

3338
Finally, I'll select the last three columns and add them up.
3339

3340
\begin{code}
t = a[:, 1:].sum(axis=1)
\end{code}
3343

3344
Now \py{t} is a one-dimensional array with 10,000 elements, one for each simulated roll of four dice.
3345
We can compute the PMF of the values in \py{t} like this:
3346

3347
\begin{code}
pmf_4d6 = Pmf.from_seq(t)
\end{code}
3350

3351
Figure~\ref{fig06-02} shows the distribution of the sum of three dice, \py{pmf_3d6}, and the distribution of the best three out of four, \py{pmf_4d6}.
3352

3353
\begin{figure}
3354
% chap06soln.ipynb
3355
\centerline{\includegraphics[width=4in]{figs/fig06-02.pdf}}
3356
\caption{Distributions of the sum of three dice and the best three of four.}
3357
\label{fig06-02}
3358
\end{figure}
3359

3360
As you might expect, choosing the best three out of four tends to yield higher values.
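Since four dice have only $6^4 = 1296$ equally likely outcomes, we can also check the simulation by enumerating them all.
The following sketch is an alternative to the simulation above, not a replacement for it; it uses only the standard library, and the name \py{best_three_of_four} is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def best_three_of_four():
    """Exact PMF of the sum of the best three of four six-sided dice."""
    counts = Counter(
        sum(sorted(roll)[1:])              # drop the lowest die
        for roll in product(range(1, 7), repeat=4)
    )
    n = 6**4                               # 1296 equally likely rolls
    return {total: Fraction(c, n) for total, c in sorted(counts.items())}

pmf_exact = best_three_of_four()
mean_exact = sum(q * p for q, p in pmf_exact.items())

print(float(pmf_exact[18]))   # probability of rolling an 18
print(float(mean_exact))      # compare with 10.5 for the sum of three dice
```

The simulated probabilities in \py{pmf_4d6} should be close to these exact values, with differences that shrink as the number of simulated rolls increases.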
3361

3362
Next we'll find the distribution for the maximum of six attributes, each the sum of the best three of four dice.
3363

3364

3365
\section{Maximum of Six}
3366

3367
To compute the distribution of a maximum or minimum, we can make good use of the cumulative distribution function.
3368
First, I'll compute the \py{Cdf} of the best three of four distribution:
3369

3370
\begin{code}
cdf_4d6 = pmf_4d6.make_cdf()
\end{code}
3373

3374
Recall that \py{Cdf(x)} is the sum of probabilities for quantities less than or equal to \py{x}.
3375
Equivalently, it is the probability that a random value chosen from the distribution is less than or equal to \py{x}.
3376

3377
Now suppose I draw 6 values from this distribution.
3378
The probability that all 6 of them are less than or equal to \py{x} is \py{Cdf(x)} raised to the 6th power, which we can compute like this:
3379

3380
\begin{code}
cdf_4d6**6
\end{code}
3383

3384
If all 6 values are less than or equal to \py{x}, that means that their maximum is less than or equal to \py{x}.
3385
So the result is the CDF of their maximum.
3386
We can convert it to a \py{Cdf} object, like this:
3387

3388
\begin{code}
cdf_max6 = Cdf(cdf_4d6**6)
\end{code}
3391

3392
And compute the equivalent \py{Pmf} like this:
3393

3394
\begin{code}
pmf_max6 = cdf_max6.make_pmf()
\end{code}
3397

3398
Figure~\ref{fig06-03} shows the result.
3399
Most characters have at least one attribute greater than 12; almost 10\% of them have an 18.
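To see where those numbers come from, here is a sketch of the same computation in plain Python.
As an assumption, it starts from the exact distribution of the best three of four dice, computed by enumeration, rather than the simulated \py{pmf_4d6}, so its results may differ slightly from the figure.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Exact PMF of the best three of four six-sided dice.
counts = Counter(sum(sorted(r)[1:]) for r in product(range(1, 7), repeat=4))
qs = sorted(counts)                                  # 3, 4, ..., 18
ps = [Fraction(counts[q], 6**4) for q in qs]

# CDF of one attribute, then CDF of the maximum of six attributes.
cdf = list(accumulate(ps))
cdf_max = [p**6 for p in cdf]

# Differences of the CDF give the PMF of the maximum.
pmf_max = [cdf_max[0]] + [b - a for a, b in zip(cdf_max, cdf_max[1:])]

prob_18 = pmf_max[qs.index(18)]
prob_gt_12 = 1 - cdf_max[qs.index(12)]
print(float(prob_18))       # chance that the best attribute is 18
print(float(prob_gt_12))    # chance that the best attribute exceeds 12
```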
3400

3401
\begin{figure}
3402
% chap06soln.ipynb
3403
\centerline{\includegraphics[width=4in]{figs/fig06-03.pdf}}
3404
\caption{Distribution for the minimum and maximum of six attributes.}
3405
\label{fig06-03}
3406
\end{figure}
3407

3408
\py{Pmf} and \py{Cdf} provide \py{max_dist}, which does the same computation.
3409
We can compute the \py{Pmf} of the maximum like this:
3410

3411
\begin{code}
pmf_4d6.max_dist(6)
\end{code}
3414

3415
And the \py{Cdf} of the maximum like this:
3416

3417
\begin{code}
cdf_4d6.max_dist(6)
\end{code}
3420

3421
In the next section we'll find the distribution of the minimum.
3422
The process is similar, but a little more complicated.
3423
See if you can figure it out before you go on.
3424

3425

3426

3427

3428
%In mathematical notation, we use $X$ to represent a random value from a %distribution, so we can write:
3429
%
3430
%\[ \CDF(x) = \p{X \le x} \]
3431
%
3432

3433
\section{Minimum of Six}
3434

3435
In the previous section we computed the distribution of a character's best attribute.
3436
Now let's compute the distribution of the worst.
3437

3438
To compute the distribution of the minimum, we'll use the {\bf complementary CDF}, which we can compute like this:
3439

3440
\begin{code}
prob_gt = 1 - cdf_4d6
\end{code}
3443

3444
As the variable name suggests, the complementary CDF is the probability that a value from the distribution is greater than \py{x}.
3445
If we draw 6 values from the distribution, the probability that all 6 exceed \py{x} is:
3446

3447
\begin{code}
prob_gt6 = prob_gt**6
\end{code}
3450

3451
If all 6 exceed \py{x}, that means their minimum exceeds \py{x}, so \py{prob_gt6} is the complementary CDF of the minimum.
3452
And that means we can compute the CDF of the minimum like this:
3453

3454
\begin{code}
prob_le6 = 1 - prob_gt6
\end{code}
3457

3458
The result is a Pandas Series that represents the CDF of the minimum of six attributes.
3459
We can put those values in a \py{Cdf} object like this:
3460

3461
\begin{code}
cdf_min6 = Cdf(prob_le6)
\end{code}
3464

3465
Figure~\ref{fig06-03} shows the result.
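If you want to convince yourself that the complementary CDF gives the right answer, the following sketch checks it against brute-force enumeration.
It uses a deliberately small assumed example, the minimum of two rolls of a six-sided die, so that all 36 pairs can be listed; \py{min_cdf} is a hypothetical helper, not a function from \py{empiricaldist}.

```python
from fractions import Fraction
from itertools import accumulate, product

def min_cdf(ps, n):
    """CDF of the minimum of n independent draws, given PMF values ps."""
    cdf = accumulate(ps)
    # 1 - cdf is the complementary CDF; raise it to the n-th power to
    # get the probability that all n draws exceed each quantity.
    return [1 - (1 - p)**n for p in cdf]

# Small example: a fair six-sided die, so the result is easy to verify.
qs = [1, 2, 3, 4, 5, 6]
ps = [Fraction(1, 6)] * 6
cdf_min2 = min_cdf(ps, 2)

# Brute-force check: enumerate all 36 pairs and take the minimum.
brute = [
    Fraction(sum(min(a, b) <= q for a, b in product(qs, repeat=2)), 36)
    for q in qs
]
print(cdf_min2 == brute)
```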
3466

3467
\py{Pmf} and \py{Cdf} provide \py{min_dist}, which does the same computation.
3468
We can compute the \py{Pmf} of the minimum like this:
3469

3470
\begin{code}
pmf_4d6.min_dist(6)
\end{code}
3473

3474
And the \py{Cdf} of the minimum like this:
3475

3476
\begin{code}
cdf_4d6.min_dist(6)
\end{code}
3479

3480
In the exercises at the end of the chapter, you'll use distributions of the minimum and maximum to do Bayesian inference.
3481
But first we'll see what happens when we mix distributions.
3482

3483

3484
\section{Mixtures}
3485
\label{mixture}
3486

3487
Let's do one more example inspired by {\it Dungeons~\&~Dragons}.
3488
Suppose I have a 4-sided die and a 6-sided die.
3489
I choose one of them at random and roll it.
3490
What is the distribution of the outcome?
3491

3492
If you know which die it is, the answer is easy.
3493
A die with \py{n} sides yields a uniform distribution from 1 to \py{n}, including both.
3494
We can compute \py{Pmf} objects to represent the dice, like this:
3495

3496
\begin{code}
d4 = make_die(4)
d6 = make_die(6)
\end{code}
3500

3501
To compute the distribution of the mixture, we can compute the average of the two distributions by adding them and dividing the result by 2:
3502

3503
\begin{code}
total = Pmf.add(d4, d6, fill_value=0) / 2
\end{code}
3506

3507
We have to use \py{Pmf.add} with \py{fill_value=0} because the two distributions don't have the same set of quantities.
3508
If they did, we could use the \py{+} operator.
3509

3510
Now suppose I have a 4-sided die and {\it two} 6-sided dice.
3511
Again, I choose one of them at random and roll it.
3512
What is the distribution of the outcome?
3513

3514
We can solve this problem by computing a weighted average of the distributions, like this:
3515

3516
\begin{code}
total = Pmf.add(d4, 2*d6, fill_value=0) / 3
\end{code}
3519

3520
Finally, suppose we have a box with the following mix:
3521

3522
\begin{verbatim}
1  4-sided die
2  6-sided dice
3  8-sided dice
\end{verbatim}
3527

3528
If I draw a die from this mix at random, we can use a \py{Pmf} to represent the hypothetical number of sides on the die:
3529

3530
\begin{code}
hypos = [4,6,8]
counts = [1,2,3]
pmf_dice = Pmf(counts, hypos)
pmf_dice.normalize()
\end{code}
3535

3536
And I'll make a sequence of \py{Pmf} objects to represent the dice:
3537

3538
\begin{code}
dice = [make_die(sides) for sides in hypos]
\end{code}
3541

3542
Now we have to multiply each distribution in \py{dice} by the corresponding probabilities in \py{pmf_dice}.
3543
To express this computation concisely, it is convenient to put the distributions into a Pandas DataFrame:
3544

3545
\begin{code}
pd.DataFrame(dice)
\end{code}
3548

3549
The result is a DataFrame with one row for each distribution and one column for each possible outcome.
3550
Not all rows are the same length, so Pandas fills the extra spaces with the special value \py{NaN}, which stands for ``not a number''.
3551
We can use \py{fillna} to replace the \py{NaN} values with 0.
3552

3553
\begin{code}
pd.DataFrame(dice).fillna(0)
\end{code}
3556

3557
Before we multiply by the probabilities in \py{pmf_dice}, we have to transpose the matrix so the distributions run down the columns rather than across the rows:
3558

3559
\begin{code}
df = pd.DataFrame(dice).fillna(0).transpose()
\end{code}
3562

3563
Now we can multiply by the probabilities:
3564

3565
\begin{code}
df *= pmf_dice.ps
\end{code}
3568

3569
And add up the weighted distributions:
3570

3571
\begin{code}
total = df.sum(axis=1)
\end{code}
3574

3575
The argument \py{axis=1} means we want to sum across the rows.
3576
The result is a Pandas Series.
3577

3578
Putting it all together, here's a function that makes a weighted mixture of distributions.
3579

3580
\begin{code}
def make_mixture(pmf, pmf_seq):
    df = pd.DataFrame(pmf_seq).fillna(0).transpose()
    df *= pmf.ps
    total = df.sum(axis=1)
    return Pmf(total)
\end{code}
3587

3588
%TODO: Add make_mixture to empiricaldist
3589

3590
The first parameter is a \py{Pmf} that maps from each hypothesis to a probability.
3591
The second parameter is a sequence of \py{Pmf} objects, one for each hypothesis.
3592
We can call it like this:
3593

3594
\begin{code}
mix = make_mixture(pmf_dice, dice)
\end{code}
3597

3598
Figure~\ref{fig06-04} shows the result, which is a mixture of uniform distributions.
3599

3600
\begin{figure}
3601
% chap06soln.ipynb
3602
\centerline{\includegraphics[width=4in]{figs/fig06-04.pdf}}
3603
\caption{Mixture of uniform distributions from three kinds of dice.}
3604
\label{fig06-04}
3605
\end{figure}
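For a box this small, we can check the mixture by hand.
The following sketch computes the same weighted mixture with dictionaries and exact fractions; \py{die_pmf} and \py{mix_pmfs} are hypothetical stand-ins for \py{make_die} and \py{make_mixture}, written without Pandas so the arithmetic is visible.

```python
from fractions import Fraction

def die_pmf(sides):
    """Uniform PMF for a die, as a dict from outcome to probability."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def mix_pmfs(weights, pmfs):
    """Weighted mixture of PMFs; the weights are normalized first."""
    total_weight = sum(weights)
    result = {}
    for w, pmf in zip(weights, pmfs):
        for q, p in pmf.items():
            # Outcomes missing from a PMF contribute 0, like fill_value=0.
            result[q] = result.get(q, 0) + Fraction(w, total_weight) * p
    return result

# One 4-sided die, two 6-sided dice, three 8-sided dice.
mixture = mix_pmfs([1, 2, 3], [die_pmf(4), die_pmf(6), die_pmf(8)])
print(mixture[1], mixture[8], sum(mixture.values()))
```

For example, an outcome of 8 is only possible with an 8-sided die, so its probability is $(3/6)(1/8) = 1/16$.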
3606

3607

3608

3609
\section{Summary}
3610

3611

3612
We have seen two representations of distributions: Pmfs and Cdfs.
3613
These representations are equivalent in the sense that they contain
3614
the same information, so you can convert from one to the other.  The
3615
primary difference between them is performance: some operations are
3616
faster and easier with a Pmf; others are faster with a Cdf.
3617
\index{Pmf} \index{Cdf}
3618

3619

3620
In this chapter we used \py{Cdf} objects to compute distributions of maxima and minima; these distributions are useful for inference if we are given a maximum or minimum as data.
3621

3622
We also computed mixtures of distributions, which we will use in the next chapter to make predictions.
3623

3624

3625
\section{Exercises}
3626

3627
The code for this chapter is in \py{chap06.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
3628
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap06.ipynb}.
3629

3630
The notebook provides space where you can work on the following problems.
3631

3632

3633
\begin{exercise}
3634
When you generate a {\it Dungeons~\&~Dragons} character, instead of rolling dice, you can use the ``standard array'' of attributes, which is 15, 14, 13, 12, 10, and 8.
3635

3636
Do you think you are better off using the standard array or (literally) rolling the dice?
3637

3638
Compare the distribution of the values in the standard array to the distribution we computed for the best three out of four:
3639

3640
\begin{itemize}
3641

3642
\item Which distribution has higher mean?  Use the \py{mean} method.
3643

3644
\item Which distribution has higher standard deviation?  Use the \py{std} method.
3645

3646
\item The lowest value in the standard array is 8.  For each attribute, what is the probability of getting a value less than 8?  If you roll the dice six times, what's the probability that at least one of your attributes is less than 8?
3647

3648
\item The highest value in the standard array is 15.  For each attribute, what is the probability of getting a value greater than 15?  If you roll the dice six times, what's the probability that at least one of your attributes is greater than 15?
3649

3650
\end{itemize}
3651

3652
\end{exercise}
3653

3654

3655
\begin{exercise}
3656
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
3657
I choose one of the dice at random, roll it, and report that the outcome is a 1.
3658
If I roll the same die again, what is the probability that I get another 1?
3659

3660
Hint: Compute the posterior distribution as we have done before and pass it as one of the arguments to \py{make_mixture}.
3661
\end{exercise}
3662

3663

3664
\begin{exercise}
3665
Suppose I have two boxes of dice:
3666

3667
\begin{itemize}
3668
\item One contains a 4-sided die and a 6-sided die.
3669

3670
\item The other contains a 6-sided die and an 8-sided die.
3671
\end{itemize}
3672

3673
I choose a box at random, choose a die, and roll it 3 times.  If I get 2, 4, and 6, which box do you think I chose?
3674
\end{exercise}
3675

3676

3677
\newcommand{\Poincare}{Poincar\'{e}}
3678

3679
\begin{exercise}
3680
Henri \Poincare~was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.
3681

3682
Supposedly \Poincare~suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.
3683

3684
For the next year, \Poincare~continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.
3685

3686
Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving \Poincare~the heavier ones.
3687

3688
To see whether this anecdote is plausible, let's suppose that when the baker sees \Poincare~coming, he hefts \py{n} loaves of bread and gives \Poincare~the heaviest one.  How many loaves would the baker have to heft to make the average of the maximum 1000 g?
3689
\end{exercise}
3690

3691

3692
\begin{exercise}
3693
Two doctors fresh out of medical school are arguing about whose hospital delivers more babies.  The first doctor says, ``I've been at Hospital A for two weeks, and already we've had a day when we delivered 20 babies.''
3694

3695
The second doctor says, ``I've only been at Hospital B for one week, but already there's been a 19-baby day.''
3696

3697
Which hospital do you think delivers more babies on average?  You can assume that the number of babies born in a day is well modeled by a Poisson distribution with parameter $\lambda$ (see \url{https://en.wikipedia.org/wiki/Poisson_distribution}).
3698

3699
\end{exercise}
3700

3701

3702
\begin{exercise}
3703
This question is related to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.
3704

3705
Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.
3706

3707
There are two traffic lights on the route. As I approach each light, there is a 40\% chance that it is green; in that case, it causes no delay. And there is a 60\% chance it is red; in that case, it causes a delay that is uniformly distributed from 0 to 60 seconds.
3708

3709
What is the posterior distribution of the time it would take to drive the route with no delays?
3710
\end{exercise}
3711

3712

3713

3714
\chapter{Poisson Processes}
3715
\label{prediction}
3716

3717
\newcommand{\lam}{\mathtt{\lambda}}
3718

3719
\section{The World Cup Problem}
3720

3721
In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2.  Based on this outcome:
3722

3723
\begin{enumerate}
3724

3725
\item How confident should we be that France is the better team?
3726

3727
\item If the same teams played again, what is the chance France would win again?
3728

3729
\end{enumerate}
3730

3731

3732
To answer these questions, we have to make some modeling decisions.
3733

3734
First, I'll assume that for any team against any other team there is some unknown goal-scoring rate, measured in goals per game, which I'll denote
3735
$\lam$.
3736

3737
Second, I'll assume that a goal is equally likely during any minute of a game.  So, in a 90-minute game, the probability of scoring during any minute is $\lam / 90$.
3738

3739
Third, I'll assume that a team never scores twice during the same minute.
3740

3741
Of course, none of these assumptions is absolutely true in the real world, but I think they are reasonable simplifications, and as we will see, they allow us to derive some useful results.
3742
As George Box said, ``All models are wrong; some are useful''
3743
(see \url{https://en.wikipedia.org/wiki/All_models_are_wrong}).
3744

3745
My strategy for answering these questions is:
3746

3747
\begin{enumerate}
3748

3749
\item Use statistics from previous games to choose a prior
3750
distribution for $\lam$.
3751

3752
\item Use the score from the game to estimate $\lam$ for each team.
3753

3754
\item Use the posterior distributions of $\lam$ to compute the
distribution of goals for each team and the probability that each team wins
the next game.
3757

3758
\end{enumerate}
3759

3760
\section{Poisson processes}
3761

3762
In mathematical statistics, a {\bf process} is a stochastic model of a
3763
physical system (``stochastic'' means that the model has some kind of
3764
randomness in it).
3765

3766
For example, a {\bf Bernoulli process} is a model of a
3767
sequence of events, called trials, in which each trial has two
3768
possible outcomes, usually called success and failure.
3769
So a Bernoulli process
3770
is a natural model for a series of coin flips, or a series of shots on
3771
goal.
3772
\index{process}
3773
\index{Bernoulli process}
3774

3775
A {\bf Poisson process} is the continuous version of a Bernoulli process,
3776
where an event can occur at any point in time with equal probability.
3777
Poisson processes can be used to model customers arriving in a store,
3778
buses arriving at a bus stop, or goals scored in a soccer game.
3779
\index{Poisson process}
3780

3781
In many real systems the probability of an event changes over time.
3782
Customers are more likely to go to a store at certain times of day,
3783
buses are supposed to arrive at fixed intervals, and goals are more
3784
or less likely at different times during a game.
3785

3786
But all models are based on simplifications, and in this case modeling
3787
a soccer game with a Poisson process is a reasonable choice.  Heuer,
3788
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
3789
and come to the same conclusion (see
3790
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}).
3791

3792
The benefit of using this model is that we can compute the distribution
3793
of goals per game efficiently, as well as the distribution of time
3794
between goals.  Specifically, if the average number of goals
3795
in a game is $\lam$, the distribution of goals per game is
3796
given by the Poisson PMF:
3797
\index{Poisson distribution}
3798
%
3799
\[ f(k; \lam) = \lam^k \exp(-\lam) ~/~ k! \]
3800
%
3801
And the distribution of time between goals is given by the
3802
exponential PDF:
3803
\index{exponential distribution}
3804
%
3805
\[ f(t; \lam) = \lam \exp(-\lam t) \]
3806
%
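
To connect these formulas to the modeling assumptions in the previous section, here is a small numerical sketch.
If each of the 90 minutes is treated as an independent trial with probability $\lam / 90$ of a goal, the number of goals follows a binomial distribution, and we can compare it with the Poisson PMF.
The value $\lam = 1.4$ and the function names are assumptions chosen for illustration.

```python
from math import comb, exp, factorial

lam = 1.4          # goal-scoring rate in goals per game
n = 90             # minutes per game, at most one goal per minute
p = lam / n        # probability of a goal in any one minute

def binomial_pmf(k):
    """Probability of k goals when each of n minutes is a Bernoulli trial."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    """Poisson PMF, matching the formula above."""
    return lam**k * exp(-lam) / factorial(k)

# Print the two distributions side by side for small numbers of goals.
for k in range(6):
    print(k, round(binomial_pmf(k), 4), round(poisson_pmf(k), 4))

# Largest pointwise difference between the two distributions.
max_diff = max(abs(binomial_pmf(k) - poisson_pmf(k)) for k in range(n + 1))
```

The two columns agree to within a few thousandths, which is one way to see the Poisson process as the limit of a Bernoulli process with many short trials.
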
3807
Let's start with the Poisson distribution.
3808

3809

3810
\section{The Poisson Distribution}
3811

3812
Suppose we know that the goal-scoring rate for one team against another is $\lam = 1.4$ goals per game.
3813
The following function computes the Poisson distribution of \py{k}, the number of goals the team scores in one game.
3814

3815
\begin{code}
from scipy.stats import poisson

def make_poisson_pmf($\lam$, high):
    qs = np.arange(high)
    ps = poisson.pmf(qs, $\lam$)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
3825

3826
The first parameter is the goal-scoring rate.
3827
The second is the upper bound of the distribution.
3828
In theory the Poisson distribution goes to infinity, but we can cut it off when we get to quantities with negligible probability.
3829

3830
As usual, the \py{qs} are the quantities in the distribution and the \py{ps} are their probabilities.
3831
SciPy provides \py{poisson}, which has a function called \py{pmf} that evaluates the PMF of the Poisson distribution.
3832

3833
The return value is a normalized \py{Pmf}.
3834
We can call \py{make_poisson_pmf} like this:
3835

3836
\begin{code}
pmf_goals = make_poisson_pmf($\lam$=1.4, high=10)
\end{code}
3839

3840

3841
\begin{figure}
3842
% chap07soln.ipynb
3843
\centerline{\includegraphics[width=4in]{figs/fig07-01.pdf}}
3844
\caption{Poisson distribution with $\lam=1.4$.}
3845
\label{fig07-01}
3846
\end{figure}
3847

3848
Figure~\ref{fig07-01} shows the result, a Poisson distribution with $\lam=1.4$.
3849
The most likely outcomes are 0, 1, and 2; higher values are possible but increasingly unlikely.
3850
Values above 7 are negligible.
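To put a number on ``negligible'', the following sketch computes how much probability is lost by truncating the distribution.
It evaluates the Poisson PMF directly with the standard library rather than SciPy; the variable names are illustrative.

```python
from math import exp, factorial

lam = 1.4
high = 10

# Poisson probabilities for 0 through high - 1 goals, before normalizing.
ps = [lam**k * exp(-lam) / factorial(k) for k in range(high)]

# Probability mass lost by truncating the distribution at high.
lost = 1 - sum(ps)

# Probability of more than 7 goals, which the text calls negligible.
tail_above_7 = 1 - sum(ps[:8])

print(lost, tail_above_7)
```

With these values, both quantities are far below 1\%, so normalizing the truncated distribution changes the probabilities very little.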
3851

3852
If we know the goal-scoring rate, we can predict the number of goals.
3853
Now let's turn it around: given a number of goals, what can we say about the goal-scoring rate?
3854

3855
To answer that, we need to think about the prior distribution of $\lam$.
3856
And for that, I am going to use a gamma distribution.
3857

3858

3859
\section{The Gamma Distribution}
3860

3861
If you have ever seen a soccer game, you have some information about $\lam$.
3862
In most games, teams score a few goals each.
3863
In rare cases, a team might score more than 5 goals, but they almost never score more than 10.
3864

3865
Using data from previous World Cups,
I estimate that each team scores about 1.4 goals per game, on average (see \url{https://www.statista.com/statistics/269031/goals-scored-per-game-at-the-fifa-world-cup-since-1930/}).  So I'll set the mean of $\lam$ to be 1.4.
3867

3868
For a good team against a bad one, we expect $\lam$ to be higher; for a bad team against a good one, we expect it to be lower.
3869

3870
To model the distribution of goal-scoring rates, I will use a gamma distribution, which I chose because:
3871

3872
\begin{enumerate}
3873

3874
\item The goal-scoring rate is a continuous quantity that cannot be less than 0; the gamma distribution is appropriate for this kind of quantity.

\item With the scale parameter fixed at 1, which is SciPy's default, the gamma distribution has only one parameter, $\alpha$, which is also its mean.  So it's easy to construct a gamma distribution with the mean we want.

\item As we'll see, the shape of the gamma distribution is a reasonable choice, given what we know about soccer.
3879

3880
\end{enumerate}
3881

3882
For more about the gamma distribution, see \url{https://en.wikipedia.org/wiki/Gamma_distribution}.
3883

3884
The gamma distribution is continuous, but we'll approximate it with a discrete \py{Pmf}.
3885
SciPy provides \py{gamma}, which provides \py{pdf}, which evaluates the {\bf probability density function} (PDF) of the gamma distribution.
3886

3887
\newcommand{\alf}{\mathtt{\alpha}}
3888

3889
\begin{code}
from scipy.stats import gamma

$\alf$ = 1.4
qs = np.linspace(0, 10, 101)
ps = gamma.pdf(qs, $\alf$)
\end{code}
3896

3897
The \py{qs} are possible values of $\lam$ from 0 to 10.
3898
The \py{ps} are probability densities, which we can think of as unnormalized probabilities.
3899
We can put the densities in a \py{Pmf} and normalize them, like this:
3900

3901
\begin{code}
prior = Pmf(ps, qs)
prior.normalize()
\end{code}
3905

3906
The result is a discrete approximation of a continuous distribution.
3907
Figure~\ref{fig07-02} shows what it looks like.
3908

3909
\begin{figure}
3910
% chap07soln.ipynb
3911
\centerline{\includegraphics[width=4in]{figs/fig07-02.pdf}}
3912
\caption{A gamma prior distribution of goal-scoring rate.}
3913
\label{fig07-02}
3914
\end{figure}
3915

3916
This distribution represents our prior knowledge about goal scoring: $\lam$ is usually less than 2, occasionally as high as 6, and seldom higher than that.  And the mean is about 1.4.
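To check the mean of the discretized prior, here is a sketch that evaluates the gamma density directly, assuming shape $\alpha = 1.4$ and scale 1, and normalizes it on the same grid.
It uses \py{math.gamma}, which is the gamma function and not the SciPy distribution object; \py{gamma_pdf} is a hypothetical helper.

```python
import math

alpha = 1.4

def gamma_pdf(x, alpha):
    """Density of a gamma distribution with shape alpha and scale 1."""
    return x**(alpha - 1) * math.exp(-x) / math.gamma(alpha)

# Grid of 101 hypothetical goal-scoring rates from 0 to 10.
qs = [i / 10 for i in range(101)]
densities = [gamma_pdf(q, alpha) for q in qs]

# Normalizing the densities turns them into a discrete prior.
total = sum(densities)
prior_ps = [d / total for d in densities]

prior_mean = sum(q * p for q, p in zip(qs, prior_ps))
print(round(prior_mean, 2))
```

The mean of the discrete approximation is close to $\alpha$, but not exactly equal, because of the grid spacing and the cutoff at 10.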
3917

3918
As usual, reasonable people could disagree about the details of the prior, but this is good enough to get started.
3919
Let's do an update.
3920

3921

3922

3923
\section{Update}
3924

3925
Now that we have a prior, the next step is to compute the likelihood of the data.
3926
For France, the data is the number of goals scored, 4.
3927
We can use the Poisson distribution to compute the likelihoods:
3928

3929
\begin{code}
3930
$\lam$s = prior.qs
3931
k = 4
3932
likelihood = poisson.pmf(k, $\lam$s)
3933
\end{code}
3934

3935
The result is a NumPy array with the likelihood of the data for each hypothetical value of $\lam$.
3936
So we can do the update like this:
3937

3938
\begin{code}
3939
def update_poisson(pmf, data):
3940
    k = data
3941
    $\lam$s = pmf.qs
3942
    likelihood = poisson.pmf(k, $\lam$s)
3943
    pmf *= likelihood
3944
    pmf.normalize()
3945
\end{code}
3946

3947
The first parameter is the prior; the second is the number of goals.
3948
We can use this function to compute posterior distributions for France and Croatia:
3949

3950
\begin{code}
3951
france = prior.copy()
3952
update_poisson(france, 4)
3953

3954
croatia = prior.copy()
3955
update_poisson(croatia, 2)
3956
\end{code}
3957

3958
Figure~\ref{fig07-03} shows the results.
3959

3960
\begin{figure}
3961
% chap07soln.ipynb
3962
\centerline{\includegraphics[width=4in]{figs/fig07-03.pdf}}
3963
\caption{Posterior distributions of the goal-scoring rate for France and Croatia.}
3964
\label{fig07-03}
3965
\end{figure}
3966

3967
Recall that the mean of the prior distribution is 1.4.
3968
After Croatia scores 2 goals, their posterior mean is 1.7, which is near the midpoint of the prior and the data.
Likewise, after France scores 4 goals, their posterior mean is 2.7.
3970

3971
These results are typical of a Bayesian update: the location of the posterior distribution is a compromise between the prior and the data.
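
The compromise can be checked numerically.  The following sketch repeats the update using only the standard library, with a dictionary standing in for \py{Pmf}; unlike \py{update_poisson} above, the helper here returns a new distribution instead of modifying its argument.  As a cross-check it also prints $(\alpha + k)/2$, which is the mean of the exact gamma posterior when the prior is a gamma distribution with shape $\alpha$ and scale 1 and the data are $k$ goals in one game.

```python
import math

def normalize(d):
    """Scale the values of a dictionary so they add up to 1."""
    total = sum(d.values())
    return {q: p / total for q, p in d.items()}

def mean(d):
    return sum(q * p for q, p in d.items())

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def update_poisson(pmf, k):
    """Return the posterior after observing k goals in one game."""
    return normalize({lam: p * poisson_pmf(k, lam) for lam, p in pmf.items()})

# Discrete gamma prior with shape 1.4 (scale 1 is assumed)
alpha = 1.4
qs = [i / 10 for i in range(101)]
prior = normalize({q: q ** (alpha - 1) * math.exp(-q) for q in qs})

france = update_poisson(prior, 4)
croatia = update_poisson(prior, 2)

for name, k, post in [("France", 4, france), ("Croatia", 2, croatia)]:
    exact = (alpha + k) / 2   # mean of the exact gamma posterior
    print(name, round(mean(post), 2), round(exact, 2))
```

In both cases the grid result should agree with the exact value to about two decimal places, and both lie between the prior mean and the observed number of goals.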
3972

3973

3974
\section{Probability of Superiority}
3975

3976
Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?
3977

3978
In the model, ``better'' means having a higher goal-scoring rate against the opponent.
3979
We can use the posterior distributions to compute the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.
3980

3981
One way to do that is to enumerate all pairs of values from the two distributions, adding up the total probability that one value exceeds the other, as in this function:
3982

3983
\begin{code}
3984
def prob_gt(pmf1, pmf2):
3985
    total = 0
3986
    for q1, p1 in pmf1.items():
3987
        for q2, p2 in pmf2.items():
3988
            if q1 > q2:
3989
                total += p1 * p2
3990
    return total
3991
\end{code}
3992

3993
This is similar to the method we use in Section~\ref{addends} to compute the distribution of a sum.
3994
Here's how we use it:
3995

3996
\begin{code}
3997
prob_gt(france, croatia)
3998
\end{code}
3999

4000
\py{Pmf} provides a function that does the same thing, which we can call like this:
4001

4002
\begin{code}
4003
Pmf.prob_gt(france, croatia)
4004
\end{code}
4005

4006
The result is close to 75\%.  So, on the basis of this game, we are reasonably confident that France is the better team.
4007

4008
Of course, we should remember that this result is based on the assumption that the goal-scoring rate is constant.
4009
In reality, if a team is down by one goal, they might play more aggressively toward the end of the game, making them more likely to score, but also more likely to give up an additional goal.
4010

4011
As always, the results are only as good as the model.
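
The nested loop in \py{prob_gt} does work proportional to the square of the number of quantities.  When both distributions are defined on the same grid, as they are here, a running total gives the same answer in one pass.  The sketch below is self-contained and uses dictionaries in place of \py{Pmf}; \py{prob_gt_cumulative} is a hypothetical name chosen for this example, not a function from the book's library.

```python
import math

def normalize(d):
    total = sum(d.values())
    return {q: p / total for q, p in d.items()}

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def update_poisson(pmf, k):
    return normalize({lam: p * poisson_pmf(k, lam) for lam, p in pmf.items()})

def prob_gt(pmf1, pmf2):
    """Enumerate all pairs, as in the text."""
    total = 0.0
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            if q1 > q2:
                total += p1 * p2
    return total

def prob_gt_cumulative(pmf1, pmf2):
    """Same quantity in one pass; assumes both share the same grid."""
    total = 0.0
    below = 0.0   # probability that a value from pmf2 is below q
    for q in sorted(pmf1):
        total += pmf1[q] * below
        below += pmf2[q]
    return total

alpha = 1.4
qs = [i / 10 for i in range(101)]
prior = normalize({q: q ** (alpha - 1) * math.exp(-q) for q in qs})
france = update_poisson(prior, 4)
croatia = update_poisson(prior, 2)

p_loop = prob_gt(france, croatia)
p_fast = prob_gt_cumulative(france, croatia)
print(round(p_loop, 3), round(p_fast, 3))
```

The two functions should agree up to floating-point error; the one-pass version only matters when the grid is much finer than the one used here.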
4012

4013

4014
\section{The distribution of goals}
4015

4016
Now we can take on the second question: If the same teams played again, what is the chance France would win the rematch?
4017

4018
To answer this question, we'll generate a {\bf posterior predictive distribution} for each team, which is the distribution of the number of goals we expect them to score.
4019

4020
If we knew the goal scoring rate, $\lam$, the distribution of goals would be a Poisson distribution with parameter $\lam$.
4021

4022
Since we don't know $\lam$, the distribution of goals is a mixture of Poisson distributions with different values of $\lam$.
4023

4024
First I'll generate a sequence of Poisson distributions, one for each hypothetical value of $\lam$:
4025

4026
\begin{code}
4027
pmf_seq = [make_poisson_pmf($\lam$, 12) for $\lam$ in prior.qs]
4028
\end{code}
4029

4030
Now we can use \py{make_mixture} from Section~\ref{mixture} to compute posterior predictive distributions for France and Croatia:
4031

4032
\begin{code}
4033
pred_france = make_mixture(france, pmf_seq)
4034
pred_croatia = make_mixture(croatia, pmf_seq)
4035
\end{code}
4036

4037
Figure~\ref{fig07-04} shows posterior predictive distributions for the number of goals in a rematch.
4038

4039
\begin{figure}
4040
% chap07soln.ipynb
4041
\centerline{\includegraphics[width=5.5in]{figs/fig07-04.pdf}}
4042
\caption{Posterior predictive distributions for the number of goals in a rematch.}
4043
\label{fig07-04}
4044
\end{figure}
4045

4046
These distributions represent two sources of uncertainty: we don't know the actual value of $\lam$, and even if we did, we would not know the number of goals in the next game.
4047

4048
We can use these distributions to compute the probability that France wins, loses, or ties the rematch:
4049

4050
\begin{code}
4051
win = Pmf.prob_gt(pred_france, pred_croatia)
4052
lose = Pmf.prob_lt(pred_france, pred_croatia)
4053
tie = Pmf.prob_eq(pred_france, pred_croatia)
4054
\end{code}
4055

4056
Assuming that France wins half of the ties, their chance of winning the rematch is about 65\%.
4057
This is a bit lower than their probability of superiority, which is 75\%. And that makes sense: even if they are the better team, they might lose the game.
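
To make the mixture concrete, here is a sketch of the whole calculation using only the standard library.  The helper names (\py{normalize}, \py{predictive}, \py{compare}) are made up for this example and dictionaries stand in for \py{Pmf}; the code follows the same steps as \py{make_poisson_pmf} and \py{make_mixture} but is not their implementation.

```python
import math

def normalize(d):
    total = sum(d.values())
    return {q: p / total for q, p in d.items()}

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def update_poisson(pmf, k):
    return normalize({lam: p * poisson_pmf(k, lam) for lam, p in pmf.items()})

goals = range(12)   # possible numbers of goals: 0 through 11

def predictive(posterior):
    """Mixture of Poisson distributions weighted by the posterior."""
    mix = {k: 0.0 for k in goals}
    for lam, p in posterior.items():
        pmf_k = normalize({k: poisson_pmf(k, lam) for k in goals})
        for k in goals:
            mix[k] += p * pmf_k[k]
    return mix

def compare(pmf1, pmf2):
    """Probabilities that a draw from pmf1 is above, below, or equal to pmf2."""
    win = lose = tie = 0.0
    for k1, p1 in pmf1.items():
        for k2, p2 in pmf2.items():
            if k1 > k2:
                win += p1 * p2
            elif k1 < k2:
                lose += p1 * p2
            else:
                tie += p1 * p2
    return win, lose, tie

alpha = 1.4
qs = [i / 10 for i in range(101)]
prior = normalize({q: q ** (alpha - 1) * math.exp(-q) for q in qs})
pred_france = predictive(update_poisson(prior, 4))
pred_croatia = predictive(update_poisson(prior, 2))

win, lose, tie = compare(pred_france, pred_croatia)
print(round(win + tie / 2, 2))   # France's chance if ties are split evenly
```

Splitting the ties evenly is the same assumption used in the text; a different tie-breaking rule would change the final number.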
4058

4059

4060
\section{The Exponential Distribution}
4061
\label{exponential}
4062

4063
As an exercise at the end of this chapter, you'll have a chance to work on  this variation on the World Cup Problem:
4064

4065
\begin{quote}
4066
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
4067
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
4068
What was the probability that they would score 5 more goals (as, in fact, they did)?
4069
\end{quote}
4070

4071
In this version, notice that the data is not the number of goals in a fixed period of time but the time between goals.
4072

4073
To compute the likelihood of data like this, we can use the theory of Poisson processes again.
4074
In our model of a soccer game, we assume that each team has a goal-scoring rate, $\lam$, in goals per game.
4075
And we assume that $\lam$ is constant, so the chance of scoring a goal is the same at any moment of the game.
4076

4077
Under these assumptions, the time between goals follows an exponential distribution (see \url{https://en.wikipedia.org/wiki/Exponential_distribution}).
4078
If the goal-scoring rate is $\lam$, the probability of seeing an interval between goals of $t$ is proportional to the PDF of the exponential distribution:
4079

4080
\[ f(t; \lam) = \lam \exp(-\lam t) \]
4081

4082
Because $t$ is a continuous quantity, the value of this expression is not a probability; it is a probability density.
4083
However, it is proportional to the probability of the data, so we can use it as a likelihood in a Bayesian update.
4084

4085
The following function computes this PDF:
4086

4087
\begin{code}
4088
def expo_pdf(t, $\lam$):
4089
    return $\lam$ * np.exp(-$\lam$ * t)
4090
\end{code}
4091

4092
To see what exponential distributions look like, let's assume again that $\lam$ is 1.4; we can compute the distribution of $t$ like this:
4093

4094
\begin{code}
4095
$\lam$ = 1.4
4096
qs = np.linspace(0, 4, 101)
4097
ps = expo_pdf(qs, $\lam$)
4098
pmf_time = Pmf(ps, qs)
4099
pmf_time.normalize()
4100
\end{code}
4101

4102
\begin{figure}
4103
% chap01soln.ipynb
4104
\centerline{\includegraphics[width=4in]{figs/fig07-05.pdf}}
4105
\caption{An exponential distribution with $\lam = 1.4$.}
4106
\label{fig07-05}
4107
\end{figure}
4108

4109
Figure~\ref{fig07-05} shows the result.
4110
It is counterintuitive, but true, that the most likely time to score a goal is immediately.  After that, the probability of each possible interval is a little lower.
4111

4112
With a goal-scoring rate of 1.4, it is possible that a team will take more than one game to score a goal, but it is unlikely that they will take more than two games.
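
These statements can be quantified with the survival function of the exponential distribution, $\exp(-\lam t)$, which is the probability that the time until the next goal exceeds $t$.  A small sketch, with time measured in games and a hypothetical helper name:

```python
import math

lam = 1.4   # goals per game, as in the example above

def prob_longer_than(t, lam):
    """P(time until next goal > t) for an exponential distribution."""
    return math.exp(-lam * t)

p_more_than_one = prob_longer_than(1, lam)   # about 0.25
p_more_than_two = prob_longer_than(2, lam)   # about 0.06
mean_wait = 1 / lam                          # about 0.71 games

print(round(p_more_than_one, 3), round(p_more_than_two, 3), round(mean_wait, 2))
```

So, under this model, a team with this rate goes a full game without scoring about one time in four, and two full games about 6\% of the time.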
4113

4114

4115
\section{Summary}
4116

4117

4118

4119
\section{Exercises}
4120

4121
The code for this chapter is in \py{chap07.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
4122
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap07.ipynb}.
4123

4124
The notebook provides space where you can work on the following problems.
4125

4126

4127
\begin{exercise}
4128
Finish off the exercise from Section~\ref{exponential}:
4129

4130
\begin{quote}
4131
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
4132
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
4133
What was the probability that they would score 5 more goals (as, in fact, they did)?
4134
\end{quote}
4135

4136
\end{exercise}
4137

4138
\begin{exercise}
4139
\end{exercise}
4140

4141

4142

4143
\begin{exercise}
4144
In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
4145
Bruins played a best-of-seven championship series against the despised
4146
Vancouver Canucks.  Boston lost the first two games 0-1 and 2-3, then
4147
won the next two games 8-1 and 4-0.  At this point in the series, what
4148
is the probability that Boston will win the next game, and what is
4149
their probability of winning the championship?
4150

4151
To choose a prior distribution, I got some statistics from
4152
\url{http://www.nhl.com}, specifically the average goals per game
4153
for each team in the 2010-11 season.  The distribution is well modeled by a gamma distribution with mean 2.8.
4154
\index{National Hockey League}
4155
\index{NHL}
4156
\index{hockey}
4157
\index{Boston Bruins}
4158
\index{Vancouver Canucks}
4159
\end{exercise}
4160

4161

4162

4163

4164

4165
\begin{exercise}
4166

4167
If buses arrive at a bus stop every 20 minutes, and you
4168
arrive at the bus stop at a random time, your wait time until
4169
the bus arrives is uniformly distributed from 0 to 20 minutes.
4170
\index{bus stop problem}
4171

4172
But in reality, there is variability in the time between
4173
buses.  Suppose you are waiting for a bus, and you know the historical
4174
distribution of time between buses.  Compute your distribution
4175
of wait times.
4176

4177
Hint: Suppose that the time between buses is either
4178
5 or 10 minutes with equal probability.  What is the probability
4179
that you arrive during one of the 10 minute intervals?
4180

4181
I solve a version of this problem in the next chapter.
4182

4183
\end{exercise}
4184

4185

4186
\begin{exercise}
4187

4188
Suppose that passengers arriving at the bus stop are well-modeled
4189
by a Poisson process with parameter $\lam$.  If you arrive at the
4190
stop and find 3 people waiting, what is your posterior distribution
4191
for the time since the last bus arrived?
4192
\index{Poisson process}
4193
\index{bus stop problem}
4194

4195
I solve a version of this problem in the next chapter.
4196

4197
\end{exercise}
4198

4199

4200
\begin{exercise}
4201

4202
Suppose that you are an ecologist sampling the insect population in
4203
a new environment.  You deploy 100 traps in a test area and come back
4204
the next day to check on them.  You find that 37 traps have been
4205
triggered, trapping an insect inside.  Once a trap triggers, it
4206
cannot trap another insect until it has been reset.
4207
\index{insect sampling problem}
4208

4209
If you reset the traps and come back in two days, how many traps
4210
do you expect to find triggered?  Compute a posterior predictive
4211
distribution for the number of traps.
4212
\index{predictive distribution}
4213

4214
\end{exercise}
4215

4216

4217
\begin{exercise}
4218

4219
Suppose you are the manager of an apartment building with
4220
100 light bulbs in common areas.  It is your responsibility
4221
to replace light bulbs when they break.
4222
\index{light bulb problem}
4223

4224
On January 1, all 100 bulbs are working.  When you inspect
4225
them on February 1, you find 3 light bulbs out.  If you
4226
come back on April 1, how many light bulbs do you expect to
4227
find broken?
4228

4229
In the previous exercise, you could reasonably assume that an event is
4230
equally likely at any time.  For light bulbs, the likelihood of
4231
failure depends on the age of the bulb.  Specifically, old bulbs
4232
have an increasing failure rate due to evaporation of the filament.
4233

4234
This problem is more open-ended than some; you will have to make
4235
modeling decisions.  You might want to read about the Weibull
4236
distribution
4237
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
4238
Or you might want to look around for information about
4239
light bulb survival curves.
4240
\index{Weibull distribution}
4241

4242
\end{exercise}
4243

4244

4245

4246
\chapter{Decision Analysis}
4247
\label{decisionanalysis}
4248

4249
In this chapter....
4250

4251
... we estimate the price of prizes on a game show.
4252
Once we compute a posterior distribution, we'll use it to optimize a decision-making process.
4253

4254
This example demonstrates the real power of Bayesian methods, not just computing posterior distributions, but using them to make better decisions.
4255

4256

4257
\section{The {\it Price is Right} problem}
4258

4259
On November 1, 2007, contestants named Letia and Nathaniel appeared
4260
on {\it The Price is Right}, an American game show.  They competed in
4261
a game called {\it The Showcase}, where the objective is to guess the price
4262
of a showcase of prizes.  The contestant who comes closest to the
4263
actual price of the showcase, without going over, wins the prizes.
4264

4265
\index{Price is Right}
4266
\index{Showcase}
4267

4268
Nathaniel went first.  His showcase included a dishwasher, a wine
4269
cabinet, a laptop computer, and a car.  He bid \$26,000.
4270

4271
Letia's showcase included a pinball machine, a video arcade game, a
4272
pool table, and a cruise of the Bahamas.  She bid \$21,500.
4273

4274
The actual price of Nathaniel's showcase was \$25,347.  His bid
4275
was too high, so he lost.
4276

4277
The actual price of Letia's showcase was \$21,578.  She was only
4278
off by \$78, so she won her showcase and, because
4279
her bid was off by less than \$250, she also won Nathaniel's
4280
showcase.
4281

4282
For a Bayesian thinker, this scenario suggests several questions:
4283

4284
\begin{enumerate}
4285

4286
\item Before seeing the prizes, what prior beliefs should the
4287
  contestant have about the price of the showcase?
4288

4289
\item After seeing the prizes, how should the contestant update
4290
  those beliefs?
4291

4292
\item Based on the posterior distribution, what should the
4293
  contestant bid?
4294

4295
\end{enumerate}
4296

4297
The third question demonstrates a common use of Bayesian analysis:
4298
decision analysis.  Given a posterior distribution, we can choose
4299
the bid that maximizes the contestant's expected return.
4300

4301
\index{decision analysis}
4302

4303
This problem is inspired by an example in Cameron Davidson-Pilon's
4304
book, {\it Probabilistic Programming and Bayesian Methods for Hackers}
4305
(see \url{http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers}).
4306

4307
\index{Davidson-Pilon, Cameron}
4308

4309

4310
\section{The prior}
4311

4312
To choose a prior distribution of prices, we can take advantage
4313
of data from previous episodes.
4314
Fortunately, fans of the show keep detailed records (see \url{https://web.archive.org/web/20121107204942/http://www.tpirsummaries.8m.com/}).
4315

4316
For this example, I downloaded files containing the price of each showcase from the 2011 and 2012 seasons and the bids offered by the contestants.
4317

4318
This dataset contains the prices for 313 previous showcases, which we can think of as a sample from the population of possible prices.
4319

4320
We can use this sample to estimate the prior distribution of showcase prices.
4321
One way to do that is {\bf kernel density estimation} (KDE), which uses the sample to estimate a smooth distribution.
4322

4323
SciPy provides \py{gaussian_kde}, which takes a sample and returns an object that represents the estimated distribution.
4324
\index{kernel density estimation}
4325
\index{KDE}
4326

4327
The following function takes a sample, makes a KDE, evaluates it at a given sequence of quantities, and returns the result as a normalized \py{Pmf}:
4328

4329
\begin{code}
4330
from scipy.stats import gaussian_kde
4331

4332
def make_kde(qs, sample):
4333
    kde = gaussian_kde(sample)
4334
    ps = kde(qs)
4335
    pmf = Pmf(ps, qs)
4336
    pmf.normalize()
4337
    return pmf
4338
\end{code}
4339

4340
We can use it to estimate the distribution of total price for Showcase 1:
4341

4342
\begin{code}
4343
qs = np.linspace(0, 80000, 81)
4344
prior1 = make_kde(qs, df['Showcase 1'])
4345
\end{code}
4346

4347
\begin{figure}
4348
% chap08soln.ipynb
4349
\centerline{\includegraphics[width=4in]{figs/fig08-01.pdf}}
4350
\caption{Distribution of total price for Showcase 1}
4351
\label{fig08-01}
4352
\end{figure}
4353

4354
Figure~\ref{fig08-01} shows the estimated distribution.
4355
The most common price is around
4356
\$28,000, but there might be a second mode near \$50,000.
4357

4358
If you were a contestant on the
4359
show, you could use this distribution to quantify your prior belief
4360
about the price of each showcase (before you see the prizes).
4361

4362

4363
\py{gaussian_kde} builds its estimate from Gaussian kernels; here is the PDF of a Gaussian distribution with
4364
mean 0 and standard deviation 1:
4365
%
4366
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
4367
%
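
Kernel density estimation places a scaled copy of this curve at each value in the sample and averages them.  The following sketch does that by hand, using only the standard library.  The sample is made up, \py{kde_pmf} is a hypothetical helper, and the bandwidth follows Scott's rule, which is assumed here to match the default in \py{gaussian_kde}; treat it as an illustration rather than a replacement for the SciPy function.

```python
import math
import statistics

def gaussian_pdf(x):
    """PDF of a Gaussian distribution with mean 0 and standard deviation 1."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def kde_pmf(qs, sample):
    """Evaluate a Gaussian KDE at qs and return normalized probabilities."""
    n = len(sample)
    # Bandwidth from Scott's rule (assumed): sample std times n^(-1/5)
    h = statistics.stdev(sample) * n ** (-1 / 5)
    dens = [sum(gaussian_pdf((q - x) / h) for x in sample) / (n * h)
            for q in qs]
    total = sum(dens)
    return {q: d / total for q, d in zip(qs, dens)}

# Hypothetical showcase prices, not taken from the dataset
sample = [21000, 24000, 26000, 28000, 29000, 31000, 33000, 48000, 52000]
qs = [1000 * i for i in range(81)]   # 0, 1000, ..., 80000

pmf = kde_pmf(qs, sample)
mode = max(pmf, key=pmf.get)
print(mode)
```

A wider bandwidth would smooth away the small bump from the two high prices; a narrower one would make it more prominent.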
4368

4369

4370

4371

4372

4373
\section{Modeling the contestants}
4374

4375
When the contestants see the prizes, they get information they can use to update their beliefs.
4376
To do that, we have to answer these questions:
4377

4378
\begin{enumerate}
4379

4380
\item What data should we consider and how should we quantify it?
4381

4382
\item Can we compute a likelihood function; that is,
4383
  for each hypothetical value of \py{price}, can we compute
4384
  the conditional likelihood of the data?
4385

4386
\end{enumerate}
4387

4388
To answer these questions, I model the contestant
4389
as a price-guessing instrument with known error characteristics.
4390
In other words, when the contestant sees the prizes, they
4391
guess the price of each prize---ideally without taking into
4392
consideration the fact that the prize is part of a showcase---and
4393
add up the prices.  Let's call this total \py{guess}.
4394
\index{error}
4395

4396
Under this model, the question we have to answer is, ``If the
4397
actual price is \py{price}, what is the likelihood that the
4398
contestant's estimate would be \py{guess}?''
4399
\index{likelihood}
4400

4401
Or if we define \py{error = guess - price}, we can ask, ``What is the likelihood that the contestant's estimate is off by \py{error}?''
4402

4403
To answer this question, I'll use the historical data again.
4404
For each showcase in the dataset, let's look at the difference between the contestant's bid and the actual price:
4405

4406
\begin{code}
4407
sample_diff1 = df['Bid 1'] - df['Showcase 1']
4408
sample_diff2 = df['Bid 2'] - df['Showcase 2']
4409
\end{code}
4410

4411
To visualize the distribution of these differences, we can use KDE again.
4412

4413
\begin{code}
4414
qs = np.linspace(-40000, 20000, 61)
4415
kde_diff1 = make_kde(qs, sample_diff1)
4416
kde_diff2 = make_kde(qs, sample_diff2)
4417
\end{code}
4418

4419
\begin{figure}
4420
% chap08soln.ipynb
4421
\centerline{\includegraphics[width=4in]{figs/fig08-02.pdf}}
4422
\caption{Distribution of differences for the two contestants.}
4423
\label{fig08-02}
4424
\end{figure}
4425

4426
Figure~\ref{fig08-02} shows the results.
4427

4428
It looks like the bids are too low more often than too high, which makes sense.
4429
Remember that under the rules of the game, you lose if you overbid, so contestants probably underbid to some degree deliberately.
4430

4431
We can use the observed distribution of differences to model the contestant's distribution of errors.
4432
This step is a little tricky because we don't actually know the contestant's guesses; we only know what they bid.
4433
So we have to make some assumptions:
4434

4435
\begin{enumerate}
4436

4437
\item I'll assume that contestants underbid because they are being strategic, and that on average their guesses are accurate.  In other words, the mean of their errors is 0.
4438

4439
\item But I'll assume that the spread of the differences reflects the actual spread of their errors.  So, I'll use the standard deviation of the differences as the standard deviation of their errors.
4440

4441
\end{enumerate}
4442

4443
Based on these assumptions, I'll make a normal distribution with mean 0 and standard deviation \py{std_diff1}:
4444

4445
\begin{code}
from scipy.stats import norm

# standard deviation of the observed differences (bid minus price)
std_diff1 = sample_diff1.std()
error_dist1 = norm(0, std_diff1)
\end{code}
4450

4451
The result is an object that represents the distribution of errors for Player 1.
4452
Among other things, this object can compute the PDF of a normal distribution, which we will use in the next section.
4453

4454
\index{normal distribution}
4455

4456
This model is not perfect because contestants' bids are sometimes strategic; for example, if Player 2 thinks that Player 1
4457
has overbid, Player 2 might make a very low bid.
4458
In that case \py{diff} does not reflect \py{error}.
4459
If this happens a lot, the observed variance in \py{diff} might overestimate the variance in \py{error}.
4460
Nevertheless, I think it is a reasonable modeling decision.
4461

4462
As an alternative, someone preparing to appear on the show could
4463
estimate their own distribution of \py{error} by watching previous shows
4464
and recording their guesses and the actual prices.
4465

4466

4467
\section{Update}
4468

4469
Now we are ready to do the update.
4470

4471
Suppose you are Player 1.  You see the prizes in your showcase and your estimate of the total price is \$23,000.
4472

4473
From your guess, I'll subtract each hypothetical price in the prior distribution.
4474
The result is your error under each hypothesis.
4475

4476
\begin{code}
4477
guess1 = 23000
4478
qs = prior1.index
4479
error1 = guess1 - qs
4480
\end{code}
4481

4482
Now suppose you know based on past performance that your estimation error is well modeled by \py{error_dist1}.
4483

4484
Under that assumption we can compute the likelihood of your estimate under each hypothesis.
4485

4486
\begin{code}
4487
likelihood1 = error_dist1.pdf(error1)
4488
\end{code}
4489

4490
And we can use that likelihood to update the prior.
4491

4492
\begin{code}
4493
posterior1 = prior1 * likelihood1
4494
posterior1.normalize()
4495
\end{code}
4496

4497
Figure~\ref{fig08-03} shows this posterior distribution along with the prior.
4498
Because your estimate is in the lower end of the range, the posterior distribution has shifted to the left.
4499

4500
\begin{figure}
4501
% chap08soln.ipynb
4502
\centerline{\includegraphics[width=4in]{figs/fig08-03.pdf}}
4503
\caption{Prior and posterior distributions for Player 1.}
4504
\label{fig08-03}
4505
\end{figure}
4506

4507
Based on the prior mean, before you saw the prizes you expected to see a showcase with a value close to \$30,000.
4508

4509
After making an estimate of \$23,000, you updated the prior distribution.
4510
Based on the combination of the prior and your estimate, you now expect the actual price to be about \$26,000.
4511

4512
On one level, this result makes sense.
4513
The posterior mean is near the midpoint of your estimate and the prior mean.
4514

4515
On another level, you might find this result strange because it
4516
suggests that if you {\em think} the price is \$23,000, then you
4517
should {\em believe} the price is \$26,000.
4518

4519
To resolve this apparent paradox, remember that you are combining two
4520
sources of information, historical data about past showcases and
4521
guesses about the prizes you see.
4522

4523
We are treating the historical data as the prior and updating it
4524
based on your guesses, but we could equivalently use your guess
4525
as a prior and update it based on historical data.
4526

4527
If you think of it that way, maybe it is less surprising that the
4528
most likely value in the posterior is not your original guess.
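
The claim that the two orderings are equivalent can be checked with a small sketch.  The numbers below are hypothetical stand-ins, not values from the dataset: a bell-shaped curve with mean \$30,000 and standard deviation \$10,000 summarizes the historical prices, and the guessing error is assumed to have standard deviation \$7,000.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]

qs = [1000 * i for i in range(81)]   # hypothetical prices, 0 to 80000

# Hypothetical summary of the historical prices
historical = [normal_pdf(q, 30000, 10000) for q in qs]

# Likelihood of the guess under each price, with an assumed error spread
guess, error_std = 23000, 7000
from_guess = [normal_pdf(guess - q, 0, error_std) for q in qs]

# Historical data as the prior, the guess as the likelihood
posterior_a = normalize([p * l for p, l
                         in zip(normalize(historical), from_guess)])

# The guess as the prior, historical data as the likelihood
posterior_b = normalize([p * l for p, l
                         in zip(normalize(from_guess), historical)])

mean_a = sum(q * p for q, p in zip(qs, posterior_a))
mean_b = sum(q * p for q, p in zip(qs, posterior_b))
print(round(mean_a), round(mean_b))
```

Both orderings multiply the same two curves and normalize, so the posteriors match, and the posterior mean falls between the guess and the mean of the historical prices.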
4529

4530
\section{Strategy}
4531

4532
Now that we have a posterior distribution, let's think about strategy.
4533

4534
%TODO: Outline of the sections that follow
4535

4536

4537

4538
\section{Probability of Winning}
4539

4540
First, from the point of view of Player 1, let's compute the probability that Player 2 overbids.
4541
To keep it simple, I'll use only the performance of past players, ignoring the estimated price of the showcase.
4542

4543
The following function takes a sample of differences between past bids and actual prices and returns the fraction that are overbids.
4544

4545
\begin{code}
4546
def prob_overbid(sample_diff):
4547
    return np.mean(sample_diff > 0)
4548
\end{code}
4549

4550
In the dataset, Player 2 overbids about 30\% of the time.
4551

4552
Now suppose Player 1 underbids by \$5,000.
4553
What is the probability that Player 2 underbids by more?
4554

4555
The following function uses past performance to estimate the probability that a player underbids by more than a given amount, \py{diff}:
4556

4557
\begin{code}
4558
def prob_worse_than(diff, sample_diff):
4559
    return np.mean(sample_diff < diff)
4560
\end{code}
4561

4562
Player 2 underbids by more than \$5,000 about 40\% of the time.
4563

4564
We can combine these functions to compute the probability that Player 1 wins, given the difference between their bid and the actual price:
4565

4566
\begin{code}
4567
def compute_prob_win(diff, sample_diff):
4568
    # if you overbid you lose
4569
    if diff > 0:
4570
        return 0
4571

4572
    # if the opponent overbids, you win
4573
    p1 = prob_overbid(sample_diff)
4574

4575
    # or if their bid is worse than yours, you win
4576
    p2 = prob_worse_than(diff, sample_diff)
4577
    return p1 + p2
4578
\end{code}
4579

4580
Let's look at this from your point of view as a contestant.
4581
\py{diff} is the difference between your bid and the actual price; if it's greater than 0, you overbid, so you lose.
4582

4583
\py{sample_diff} is a sample of differences for your opponent.
4584
If they overbid (and you didn't) you win.
4585

4586
Otherwise, we have to see whose bid is closer, yours or your opponent's.  If their bid is worse than yours, you win.
4587

4588
As an example, you can call it like this:
4589

4590
\begin{code}
4591
compute_prob_win(-5000, sample_diff2)
4592
\end{code}
4593

4594
If Player 1 underbids by \$5,000, their chance of winning is about 67\%.
4595
Now let's look at the probability of winning for a range of possible differences.
4596

4597
\begin{code}
4598
xs = np.linspace(-30000, 5000, 121)
4599
ys = [compute_prob_win(x, sample_diff2) for x in xs]
4600
\end{code}
4601

4602
From the point of view of Player 1, Figure~\ref{fig08-04} shows the probability of winning as a function of the difference between their bid and the actual price.
4603

4604
\begin{figure}
4605
% chap08soln.ipynb
4606
\centerline{\includegraphics[width=4in]{figs/fig08-04.pdf}}
4607
\caption{For Player 1, the probability of winning as a function of the difference between their bid and the actual price.}
4608
\label{fig08-04}
4609
\end{figure}
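
Here is a self-contained sketch of the same calculation with a small, made-up sample of differences in place of \py{sample_diff2}, written with built-in functions instead of NumPy.  It also checks that the result is a valid probability: when \py{diff} is at most 0, an opponent cannot both overbid and underbid by more than you, so the two terms in the sum never overlap.

```python
def prob_overbid(sample_diff):
    """Fraction of past differences that are overbids."""
    return sum(d > 0 for d in sample_diff) / len(sample_diff)

def prob_worse_than(diff, sample_diff):
    """Fraction of past differences that underbid by more than diff."""
    return sum(d < diff for d in sample_diff) / len(sample_diff)

def compute_prob_win(diff, sample_diff):
    # if you overbid you lose
    if diff > 0:
        return 0
    # the opponent overbids, or underbids by more than you do
    return prob_overbid(sample_diff) + prob_worse_than(diff, sample_diff)

# Hypothetical differences (bid minus price) for the opponent
sample_diff = [-12000, -8000, -6000, -3000, -2000,
               -1000, -500, 1000, 2000, 4000]

p = compute_prob_win(-5000, sample_diff)   # 0.3 + 0.3 = 0.6
print(p)

# Every difference yields a probability between 0 and 1
probs = [compute_prob_win(x, sample_diff) for x in range(-30000, 5001, 500)]
print(min(probs), max(probs))
```

With this sample, three of ten opponents overbid and three underbid by more than \$5,000, so the chance of winning after underbidding by \$5,000 is 0.6.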
4610

4611

4612
\section{Decision Analysis}
4613

4614
In the previous section we computed the probability of winning given that we have underbid by a particular amount.
4615

4616
In reality the contestants don't know how much they have underbid by because they don't know the actual price.
4617

4618
But they do have a posterior distribution that represents their beliefs about the actual price, and they can use that to estimate their probability of winning with a given bid.
4619

4620
The following function takes a possible bid, a posterior distribution of actual prices, and a sample of differences for the opponent, and returns the probability of winning.
4621

4622
\begin{code}
4623
def total_prob_win(bid, posterior, sample_diff):
4624
    total = 0
4625
    for price, prob in posterior.items():
4626
        diff = bid - price
4627
        total += prob * compute_prob_win(diff, sample_diff)
4628
    return total
4629
\end{code}
4630

4631
It loops through the hypothetical prices in the posterior distribution and for each price:
4632

4633
\begin{enumerate}
4634

4635
\item Computes the difference between the bid and the hypothetical price.
4636

4637
\item Computes the probability that the player wins, given that difference.
4638

4639
\item Adds up the weighted sum of the probabilities, where the weights are the probabilities in the posterior distribution.
4640

4641
\end{enumerate}
4642

4643
This loop implements the law of total probability:
4644

4645
\[ \p{win} = \sum_{price} \p{price} ~ \p{win ~|~ price} \]
4646

4647
Now we can loop through a range of possible bids and compute the probability of winning:
4648

4649
\begin{code}
4650
bids = posterior1.index
4651
probs = [total_prob_win(bid, posterior1, sample_diff2)
4652
         for bid in bids]
4653
\end{code}
4654

4655
For Player 1, Figure~\ref{fig08-05} shows the probability of winning as a function of their bid.
4656

4657
\begin{figure}
4658
% chap08soln.ipynb
4659
\centerline{\includegraphics[width=4in]{figs/fig08-05.pdf}}
4660
\caption{For Player 1, the probability of winning as a function of their bid.}
4661
\label{fig08-05}
4662
\end{figure}
4663

4664
Recall that your estimate was \$23,000.
4665

4666
After using your estimate to compute the posterior distribution, the posterior mean is about \$26,000.
4667

4668
But the bid that maximizes your chance of winning is \$21,000; with that bid, the probability of winning is 52\%.
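
A toy example makes the law of total probability easy to verify by hand.  In this sketch the posterior has only three hypothetical prices and the opponent's sample has four made-up differences; none of these numbers come from the dataset, and a dictionary stands in for \py{Pmf}.

```python
def compute_prob_win(diff, sample_diff):
    """Probability of winning given bid minus price (as in the text)."""
    if diff > 0:
        return 0
    n = len(sample_diff)
    p_overbid = sum(d > 0 for d in sample_diff) / n
    p_worse = sum(d < diff for d in sample_diff) / n
    return p_overbid + p_worse

def total_prob_win(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        diff = bid - price
        total += prob * compute_prob_win(diff, sample_diff)
    return total

# Hypothetical posterior over prices and opponent differences
posterior = {20000: 0.25, 25000: 0.5, 30000: 0.25}
sample_diff = [-10000, -5000, -2000, 3000]

probs = {bid: total_prob_win(bid, posterior, sample_diff)
         for bid in posterior}
best_bid = max(probs, key=probs.get)
print(probs)      # {20000: 0.5625, 25000: 0.625, 30000: 0.25}
print(best_bid)   # 25000
```

For the bid of 25000, the three terms of the sum are $0.25 \times 0$, $0.5 \times 1$, and $0.25 \times 0.5$, which add up to 0.625.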
4669

4670

4671
\section{Expected Gain}
4672

4673
In the previous section we computed the bid that maximizes your chance of winning.
4674
And if that's your goal, the bid we computed is optimal.
4675

4676
But winning isn't everything.
4677
Remember that if your bid is off by \$250 or less, you win both showcases.
4678
So it might be a good idea to increase your bid a little: it increases the chance you overbid and lose, but it also increases the chance of winning both showcases.
4679

4680
Let's see how that works out.
4681
The following function computes how much you will win, on average, given your bid, the actual price, and a sample of errors for your opponent.
4682

4683
\begin{code}
4684
def compute_gain(bid, price, sample_diff):
4685
    diff = bid - price
4686
    prob = compute_prob_win(diff, sample_diff)
4687

4688
    # if you are within 250 dollars, you win both showcases
4689
    if -250 <= diff <= 0:
4690
        return 2 * price * prob
4691
    else:
4692
        return price * prob
4693
\end{code}
4694

4695
For simplicity, I assume that both showcases have the same value.
4696
Since the probability of winning both showcases is small, the effect of this simplification should be small.
4697

4698
As an example, if the actual price is \$35,000
and you bid \$30,000,
you will win about \$23,600 worth of prizes on average.
4701

4702
In reality we don't know the actual price, but we have a posterior distribution that represents what we know about it.
4703
By averaging over the prices and probabilities in the posterior distribution, we can compute the {\bf expected gain} for a particular bid.
4704

4705
\begin{code}
def expected_gain(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        total += prob * compute_gain(bid, price, sample_diff)
    return total
\end{code}
4712

4713
The first argument is your bid; the second is the posterior distribution that represents your belief about the price of the showcase; and \py{sample_diff} is a sample of differences for your opponent.
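
To see the averaging step in isolation, here is a minimal sketch that uses a three-value posterior and a stand-in gain function.  Both are hypothetical: \py{toy_posterior} is made up for illustration, and \py{toy_gain} ignores the opponent entirely and pays the price whenever you do not overbid.

\begin{code}
import pandas as pd

# Hypothetical posterior: three possible prices and their probabilities
toy_posterior = pd.Series([0.2, 0.5, 0.3], index=[20000, 25000, 30000])

def toy_gain(bid, price):
    """Stand-in for compute_gain: you win the showcase unless you overbid."""
    return price if bid <= price else 0

def toy_expected_gain(bid, posterior):
    """Average the gain over the prices, weighted by their probabilities."""
    total = 0
    for price, prob in posterior.items():
        total += prob * toy_gain(bid, price)
    return total

low = toy_expected_gain(20000, toy_posterior)
high = toy_expected_gain(25000, toy_posterior)
\end{code}

Raising the bid from 20000 to 25000 forfeits the case where the price is 20000, so \py{high} is smaller than \py{low} under this stand-in.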
4714

4715
For the posterior we computed earlier, based on an estimate of \$23,000,
4716
the expected gain for a bid of \$21,000
4717
is about \$16,900.
4718

4719
But can we do better?
4720
To find out, we can loop through a range of bids and find the one that maximizes expected gain.
4721

4722
\begin{code}
bids = posterior1.index

gains = [expected_gain(bid, posterior1, sample_diff2) for bid in bids]

expected_gain_series = pd.Series(gains, index=bids)
\end{code}
4729

4730
Figure~\ref{fig08-06} shows expected gain for a range of possible bids.
4731

4732
\begin{figure}
% chap08soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig08-06.pdf}}
\caption{Expected gain for a range of possible bids.}
\label{fig08-06}
\end{figure}
4738

4739
Recall that the estimated value of the prizes is \$23,000 and the bid that maximizes the chance of winning is \$21,000.
4740
The bid that maximizes your expected gain is \$22,000; with that bid, your expected gain is about \$17,400.
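
If the gains are stored in a \py{Series} indexed by bid, as \py{expected_gain_series} is, the Pandas methods \py{idxmax} and \py{max} return the best bid and the corresponding gain.  The following sketch uses a small made-up \py{Series} so it runs on its own; the numbers are placeholders, not results from the model.

\begin{code}
import pandas as pd

# Placeholder gains indexed by bid (not the values from the analysis)
toy_gains = pd.Series([16000, 16900, 17400, 17100],
                      index=[20000, 21000, 22000, 23000])

best_bid = toy_gains.idxmax()    # index label where the gain is largest
best_gain = toy_gains.max()      # the largest gain itself
\end{code}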
4741

4742

4743
\section{Discussion}
4744

4745
One of the features of Bayesian estimation is that the
result comes in the form of a posterior distribution.  Classical
estimation usually generates a single point estimate or a confidence
interval, which is sufficient if estimation is the last step in the
process, but if you want to use an estimate as an input to a
subsequent analysis, point estimates and intervals are often not much
help.
\index{distribution}

In this example, we use the posterior distribution
to compute an optimal bid.  The return on a given bid is asymmetric
and discontinuous (if you overbid, you lose), so it would be hard to
solve this problem analytically.  But it is relatively simple to do
computationally.
\index{decision analysis}

Newcomers to Bayesian thinking are often tempted to summarize the
posterior distribution by computing the mean or the maximum
likelihood estimate.  These summaries can be useful, but if that's
all you need, then you probably don't need Bayesian methods in the
first place.
\index{maximum likelihood}
\index{summary statistic}

Bayesian methods are most useful when you can carry the posterior
distribution into the next step of the analysis to perform some
kind of decision analysis, as we did in this chapter, or some kind of
prediction, as we see in the next chapter.
4773

4774
\section{Exercises}
4775

4776
The code for this chapter is in \py{chap08.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
4777
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap08.ipynb}.
4778

4779
The notebook provides space where you can work on the following problems.
4780

4781
\begin{exercise}
4782
Following the instructions in the notebook, replicate the analysis in this chapter from the point of view of Player 2.
4783
\end{exercise}
4784

4785
\begin{exercise}
4786

4787
This exercise is inspired by a true story.  In 2001 I created Green Tea Press to publish my books, starting with {\it Think Python}.
I ordered 100 copies from a short-run printer and made the book available for sale through a distributor.  After the first week, the distributor reported that 12 copies were sold.  Based on that report, I thought I would run out of copies in about 8 weeks, so I got ready to order more.  My printer offered me a discount if I ordered more than 1000 copies, so I went a little crazy and ordered 2000 copies.  A few days later, my mother called to tell me that her copies of the book had arrived.  Surprised, I asked how many ``copies''.  She said ten.
4789

4790
It turned out I had sold only two copies to non-relatives.  And it took a lot longer than I expected to sell 2000 copies.
4791

4792
The details of this story are unique, but the general problem is something almost every retailer has to figure out.  Based on past sales, how do you predict future sales?  And based on those predictions, how do you decide how much to order and when?
4793

4794
Often the cost of a bad decision is complicated.  If you place a lot of small orders rather than one big one, your costs are likely to be higher.  If you run out of inventory, you might lose customers.  And if you order too much, you have to pay the various costs of holding inventory.
4795

4796
So, let's solve a version of the problem I faced.  Suppose you start selling books online.  During the first week you sell 12 copies (and let's assume that none of the customers are your mother).  During the second week you sell 8 copies.
4797

4798
Assuming that the arrival of orders is a Poisson process, we can think of the weekly orders as samples from a Poisson distribution with an unknown rate.
4799
Choose a prior you think is appropriate and use the data to compute the posterior distribution of the order rate.
4800
Then generate a posterior predictive distribution for the number of copies you expect during the next 8 weeks.
4801

4802
\begin{itemize}

\item Suppose the cost of printing the book is \$5 per copy,

\item But if you order 100 or more, it's \$4.50 per copy.

\item For every book you sell, you get \$10.

\item But if you run out of books before the end of 8 weeks, you lose \$50 in future sales for every week you are out of stock.

\item If you have books left over at the end of 8 weeks, you lose \$2 in inventory costs per extra book.
\end{itemize}
4814

4815
For example, suppose you get orders for 10 books per week, every week.
4816

4817
If you order 60 books,
4818
\begin{itemize}

\item The total cost is \$300.

\item You sell all 60 books, so you make \$600.

\item But the book is out of stock for two weeks, so you lose \$100 in future sales.
\end{itemize}
4826

4827
In total, your profit is \$200.
4828

4829
If you order 100 books,
4830
\begin{itemize}

\item The total cost is \$450.

\item You sell 80 books, so you make \$800.

\item But you have 20 books left over at the end, so you lose \$40.
\end{itemize}
4838

4839
In total, your profit is \$310.
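
The arithmetic in these two examples can be checked with a short function.  This is only a sketch of one way to encode the rules above, not the code from the notebook; the name \py{compute_profit_sketch} and the choice to count a week as out of stock when some order in that week cannot be filled are assumptions.

\begin{code}
def compute_profit_sketch(printed, weekly_orders):
    """Profit for one scenario: copies printed and a list of weekly orders."""
    # bulk discount applies to orders of 100 or more
    unit_cost = 4.5 if printed >= 100 else 5.0
    cost = printed * unit_cost

    stock = printed
    sold = 0
    weeks_out = 0
    for orders in weekly_orders:
        filled = min(orders, stock)
        sold += filled
        stock -= filled
        # assumption: a week counts as out of stock if any order goes unfilled
        if filled < orders:
            weeks_out += 1

    revenue = 10 * sold
    lost_sales = 50 * weeks_out
    inventory_cost = 2 * stock
    return revenue - cost - lost_sales - inventory_cost

orders = [10] * 8
profit_60 = compute_profit_sketch(60, orders)
profit_100 = compute_profit_sketch(100, orders)
\end{code}

With 10 orders per week for 8 weeks, \py{profit_60} and \py{profit_100} match the totals of \$200 and \$310 worked out above.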
4840

4841
Combining these costs with your predictive distribution, how many books should you order to maximize your expected profit?
4842

4843
In the notebook for this chapter, I provide some code to get you started.
4844

4845
\end{exercise}
4846

4847

4848

4849

4850
\chapter{Comparisons}
4851
\label{comparison}
4852

4853
The Elo rating system is a way to quantify the skill of players for games like chess (see \url{https://en.wikipedia.org/wiki/Elo_rating_system}).
4854

4855
It is based on a model of the relationship between the ratings of players and the outcome of a game.
4856
Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function (see \url{https://en.wikipedia.org/wiki/Logistic_function}):
4857

4858
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
4859

4860
The parameters $10$ and $400$ are arbitrary choices that determine the range of the ratings.  In chess, the range is from 100 to 2800.
4861

4862
Suppose \py{A} has a current rating of 1600 and \py{B} has a current rating of 1800.
4863
Then \py{A} and \py{B} play and \py{A} wins.  How should we update their ratings?
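
Before any update, the model already makes a prediction for this game.  As a worked example, plugging the two ratings into the formula above gives

$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(1800-1600)/400}) = 1 / (1 + 10^{0.5}) \approx 0.24$

So a win for \py{A} is something of an upset under this model, and we might expect the update to move the two ratings toward each other.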
4864

4865
In this chapter I will solve a simpler version of this question; then you will have a chance to finish it off as an exercise.
4866

4867
This chapter introduces {\bf joint distributions}, which represent the distributions of two or more variables and the relationships among them.
4868

4869
We'll extend the Bayesian update process we've seen in previous chapters and apply it to a joint distribution.
4870

4871
But first I will introduce a tool we will use to construct joint distributions and compute likelihoods: outer operations.
4872

4873

4874
\section{Outer operations}
4875
\label{outer-operations}
4876

4877
Many useful operations can be expressed in the form of an {\bf outer operation} of two sequences.
4878
Suppose you have sequences like \py{t1} and \py{t2}:
4879

4880
\begin{code}
t1 = [1, 3, 5]
t2 = [2, 4]
\end{code}
4884

4885
The most common outer operation is the outer product, which computes the product of every pair of values, one from each sequence.
4886

4887
For example, here is the outer product of \py{t1} and \py{t2}:
4888

4889
\begin{code}
a = np.multiply.outer(t1, t2)
\end{code}
4892

4893
The result is a NumPy array, but it's easier to understand what it is if I put it in a DataFrame:
4894

4895
\begin{code}
df = pd.DataFrame(a, index=t1, columns=t2)
\end{code}
4898

4899
Here's the result:
4900

4901
\input{tables/table09-02}
4902

4903
The values from \py{t1} appear along the rows; the values from \py{t2} appear along the columns.
4904

4905
Each element in the array is the product of an element from \py{t1} and an element from \py{t2}.
4906

4907
The outer sum is similar, except that each element is the {\em sum} of an element from \py{t1} and an element from \py{t2}.
4908

4909
\begin{code}
a = np.add.outer(t1, t2)
df = pd.DataFrame(a, index=t1, columns=t2)
\end{code}
4913

4914
Here's the result:
4915

4916
\input{tables/table09-02}
4917

4918
These outer operations work with Python lists and tuples, and NumPy arrays, but not Pandas \py{Series}.
4919

4920
So I'll use the following function, which takes two Pandas \py{Series} and puts the result into a \py{DataFrame}.
4921

4922
\begin{code}
def outer_product(s1, s2):
    a = np.multiply.outer(s1.to_numpy(), s2.to_numpy())
    return pd.DataFrame(a, index=s1.index, columns=s2.index)
\end{code}
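
As a quick check of \py{outer_product}, the following sketch applies it to two small \py{Series}.  The values and index labels are made up for illustration; the function is repeated so the example runs on its own.

\begin{code}
import numpy as np
import pandas as pd

def outer_product(s1, s2):
    a = np.multiply.outer(s1.to_numpy(), s2.to_numpy())
    return pd.DataFrame(a, index=s1.index, columns=s2.index)

# Made-up example data
s1 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
s2 = pd.Series([2, 4], index=['x', 'y'])

# rows are labeled by the index of s1, columns by the index of s2
df = outer_product(s1, s2)
\end{code}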
4927

4928
It might not be obvious yet why these operations are useful, but we'll see some examples soon.
4929

4930
With that, we are ready to take on a new Bayesian problem.
4931

4932
\section{How tall is A?}
4933

4934
Suppose I choose two people from the population of adult males in the United States, and call them A and B.  If we see that A is taller than B, how tall is A?
4935

4936
To answer this question:
4937

4938
\begin{enumerate}

\item I'll use background information about the height of men in the U.S. to form a prior distribution of height;

\item I'll construct a joint distribution of height for A and B (and I'll explain what that is);

\item Then I'll update the prior with the information that A is taller, and

\item From the posterior joint distribution I'll extract the posterior distribution of height for A.

\end{enumerate}
4949

4950
In the U.S. the average height of male adults is 178 cm and the standard deviation is 7.7 cm.  The distribution is not exactly normal, because nothing in the real world is, but the normal distribution is a pretty good model of the actual distribution, so we can use it as a prior distribution for A and B.
4951

4952
Here's an array of equally-spaced values from roughly 3 standard deviations below the mean to 3 standard deviations above.
4953

4954
\begin{code}
mean = 178
std = 7.7
qs = np.arange(mean-24, mean+24, 0.5)
\end{code}
4959

4960
SciPy provides \py{norm}, which represents a normal distribution with a given mean and standard deviation.  It provides \py{pdf}, which evaluates the normal probability density function (PDF); we will use the resulting densities as the prior probabilities.
4961

4962
\begin{code}
from scipy.stats import norm
ps = norm(mean, std).pdf(qs)
\end{code}
4966

4967
I'll store the \py{ps} and \py{qs} in a \py{Pmf} that represents the prior distribution.
4968

4969
\begin{code}
prior = Pmf(ps, qs)
prior.normalize()
\end{code}
4973

4974
This distribution represents what we believe about the heights of \py{A} and \py{B} before we take into account the data that \py{A} is taller.
4975

4976

4977
\section{Joint distribution}
4978

4979
The next step is to construct a distribution that represents the probability of every pair of heights, which is called a joint distribution.
4980
The elements of the joint distribution are
4981

4982
$\p{A_y~\mathrm{and}~B_x}$
4983

4984
which is the probability that \py{A} is $y$ cm tall and \py{B} is $x$ cm tall, for all values of $y$ and $x$.
4985

4986
At this point all we know about \py{A} and \py{B} is that they are male residents of the U.S., so their heights are independent; that is, knowing the height of \py{A} provides no additional information about the height of \py{B}.
4987
In that case, we can compute the joint probabilities like this:
4988

4989
$\p{A_y~\mathrm{and}~B_x} = \p{A_y}~\p{B_x}$
4990

4991
Each joint probability is the product of one element from the distribution for \py{A} and one element from the distribution for \py{B}.
4992
So we can compute the joint distribution using \py{outer_product}:
4993

4994
\begin{code}
joint = outer_product(prior, prior)
joint.shape
\end{code}
4998

4999
The result is a \py{DataFrame} with possible heights of \py{A} along the rows, heights of \py{B} along the columns, and the joint probabilities as elements.
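
Two properties of this joint prior are easy to check.  The sketch below rebuilds it with plain NumPy arrays in place of \py{Pmf} and \py{outer_product} (a substitution made only to keep the example self-contained) and confirms that there is one row and one column per height in \py{qs} and that the probabilities add up to 1.

\begin{code}
import numpy as np
from scipy.stats import norm

mean, std = 178, 7.7
qs = np.arange(mean-24, mean+24, 0.5)

# normalized prior probabilities, standing in for the Pmf
ps = norm(mean, std).pdf(qs)
ps /= ps.sum()

# joint prior under independence: product of every pair of probabilities
joint_array = np.multiply.outer(ps, ps)

n_rows, n_cols = joint_array.shape
total = joint_array.sum()
\end{code}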
5000

5001
The following function uses \py{pcolormesh} to plot the joint distribution.
5002

5003
\begin{code}
def plot_joint(joint):
    plt.pcolormesh(joint.columns, joint.index, joint)
    plt.colorbar()
    decorate(ylabel='A height in cm',
             xlabel='B height in cm')
\end{code}
5010

5011
Recall that \py{outer_product} puts the values of \py{A} along the rows and the values of \py{B} across the columns.
5012

5013
Figure~\ref{fig09-01} shows the results.
5014

5015
\begin{figure}
% chap09soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig09-01.pdf}}
\caption{Joint prior distribution of height for A and B.}
\label{fig09-01}
\end{figure}
5021

5022
As you might expect, the probability is highest near the mean and drops off away from the mean.
5023

5024

5025
\section{Likelihood}
5026

5027
Now that we have a joint prior distribution, we can update it with the data, which is that \py{A} is taller than \py{B}.
5028

5029
Each element in the joint distribution represents a hypothesis about the heights of \py{A} and \py{B}; for example:
5030

5031
\begin{enumerate}

\item The element \py{(180, 170)} represents the hypothesis that \py{A} is 180 cm tall and \py{B} is 170 cm tall.  Under this hypothesis, the probability that \py{A} is taller than \py{B} is 1.

\item The element \py{(170, 180)} represents the hypothesis that \py{A} is 170 cm tall and \py{B} is 180 cm tall.  Under this hypothesis, the probability that \py{A} is taller than \py{B} is 0.

\end{enumerate}
5038

5039
To compute the likelihood of every pair of values, we can extract the quantities from the joint prior, like this:
5040

5041
\begin{code}
Y = joint.index.to_numpy()
X = joint.columns.to_numpy()
\end{code}
5045

5046
And then apply the \py{outer} version of \py{np.subtract}, which computes the difference between every element of \py{Y} (height of \py{A}) and every element of \py{X} (height of \py{B}).
5047

5048
\begin{code}
diff = np.subtract.outer(Y, X)
\end{code}
5051

5052
The result is an array of differences.  To compute likelihoods, we use \py{np.where}, which puts \py{1} where \py{diff} is greater than 0 and \py{0} elsewhere.
5053

5054
\begin{code}
a = np.where(diff>0, 1, 0)
\end{code}
5057

5058
The result is an array of likelihoods, which I will put in a \py{DataFrame} with the values of \py{Y} in the index and the values of \py{X} in the columns.
5059

5060
\begin{code}
likelihood = pd.DataFrame(a, index=Y, columns=X)
\end{code}
5063

5064
Figure~\ref{fig09-02} shows the likelihood that A is taller than B for each hypothetical pair of heights.
5065

5066
\begin{figure}
% chap09soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig09-02.pdf}}
\caption{Likelihood that A is taller than B for each hypothetical pair of heights.}
\label{fig09-02}
\end{figure}
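
The same array of likelihoods can also be computed in one step with NumPy broadcasting, which compares a column of \py{Y} values to a row of \py{X} values.  This sketch is an alternative formulation, not the approach used in the rest of the chapter, and it uses a short made-up range of heights.

\begin{code}
import numpy as np

# Made-up heights for illustration
Y = np.array([170.0, 175.0, 180.0])   # possible heights of A
X = np.array([170.0, 175.0, 180.0])   # possible heights of B

# outer-difference version, as in the text
diff = np.subtract.outer(Y, X)
a = np.where(diff > 0, 1, 0)

# broadcasting version: Y as a column, X as a row
b = (Y[:, None] > X[None, :]).astype(int)

same = np.array_equal(a, b)
\end{code}

Ties get likelihood 0 in both versions, because the comparison is strict.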
5072

5073
We have a prior, we have a likelihood, and we are ready for the update.
5074

5075
\section{The update}
5076

5077
As usual, the unnormalized posterior is the product of the prior and the likelihood.
5078

5079
\begin{code}
posterior = joint * likelihood
\end{code}
5082

5083
I'll use the following function to normalize the posterior:
5084

5085
\begin{code}
def normalize(joint):
    prob_data = joint.to_numpy().sum()
    joint /= prob_data
\end{code}
5090

5091
We have to convert the \py{DataFrame} to a NumPy array before calling \py{sum}.  Otherwise, \py{DataFrame.sum} would compute the sums of the columns and return a \py{Series}.
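
The difference is easy to see with a small example.  In this sketch, \py{toy} is a made-up \py{DataFrame}, not one of the distributions in this chapter.

\begin{code}
import pandas as pd

toy = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])

col_sums = toy.sum()                 # a Series with one sum per column
grand_total = toy.to_numpy().sum()   # a single number
\end{code}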
5092

5093
Now we can normalize the posterior:
5094

5095
\begin{code}
normalize(posterior)
\end{code}
5098

5099
Figure~\ref{fig09-03} shows the result.
5100

5101
\begin{figure}
% chap09soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig09-03.pdf}}
\caption{Joint posterior distribution of height for A and B.}
\label{fig09-03}
\end{figure}
5107

5108
For all hypotheses where \py{A} is not taller than \py{B}, the posterior probability is 0.
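
To confirm this, and to check that the update leaves a proper distribution, here is a sketch of the whole update using NumPy arrays in place of \py{Pmf} and \py{DataFrame}.  The variable names are chosen for this example.

\begin{code}
import numpy as np
from scipy.stats import norm

mean, std = 178, 7.7
qs = np.arange(mean-24, mean+24, 0.5)
ps = norm(mean, std).pdf(qs)
ps /= ps.sum()

# joint prior: rows are heights of A, columns are heights of B
joint_prior = np.multiply.outer(ps, ps)

# likelihood of the data "A is taller than B" for every pair
diff = np.subtract.outer(qs, qs)
likelihood = np.where(diff > 0, 1, 0)

# unnormalized posterior, then normalize
joint_post = joint_prior * likelihood
joint_post /= joint_post.sum()

# total posterior probability of pairs where A is not taller than B
prob_not_taller = joint_post[diff <= 0].sum()
\end{code}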
5109

5110

5111
\section{The marginals}
5112
\label{marginals}
5113

5114
The joint posterior distribution represents what we believe about the heights of \py{A} and \py{B}, given the prior distributions and the information that \py{A} is taller.
5115

5116
From this joint distribution, we can compute posterior distributions for \py{A} and \py{B}.  To see how, let's start with a simpler problem.
5117

5118
Suppose we want to know the probability that \py{B} is 180 cm tall.  We can select the column from the joint distribution where \py{X=180}.
5119

5120
\begin{code}
column = posterior[180]
\end{code}
5123

5124
This column contains posterior probabilities for all cases where \py{X=180}; if we add them up, we get the total probability that \py{B} is 180 cm tall.
5125

5126
\begin{code}
column.sum()
\end{code}
5129

5130
Now, to get the posterior distribution of height for \py{B}, we can add up all of the columns, like this:
5131

5132
\begin{code}
column_sums = posterior.sum(axis=0)
\end{code}
5135

5136
The argument \py{axis=0} means we add up the elements in each column, moving down the rows, so the result has one total per column.
5137

5138
The result is a \py{Series} that contains every possible height for \py{B} and its probability.  In other words, it is the distribution of heights for \py{B}.
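
Because the \py{axis} argument is easy to get backwards, a small made-up \py{DataFrame} can help.  In this sketch the rows stand for values of \py{A} and the columns for values of \py{B}; the numbers are arbitrary.

\begin{code}
import pandas as pd

toy = pd.DataFrame([[0.1, 0.2],
                    [0.3, 0.4]],
                   index=['A1', 'A2'], columns=['B1', 'B2'])

by_column = toy.sum(axis=0)   # one total per column: indexed by B1, B2
by_row = toy.sum(axis=1)      # one total per row: indexed by A1, A2
\end{code}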
5139

5140
We can put it in a \py{Pmf} like this:
5141

5142
\begin{code}
marginal_B = Pmf(column_sums)
\end{code}
5145

5146
When we extract the distribution of a single variable from a joint distribution, the result is called a {\bf marginal distribution}.
5147
The name comes from a common visualization that shows the joint distribution in the middle and the marginal distributions in the margins.
5148

5149
Similarly, we can get the posterior distribution of height for \py{A} by adding up the rows and putting the result in a \py{Pmf}.
5150

5151
\begin{code}
row_sums = posterior.sum(axis=1)
marginal_A = Pmf(row_sums)
\end{code}
5155

5156
The following function takes a joint distribution and an axis number, and returns a marginal distribution.
5157

5158
\begin{code}
def marginal(joint, axis):
    return Pmf(joint.sum(axis=axis))
\end{code}
5162

5163
So we can compute the marginal distributions like this.
5164

5165
\begin{code}
marginal_B = marginal(posterior, axis=0)
marginal_A = marginal(posterior, axis=1)
\end{code}
5169

5170
Figure~\ref{fig09-04} shows what they look like.
5171

5172
\begin{figure}
% chap09soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig09-04.pdf}}
\caption{Prior and posterior distributions for A and B.}
\label{fig09-04}
\end{figure}
5178

5179
As you might expect, the posterior distribution for \py{A} is shifted to the right and the posterior distribution for \py{B} is shifted to the left.
5180

5181
Based on the observation that \py{A} is taller than \py{B}, we are inclined to believe that \py{A} is a little taller than average, and \py{B} is a little shorter.
5182

5183
Notice that the posterior distributions are a little narrower than the prior.
5184
The standard deviations of the posterior distributions are a little smaller, which means we are a little more certain about the heights of \py{A} and \py{B} after we compare them.
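
These claims can be checked numerically.  The following sketch recomputes the posterior with NumPy arrays (standing in for the \py{Pmf} and \py{DataFrame} objects) and compares the means and standard deviations of the marginals to those of the prior.  The helper \py{mean_std} is defined here for the example.

\begin{code}
import numpy as np
from scipy.stats import norm

mean, std = 178, 7.7
qs = np.arange(mean-24, mean+24, 0.5)
ps = norm(mean, std).pdf(qs)
ps /= ps.sum()

# joint posterior given that A (rows) is taller than B (columns)
joint_post = np.multiply.outer(ps, ps) * (np.subtract.outer(qs, qs) > 0)
joint_post /= joint_post.sum()

marginal_A = joint_post.sum(axis=1)   # add up each row
marginal_B = joint_post.sum(axis=0)   # add up each column

def mean_std(qs, ps):
    """Mean and standard deviation of a discrete distribution."""
    m = np.sum(qs * ps)
    return m, np.sqrt(np.sum(ps * (qs - m)**2))

prior_mean, prior_std = mean_std(qs, ps)
mean_A, std_A = mean_std(qs, marginal_A)
mean_B, std_B = mean_std(qs, marginal_B)
\end{code}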
5185

5186

5187
\section{Conditional posteriors}
5188

5189
Now suppose we measure \py{B} and find that he is 180 cm tall.  What does that tell us about \py{A}?
5190

5191
In the joint distribution, each column corresponds to a possible height for \py{B}.  We can select the column that corresponds to height 180 cm like this:
5192

5193
\begin{code}
column_180 = posterior[180]
\end{code}
5196

5197
The result is a \py{Series} that represents possible heights for \py{A} and their relative likelihoods.
5198
These likelihoods are not normalized, but we can normalize them like this:
5199

5200
\begin{code}
cond_A = Pmf(column_180)
cond_A.normalize()
\end{code}
5204

5205
The result is the {\bf conditional distribution} of height for \py{A} given that \py{B} is 180 cm tall.
5206
Figure~\ref{fig09-05} shows what it looks like.
5207

5208
Note that when we make a \py{Pmf} it copies the data by default, so we can modify \py{cond_A} without affecting \py{column_180} or \py{posterior}.
5209

5210
\begin{figure}
% chap09soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig09-05.pdf}}
\caption{Conditional distribution of height for A given that B is 180 cm tall.}
\label{fig09-05}
\end{figure}
5216

5217
The conditional distribution is cut off at 180 cm, because we have established that \py{A} is taller than \py{B} and \py{B} is 180 cm.
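
One way to check the cutoff is to add up the conditional probability at or below 180 cm.  This sketch uses NumPy arrays in place of the \py{DataFrame} and \py{Pmf}, and looks up the column for 180 cm by position; both choices are made for this example.

\begin{code}
import numpy as np
from scipy.stats import norm

mean, std = 178, 7.7
qs = np.arange(mean-24, mean+24, 0.5)
ps = norm(mean, std).pdf(qs)
ps /= ps.sum()

# joint posterior given that A (rows) is taller than B (columns)
joint_post = np.multiply.outer(ps, ps) * (np.subtract.outer(qs, qs) > 0)
joint_post /= joint_post.sum()

# select the column where B is 180 cm and normalize it
col = np.flatnonzero(qs == 180)[0]
cond_A = joint_post[:, col] / joint_post[:, col].sum()

# conditional probability that A is 180 cm or shorter
prob_at_or_below = cond_A[qs <= 180].sum()
\end{code}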
5218

5219
\section{Dependence and independence}
5220

5221
When we constructed the joint prior distribution, I said that the heights of \py{A} and \py{B} were independent, which means that knowing one of them provides no information about the other.
5222
In other words, the conditional probability $\p{A_y | B_x}$ is the same as the unconditioned probability $\p{A_y}$.
5223

5224
That's why we can compute an element of the joint prior, $\p{A_y~\mathrm{and}~B_x}$, by rewriting it in terms of conditional probability, $\p{B_x}~\p{A_y~|~B_x}$, and using the independence of $A$ and $B$ to replace the conditional probability.
5225

5226
Putting it all together, we have
5227

5228
$\p{A_y~\mathrm{and}~B_x} = \p{B_x}~\p{A_y}$
5229

5230
But remember, that's only true if $A$ and $B$ are independent.
5231
In the posterior distribution, they are not.
5232
We know that \py{A} is taller than \py{B}, so if we know how tall \py{B} is, that gives us information about \py{A}.
5233

5234
The conditional distribution we just computed demonstrates this dependence.
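
Here is a numerical way to see the difference: if two variables are independent, the joint distribution equals the outer product of its marginals.  The sketch below, which uses NumPy arrays in place of the \py{DataFrame} objects, applies that test to the joint prior and the joint posterior.

\begin{code}
import numpy as np
from scipy.stats import norm

mean, std = 178, 7.7
qs = np.arange(mean-24, mean+24, 0.5)
ps = norm(mean, std).pdf(qs)
ps /= ps.sum()

joint_prior = np.multiply.outer(ps, ps)
joint_post = joint_prior * (np.subtract.outer(qs, qs) > 0)
joint_post /= joint_post.sum()

def is_product_of_marginals(joint):
    """Check whether a joint array equals the outer product of its marginals."""
    marginal_rows = joint.sum(axis=1)
    marginal_cols = joint.sum(axis=0)
    return np.allclose(joint, np.multiply.outer(marginal_rows, marginal_cols))

prior_independent = is_product_of_marginals(joint_prior)
post_independent = is_product_of_marginals(joint_post)
\end{code}

The prior passes the test and the posterior does not, which is another way of saying that the update introduced a dependence between \py{A} and \py{B}.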
5235

5236

5237
\section{Summary}
5238

5239
In this chapter I started with the ``outer'' operations, like outer product, which we used to construct a joint distribution.
5240

5241
In general, you cannot construct a joint distribution from two marginal distributions, but in the special case where the distributions are independent, you can.
5242

5243
We extended the Bayesian update process we've seen in previous chapters and applied it to a joint distribution.  Then from the posterior joint distribution we extracted posterior marginal distributions and posterior conditional distributions.
5244

5245
As an exercise, you'll have a chance to apply the same process to a slightly more difficult problem, updating Elo ratings based on the outcome of a chess game.
5246

5247

5248
\section{Exercises}
5249

5250
The code for this chapter is in \py{chap09.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
5251
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap09.ipynb}.
5252

5253
The notebook provides space where you can work on the following problems.
5254

5255
\begin{exercise}
5256
Based on the results of the previous example, compute the posterior conditional distribution for \py{B} given that \py{A} is 190 cm.
5257
\end{exercise}
5258

5259

5260
\begin{exercise}
5261
Suppose we have established that \py{A} is taller than \py{B}, but we don't know how tall \py{B} is.
5262
Now we choose a random woman, \py{C}, and find that she is shorter than \py{A} by at least 15 cm.  Compute posterior distributions for the heights of \py{A} and \py{C}.
5263

5264
The average height for women in the U.S. is 163 cm; the standard deviation is 7.3 cm.
5265
\end{exercise}
5266

5267

5268
\begin{exercise}
5269
At the beginning of this chapter, I introduced
5270
the Elo rating system, which is used to quantify the skill level of players for games like chess.
5271

5272
It is based on a model of the relationship between the ratings of players and the outcome of a game.  Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function:
5273

5274
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
5275

5276
Suppose \py{A} has a current rating of 1600, but we are not sure it is accurate.  We could describe their true rating with a normal distribution with mean 1600 and standard deviation 100, to indicate our uncertainty.
5277

5278
And suppose \py{B} has a current rating of 1800, with the same level of uncertainty.
5279

5280
Then \py{A} and \py{B} play and \py{A} wins.  How should we update their ratings?
5281

5282
To answer this question:
5283

5284
\begin{enumerate}

\item Construct prior distributions for \py{A} and \py{B}.

\item Use them to construct a joint distribution, assuming that the prior distributions are independent.

\item Use the logistic function above to compute the likelihood of the outcome under each joint hypothesis.

\item Use the joint prior and likelihood to compute the joint posterior.

\item Extract and plot the marginal posteriors for \py{A} and \py{B}.

\item Compute the posterior means for \py{A} and \py{B}.  How much should their ratings change based on this outcome?

\end{enumerate}
5299

5300
\end{exercise}
5301

5302

5303

5304
\chapter{Classification}
5305
\label{classification}
5306

5307

5308
Classification might be the best-known application of Bayesian
methods, made famous as the basis of the first generation of spam
filters in the 1990s (see \url{https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering}).

In this chapter, I'll demonstrate Bayesian classification using data
collected and made available by Dr.~Kristen Gorman at the Palmer
Long-Term Ecological Research Station in Antarctica. We'll use this data
to classify penguins by species.

This dataset was published to support this article: Gorman, Williams,
and Fraser, ``Ecological
Sexual Dimorphism and Environmental Variability within a Community of
Antarctic Penguins (Genus \emph{Pygoscelis})'', March 2014, which you can read at \url{https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081}.

The dataset contains one row for each penguin and one column for each
variable, including the measurements we will use for classification.
We can read it into a \py{DataFrame} like this:

\begin{code}
df = pd.read_csv('penguins_raw.csv')
\end{code}

Three species of penguins are represented in the dataset: Adelie,
Chinstrap and Gentoo.
The measurements we'll use to classify them are:

\begin{itemize}
\item
  Body Mass in grams (g).
\item
  Flipper Length in millimeters (mm).
\item
  Culmen Length in millimeters.
\item
  Culmen Depth in millimeters.
\end{itemize}

If you are not familiar with the word ``culmen'', it refers to the
top margin of the beak (see \url{https://en.wikipedia.org/wiki/Bird_measurement\#Culmen}).
5347

5348

5349
\section{Distributions of measurements}
5350
\label{distributions-of-measurements}
5351

5352
These measurements will be most useful for classification if there are
substantial differences between species and small variation within
species. To see whether that is true, and to what degree, I will plot
cumulative distribution functions (CDFs) of each measurement for each
species.

The following function takes the \py{DataFrame} and
a column name, and returns a dictionary that maps from each species name
to a \py{Cdf} of the values in the given column.

\begin{code}
def make_cdf_map(df, varname, by='Species2'):
    cdf_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        cdf_map[species] = Cdf.from_seq(group, name=species)
    return cdf_map
\end{code}
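
Note that the default \py{by='Species2'} assumes the \py{DataFrame} has a column of short species names.  The text above does not show where that column comes from, so the following is a guess at one way to derive it: it assumes the raw data has a \py{Species} column whose first word is the short name, and it uses a made-up two-row \py{DataFrame} rather than the real file.

\begin{code}
import pandas as pd

# Made-up rows that imitate the assumed format of the Species column
toy = pd.DataFrame({
    'Species': ['Adelie Penguin (Pygoscelis adeliae)',
                'Gentoo penguin (Pygoscelis papua)'],
})

# keep the first word of each species name
toy['Species2'] = toy['Species'].str.split().str[0]
\end{code}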
5370

5371
Figure~\ref{fig10-01} shows the CDFs of each measurement for each species.

\begin{figure}
% chap10soln.ipynb
\centerline{\includegraphics[width=5.5in]{figs/fig10-01.pdf}}
\caption{CDFs of each measurement for each species of penguin.}
\label{fig10-01}
\end{figure}

It looks like we can use culmen length to identify Adelie penguins, but
the distributions for the other two species almost entirely overlap.

Using flipper length, we can distinguish Gentoo penguins from the other
two species. So with just these two features, it seems like we should be
able to classify penguins with some accuracy.

Culmen depth and body mass distinguish Gentoo penguins from the other
two species, but these features might not add a lot of additional
information, beyond flipper length and culmen length.

All of these CDFs show the sigmoid shape characteristic of the normal
distribution; I will take advantage of that observation in the next
section.
5394

5395
\section{Normal models}
5396
\label{normal-models}
5397

5398
Now let's use these features to classify penguins. I'll proceed in the
usual Bayesian way:

\begin{enumerate}

\item
  I'll define a prior distribution that represents a hypothesis for each
  species and a prior probability.
\item
  I'll compute the likelihood of the data under each hypothesis, and
  then
\item
  Compute the posterior probability of each hypothetical species.
\end{enumerate}

To compute the likelihood of the data under each hypothesis, I will use
the data to estimate the parameters of a normal distribution for each
feature and each species.

The following function takes a \py{DataFrame} and a
column name; it returns a dictionary that maps from each species name to
a \py{norm} object. \py{norm}
is defined in SciPy; it represents a normal distribution with a given
mean and standard deviation.

\begin{code}
from scipy.stats import norm

def make_norm_map(df, varname, by='Species2'):
    norm_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        mean = group.mean()
        std = group.std()
        norm_map[species] = norm(mean, std)
    return norm_map
\end{code}

For example, here's how we estimate the distributions of flipper length
for the three species.

\begin{code}
flipper_map = make_norm_map(df, 'Flipper Length (mm)')
\end{code}
5442

5443
As usual I will use a \py{Pmf} to represent the
5444
prior distribution. For simplicity, I'll assume that the three species
5445
are equally likely.
5446

5447
\begin{code}
5448
hypos = flipper_map.keys()
5449
prior = Pmf(1/3, hypos)
5450
prior
5451
\end{code}
5452

5453
Now suppose we measure a penguin and find that its flipper is 210 mm.
5454
What is the probability of that measurement under each hypothesis?
5455

5456
The \py{norm} object provides
5457
\py{pdf}, which computes the probability density
5458
function (PDF) of the normal distribution. We can use it to compute the
5459
likelihood of the observed data in a given distribution.
5460

5461
\begin{code}
5462
data = 210
5463
flipper_map['Adelie'].pdf(data)
5464
\end{code}
5465

5466
The result is a probability density, so we can't interpret it as a
5467
probability. But it is proportional to the likelihood of the data, so we
5468
can use it to update the prior.
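
To see why a density is not a probability, here is a minimal sketch using a hypothetical normal distribution with a small standard deviation (the parameters are made up for illustration, not taken from the penguin data). The density at the mean is greater than 1, which a probability can never be; but multiplying the density by the width of a narrow interval approximates the probability of falling in that interval.

\begin{code}
from scipy.stats import norm

# hypothetical distribution, chosen only for illustration
dist = norm(0, 0.1)

# density at the mean; greater than 1, so it cannot be a probability
density = dist.pdf(0)

# probability of a narrow interval centered on the mean
width = 0.001
prob = dist.cdf(width / 2) - dist.cdf(-width / 2)

# density times width approximates that probability
approx = density * width
\end{code}

Because every hypothesis is evaluated at the same measurement, the interval width is the same factor in each likelihood, and it cancels when we normalize the posterior.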
5469

5470
Here's how we compute the likelihood of the data in each distribution.
5471

5472
\begin{code}
5473
likelihood = [flipper_map[hypo].pdf(data) for hypo in hypos]
5474
\end{code}
5475

5476
Now we can do the update in the usual way.
5477

5478
\begin{code}
5479
posterior = prior * likelihood
5480
posterior.normalize()
5481
\end{code}
5482

5483
And here are the results:
5484

5485
\input{tables/table10-01}
5486

5487
A penguin with a 210 mm flipper has an 80\% chance of being a Gentoo and
about a 19\% chance of being a Chinstrap (assuming that the three
species were equally likely before the measurement).
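
The same update can be sketched without \py{Pmf}, using a NumPy array for the prior. The means and standard deviations below are rounded, illustrative values assumed for the sake of the example; they are not the estimates stored in \py{flipper_map}, so the resulting probabilities need not match the table exactly.

\begin{code}
import numpy as np
from scipy.stats import norm

# hypothetical (mean, std) of flipper length in mm for each species
params = {
    'Adelie': (190, 6.5),
    'Chinstrap': (196, 7.1),
    'Gentoo': (217, 6.5),
}

species = list(params)
prior = np.full(len(species), 1 / len(species))

# density of the measurement under each hypothesis
data = 210
likelihood = np.array([norm(mean, std).pdf(data)
                       for mean, std in params.values()])

# multiply by the prior and normalize
posterior = prior * likelihood
posterior /= posterior.sum()

most_likely = species[np.argmax(posterior)]
\end{code}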
5490

5491
The following function encapsulates the steps we just ran. It takes a
5492
\py{Pmf} representing the prior distribution, the
5493
observed data, and a map from each hypothesis to the distribution of the
5494
feature.
5495

5496
\begin{code}
5497
def update_penguin(prior, data, norm_map):
5498
    hypos = prior.qs
5499
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
5500
    posterior = prior * likelihood
5501
    posterior.normalize()
5502
    return posterior
5503
\end{code}
5504

5505
The return value is the posterior distribution.
5506

5507
As we saw in the CDFs, flipper length does not distinguish strongly
5508
between Adelie and Chinstrap penguins. If a penguin has a 190 mm
5509
flipper, it is almost certainly not a Gentoo, but it is almost equally
5510
likely to be Adelie or Chinstrap.
5511

5512
\begin{code}
5513
posterior2 = update_penguin(prior, 190, flipper_map)
5514
\end{code}
5515

5516
But culmen length \emph{can} make this distinction. We can estimate
5517
distributions of culmen length for each species like this:
5518

5519
\begin{code}
5520
culmen_map = make_norm_map(df, 'Culmen Length (mm)')
5521
\end{code}
5522

5523
A penguin with culmen length 38 mm is almost certainly an Adelie.
5524

5525
\begin{code}
5526
posterior3 = update_penguin(prior, 38, culmen_map)
5527
\end{code}
5528

5529
With culmen length 48 mm, it is probably not an Adelie, but it's about
5530
equally likely to be a Chinstrap or Gentoo.
5531

5532
\begin{code}
5533
posterior4 = update_penguin(prior, 48, culmen_map)
5534
\end{code}
5535

5536
Using one feature at a time, sometimes we can classify penguins with
5537
high confidence; sometimes we can't. We can do better using multiple
5538
features.
5539

5540
\section{Naive Bayesian classification}
5541
\label{naive-bayesian-classification}
5542

5543
To make it easier to do multiple updates, I'll use the following
function, which takes a prior \py{Pmf}, a sequence of
measurements, and a corresponding sequence of dictionaries containing
estimated distributions.
5547

5548
\begin{code}
5549
def update_naive(prior, data_seq, norm_maps):
5550
    posterior = prior.copy()
5551
    for data, norm_map in zip(data_seq, norm_maps):
5552
        posterior = update_penguin(posterior, data, norm_map)
5553
    return posterior
5554
\end{code}
5555

5556
The return value is a posterior \py{Pmf}.
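
Updating with one feature at a time is equivalent to multiplying the likelihoods of the features together, which is where the independence assumption enters. The following sketch checks that equivalence with made-up likelihood values; \py{update} is a hypothetical helper that works on NumPy arrays rather than \py{Pmf} objects.

\begin{code}
import numpy as np

def update(prior, likelihood):
    """Multiply by the likelihood and normalize."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

prior = np.array([1/3, 1/3, 1/3])

# made-up likelihoods of two measurements under three hypotheses
like1 = np.array([0.02, 0.08, 0.07])
like2 = np.array([0.001, 0.008, 0.034])

# one update per feature, as in update_naive
sequential = update(update(prior, like1), like2)

# a single update with the product of the likelihoods
combined = update(prior, like1 * like2)
\end{code}

The two results agree, so the order of the features does not matter either.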
5557

5558
I'll use the same features we looked at in the previous section: culmen
5559
length and flipper length.
5560

5561
\begin{code}
5562
varnames = ['Culmen Length (mm)', 'Flipper Length (mm)']
5563
norm_maps = [culmen_map, flipper_map]
5564
\end{code}
5565

5566
Now suppose we find a penguin with culmen length 48 mm and flipper
5567
length 210 mm. Here's the update:
5568

5569
\begin{code}
5570
data_seq = 48, 210
5571
posterior = update_naive(prior, data_seq, norm_maps)
5572
\end{code}
5573

5574
It's most likely to be a Gentoo.
5575

5576
I'll loop through the dataset and classify each penguin with these two
5577
features.
5578

5579
\begin{code}
5580
df['Classification'] = np.nan
5581
for i, row in df.iterrows():
5582
    data_seq = row[varnames]
5583
    posterior = update_naive(prior, data_seq, norm_maps)
5584
    df.loc[i, 'Classification'] = posterior.max_prob()
5585
\end{code}
5586

5587
The result is a new column in the \py{DataFrame}.
5588
So let's see how many we got right.
5589

5590
There are 344 penguins in the dataset, but two of them are missing
5591
measurements, so we have 342 valid cases.
5592
Of those, 324 are classified correctly, which is almost 95\%.
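
One way to count the correct classifications is sketched below. It assumes that the actual species is in the \py{Species2} column, as in \py{make_norm_map}, and that penguins that could not be classified have \py{NaN} in the \py{Classification} column. This version of \py{accuracy} is an illustration, not necessarily the one in the notebook, and the small \py{DataFrame} is made up to show the computation.

\begin{code}
import numpy as np
import pandas as pd

def accuracy(df, actual='Species2', predicted='Classification'):
    """Fraction of classified rows where the prediction is correct."""
    valid = df[predicted].notna()
    same = df[actual] == df[predicted]
    return same[valid].sum() / valid.sum()

# made-up example: three classified penguins, one unclassified
toy = pd.DataFrame({
    'Species2': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'Classification': ['Adelie', 'Chinstrap', 'Gentoo', np.nan],
})

toy_accuracy = accuracy(toy)
\end{code}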
5593

5594
The classifier we used in this section is called ``naive'' because it
5595
ignores correlations between the features. To see why that matters, I'll
5596
make a less naive classifier: one that takes into account the joint
5597
distribution of the features.
5598

5599
\section{Joint distributions}
5600
\label{joint-distributions}
5601

5602
Let's see what the joint distribution looks like.
5603
I'll start by making a scatter plot of the data.
5604

5605
\begin{code}
5606
def scatterplot(df, var1, var2):
5607
    grouped = df.groupby('Species2')
5608
    for species, group in grouped:
5609
        plt.plot(group[var2], group[var1], 'o',
5610
                 alpha=0.4, label=species)
5611

5612
    decorate(ylabel=var1, xlabel=var2)
5613
\end{code}
5614

5615
Figure~\ref{fig01-02} shows a scatter plot of culmen length and flipper length for the three
5616
species.
5617

5618
\begin{figure}
5619
\centerline{\includegraphics[width=4in]{figs/fig10-02.pdf}}
5620
\caption{}
5621
\label{fig01-02}
5622
\end{figure}
5623

5624
Within each species, there is a clear correlation between culmen length
5625
and flipper length.
5626

5627
If we ignore these correlations, we are assuming that the features are
5628
independent. To see what that looks like, I'll make a joint distribution
5629
for each species assuming independence.
5630

5631
The following function makes a discrete \py{Pmf}
5632
that approximates a normal distribution.
5633
It takes a \py{norm} object as a parameter; \py{sigmas} is the number of standard deviations to include above and below the mean; \py{n} is the number of points in the result.
5634

5635
\begin{code}
5636
def make_pmf(dist, sigmas=3, n=101):
5637
    mean, std = dist.mean(), dist.std()
5638
    low = mean - sigmas * std
5639
    high = mean + sigmas * std
5640
    qs = np.linspace(low, high, n)
5641
    ps = dist.pdf(qs)
5642
    pmf = Pmf(ps, qs)
5643
    pmf.normalize()
5644
    return pmf
5645
\end{code}
5646

5647
We can use it, along with \py{outer_product} from Section~\ref{outer-operations}, to make a joint distribution of culmen length and
5648
flipper length for each species.
5649

5650
\begin{code}
5651
joint_map = {}
5652
for species in hypos:
5653
    pmf1 = make_pmf(culmen_map[species])
5654
    pmf2 = make_pmf(flipper_map[species])
5655
    joint_map[species] = outer_product(pmf1, pmf2)
5656
\end{code}
5657

5658
And we can use the joint distribution to generate a contour plot.
5659

5660
\begin{code}
5661
def plot_contour(joint, **options):
5662
    plt.contour(joint.columns, joint.index, joint, **options)
5663
\end{code}
5664

5665
Figure~\ref{fig10-03} compares the data to joint distributions that
5666
assume independence.
5667

5668
\begin{figure}
5669
\centerline{\includegraphics[width=4in]{figs/fig10-03.pdf}}
5670
\caption{}
5671
\label{fig10-03}
5672
\end{figure}
5673

5674
The contours of a joint normal distribution form ellipses.
5675
In this example, because the features are uncorrelated, the ellipses are
5676
aligned with the axes. But they are not well aligned with the data.
5677

5678
We can make a better model of the data, and use it to compute better
5679
likelihoods, with a multivariate normal distribution.
5680

5681

5682
\section{Multivariate normal distribution}
5683
\label{multivariate-normal-distribution}
5684

5685
As we have seen, a univariate normal distribution is characterized by
5686
its mean and standard deviation or variance (where variance is the
5687
square of standard deviation).
5688

5689
A multivariate normal distribution is characterized by the means of the
5690
features and the \textbf{covariance matrix}, which contains the
5691
variances, which quantify the spread of the features, and the
5692
covariances, which quantify the relationships among them.
5693

5694
We can use the data to estimate the means and covariance matrix for the
5695
population of penguins. First I'll select the columns we want.
5696

5697
\begin{code}
5698
features = df[[var1, var2]]
5699
features.head()
5700
\end{code}
5701

5702
And compute the means.
5703

5704
\begin{code}
5705
mean = features.mean()
5706
mean
5707
\end{code}
5708

5709
\begin{code}
5710
# convert to a DataFrame and write as a table
5711
mean_df = pd.DataFrame(mean, columns=['mean'])
5712
write_table(mean_df, 'table10-04')
5713
\end{code}
5714

5715
The result is a \py{Series} containing the mean
5716
culmen length and flipper length.
5717

5718
We can also compute the covariance matrix:
5719

5720
\begin{code}
5721
cov = features.cov()
5722
write_table(cov, 'table10-05')
5723
cov
5724
\end{code}
5725

5726
The result is a \py{DataFrame} with one row and
5727
one column for each feature. The elements on the diagonal are the
5728
variances; the elements off the diagonal are covariances.
5729

5730
SciPy provides a \py{multivariate_normal} object
5731
we can use to represent a multivariate normal distribution. It takes a
5732
sequence of means and a covariance matrix as parameters:
5733

5734
\begin{code}
5735
from scipy.stats import multivariate_normal
5736

5737
multinorm = multivariate_normal(mean, cov)
5738
multinorm
5739
\end{code}
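
As a check on how the covariance matrix enters, the following sketch uses hypothetical means and standard deviations to confirm that when the covariances are zero, the multivariate normal density is the product of the univariate densities. In other words, a model that treats the features as independent is the special case with a diagonal covariance matrix.

\begin{code}
import numpy as np
from scipy.stats import norm, multivariate_normal

# hypothetical means and standard deviations for two features
means = [39, 190]
stds = [2.7, 6.5]

# diagonal covariance matrix: variances on the diagonal, zero covariances
cov_diag = np.diag(np.square(stds))
indep = multivariate_normal(means, cov_diag)

# density of one pair of measurements under both models
x = [40, 195]
joint_density = indep.pdf(x)
product = (norm(means[0], stds[0]).pdf(x[0]) *
           norm(means[1], stds[1]).pdf(x[1]))
\end{code}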
5740

5741
The following function makes a
5742
\py{multivariate_normal} object for each species.
5743

5744
\begin{code}
5745
def make_multinorm_map(df, varnames):
5746
    multinorm_map = {}
5747
    grouped = df.groupby('Species2')
5748
    for species, group in grouped:
5749
        features = group[varnames]
5750
        mean = features.mean()
5751
        cov = features.cov()
5752
        multinorm_map[species] = multivariate_normal(mean, cov)
5753
    return multinorm_map
5754
\end{code}
5755

5756
And here's how we use it.
5757

5758
\begin{code}
5759
multinorm_map = make_multinorm_map(df, [var1, var2])
5760
\end{code}
5761

5762
In the next section we'll see what the multivariate normal distributions
look like.

Then we'll use them to classify penguins, and we'll see if the results
are more accurate than those of the naive Bayesian classifier.
5767

5768

5769
\section{Visualizing a multivariate normal distribution}
5770
\label{visualizing-a-multivariate-normal-distribution}
5771

5772
This section uses some NumPy magic to generate contour plots for
5773
multivariate normal distributions. If that's interesting for you, great!
5774
Otherwise, feel free to skip to the results. In the next section we'll
5775
do the actual classification, which turns out to be easier than the
5776
visualization.
5777

5778
I'll start by making a contour map for the distribution of features
5779
among Adelie penguins.
5780
Here are the univariate distributions for the two features we'll use and
5781
the multivariate distribution we just computed.
5782

5783
\begin{code}
5784
norm1 = culmen_map['Adelie']
5785
norm2 = flipper_map['Adelie']
5786
multinorm = multinorm_map['Adelie']
5787
\end{code}
5788

5789
I'll make a discrete \py{Pmf} approximation for
5790
each of the univariate distributions.
5791

5792
\begin{code}
5793
pmf1 = make_pmf(norm1)
5794
pmf2 = make_pmf(norm2)
5795
\end{code}
5796

5797
And use them to make a mesh that contains all pairs of values.
5798

5799
\begin{code}
5800
X, Y = np.meshgrid(pmf1.qs, pmf2.qs)
5801
\end{code}
5802

5803
The mesh is represented by two arrays, one containing the values along
5804
the $x$ axis, the other containing the values along the $y$ axis.
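
To make the layout of the mesh concrete, here is a small sketch with made-up values. With the default indexing, the first argument of \py{np.meshgrid} varies along the columns of the result and the second varies along the rows.

\begin{code}
import numpy as np

# made-up coordinates: 3 values for x, 4 values for y
xs = np.array([0, 1, 2])
ys = np.array([10, 20, 30, 40])

X, Y = np.meshgrid(xs, ys)

# both arrays have one row per y value and one column per x value
shape = X.shape

# element [i, j] of the mesh corresponds to the pair (xs[j], ys[i])
pair = X[3, 1], Y[3, 1]
\end{code}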
5805

5806
In order to evaluate the multivariate distribution for each pair of
5807
values, we have to ``stack'' the arrays.
5808

5809
\begin{code}
5810
pos = np.dstack((X, Y))
5811
\end{code}
5812

5813
The result is a 3-D array that you can think of as a 2-D array of pairs.
5814
When we pass this array to \py{multinorm.pdf}, it
5815
evaluates the probability density function of the distribution for each
5816
pair of values.
5817

5818
\begin{code}
5819
a = multinorm.pdf(pos)
5820
\end{code}
5821

5822
The result is an array of probability densities. If we put them in a
5823
\py{DataFrame} and normalize them, the result is a
5824
discrete approximation of the joint distribution of the two features.
5825

5826
\begin{code}
5827
# meshgrid puts pmf2.qs along the rows of a, so transpose before labeling
joint = pd.DataFrame(a.T, index=pmf1.qs, columns=pmf2.qs)
5828
normalize(joint)
5829
\end{code}
5830

5831
Which we can plot with \py{plot_contour}:
5832

5833
\begin{code}
5834
plot_contour(joint)
5835
\end{code}
5836

5837
Figure~\ref{fig10-04} shows a scatter plot of the data along with the
5838
contours of the multivariate normal distribution for each species.
5839

5840
\begin{figure}
5841
% chap01soln.ipynb
5842
\centerline{\includegraphics[width=4in]{figs/fig10-04.pdf}}
5843
\caption{}
5844
\label{fig10-04}
5845
\end{figure}
5846

5847
The contours of a multivariate normal distribution are still ellipses,
5848
but now that we have taken into account the correlation between the
5849
features, the ellipses are no longer aligned with the axes.
5850

5851
Because it takes the correlations into account, the multivariate normal
5852
distribution is a better model for the data. And there is less overlap
5853
in the contours of the three distributions, which suggests that they
5854
should yield better classifications.
5855

5856
\section{A less naive classifier}
5857
\label{a-less-naive-classifier}
5858

5859
In a previous section we used \py{update_penguin}
5860
to update a prior \py{Pmf} based on observed data
5861
and a collection of \py{norm} objects that model
5862
the distribution of observations under each hypothesis. Here it is
5863
again:
5864

5865
\begin{code}
5866
def update_penguin(prior, data, norm_map):
5867
    hypos = prior.qs
5868
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
5869
    posterior = prior * likelihood
5870
    posterior.normalize()
5871
    return posterior
5872
\end{code}
5873

5874
I wrote this function with \py{norm} objects in
5875
mind, but it also works if the distributions in
5876
\py{norm_map} are
5877
\py{multivariate_normal} objects. So we can call
5878
it like this:
5879

5880
\begin{code}
5881
data = 38, 190
5882
update_penguin(prior, data, multinorm_map)
5883
\end{code}
5884

5885
A penguin with culmen length 38 and flipper length 190 is almost
5886
certainly an Adelie.
5887

5888
\begin{code}
5889
data = 48, 195
5890
update_penguin(prior, data, multinorm_map)
5891
\end{code}
5892

5893
A penguin with culmen length 48 and flipper length 195 is almost
5894
certainly a Chinstrap.
5895

5896
\begin{code}
5897
data = 48, 215
5898
update_penguin(prior, data, multinorm_map)
5899
\end{code}
5900

5901
And a penguin with culmen length 48 and flipper length 215 is almost
5902
certainly a Gentoo.
5903

5904
Let's see if this classifier does any better than the naive Bayesian
5905
classifier. I'll apply it to each penguin in the dataset:
5906

5907
\begin{code}
5908
df['Classification'] = np.nan
5909

5910
for i, row in df.iterrows():
5911
    data = row[varnames]
5912
    posterior = update_penguin(prior, data, multinorm_map)
5913
    df.loc[i, 'Classification'] = posterior.idxmax()
5914
\end{code}
5915

5916
And compute the accuracy:
5917

5918
\begin{code}
5919
accuracy(df)
5920
\end{code}
5921

5922
It turns out to be only a little better: the accuracy is 95.3\%,
5923
compared to 94.7\% for the naive Bayesian classifier.
5924

5925
In one way, that's disappointing. After all that work, it would have
5926
been nice to see a bigger difference.
5927

5928
But in another way, it's good news. In general, a naive Bayesian
5929
classifier is easier to implement and requires less computation. If it
5930
works nearly as well as a more complex algorithm, it might be a good
5931
choice for practical purposes.
5932

5933
But speaking of practical purposes, you might have noticed that this
5934
example isn't very useful. If we want to identify the species of a
5935
penguin, there are easier ways than measuring its flippers and beak.
5936

5937
However, there are valid scientific uses for this type of
5938
classification. One of them is the subject of the research paper we
5939
started with:
5940
sexual dimorphism (\url{https://en.wikipedia.org/wiki/Sexual_dimorphism}), that is,
differences in shape between male and female
5942
animals.
5943

5944
In some species, like angler fish, males and females look very
5945
different. In other species, like mockingbirds, they are difficult to
5946
tell apart. And dimorphism is worth studying because it provides insight
5947
into social behavior, sexual selection, and evolution.
5948

5949
One way to quantify the degree of sexual dimorphism in a species is to
5950
use a classification algorithm like the one in this chapter. If you can
5951
find a set of features that makes it possible to classify individuals by
5952
sex with high accuracy, that's evidence of high dimorphism.
5953

5954
As an exercise, you can use the dataset from this chapter to classify
5955
penguins by sex and see which of the three species is the most
5956
dimorphic.
5957

5958
\section{Exercises}
5959

5960
The code for this chapter is in \py{chap10.ipynb}, which is in the repository for this book.  See Section~\ref{codeinfo} for details.
5961
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap10.ipynb}.
5962

5963
The notebook provides space where you can work on the following problems.
5964

5965
\begin{exercise} In my example I used culmen length and flipper length
5966
because they seemed to provide the most power to distinguish the three
5967
species. But maybe we can do better by using more features.
5968

5969
Make a naive Bayesian classifier that uses all four measurements in the
5970
dataset: culmen length and depth, flipper length, and body mass. Is it
5971
more accurate than the model with two features?
5972

5973
\end{exercise}
5974

5975

5976
\begin{exercise}
5977

5978
One of the reasons the penguin dataset was collected
5979
was to quantify sexual dimorphism in different penguin species, that is,
5980
physical differences between male and female penguins. One way to
5981
quantify dimorphism is to use measurements to classify penguins by sex.
5982
If a species is more dimorphic, we expect to be able to classify its
members more accurately.
5984

5985
As an exercise, pick a species and use a Bayesian classifier (naive or
5986
not) to classify the penguins by sex. Which features are most useful?
5987
What accuracy can you achieve?
5988
\end{exercise}
5989

5990

5991
\chapter{Inference}
5992

5993
Whenever people compare Bayesian inference with conventional
5994
approaches, one of the questions that comes up most often is something
5995
like, ``What about p-values?'' And one of the most common examples is
5996
the comparison of two groups to see if there is a difference in their
5997
means.
5998

5999
In classical statistical inference, the usual tool for this scenario is
a Student's \textit{t}-test
(\url{https://en.wikipedia.org/wiki/Student\%27s_t-test}), and the result is a
p-value (\url{https://en.wikipedia.org/wiki/P-value}). This process is
an example of ``null hypothesis significance testing''.
6005

6006
A Bayesian alternative is to compute the posterior distribution of the
6007
difference between the groups. Then we can use that distribution to
6008
answer whatever questions we are interested in, including the most
6009
likely size of the difference, a credible interval that's likely to
6010
contain the true difference, the probability of superiority, or the
6011
probability that the difference exceeds some threshold.
6012

6013
To demonstrate this process, I'll solve a standard problem from a
statistics textbook: comparing the effect of an educational
``treatment'' to a control.
6016

6017
\section{Improving Reading Ability}
6018

6019
We'll use data from a Ph.D.~dissertation in educational psychology written in 1987
(\url{https://docs.lib.purdue.edu/dissertations/AAI8807671/}),
which was used as an example in a statistics textbook from 1989
(\url{https://books.google.com/books/about/Introduction_to_the_practice_of_statisti.html?id=pGBNhajABlUC})
and published on DASL
(\url{https://web.archive.org/web/20000603124754/http://lib.stat.cmu.edu/DASL/Datafiles/DRPScores.html}),
a web page that collects data stories.
6027

6028
Here's the description from DASL:
6029

6030
\begin{quote}
6031
An educator conducted an experiment to test whether new directed reading
6032
activities in the classroom will help elementary school pupils improve
6033
some aspects of their reading ability. She arranged for a third grade
6034
class of 21 students to follow these activities for an 8-week period. A
6035
control classroom of 23 third graders followed the same curriculum
6036
without the activities. At the end of the 8 weeks, all students took a
6037
Degree of Reading Power (DRP) test, which measures the aspects of
6038
reading ability that the treatment is designed to improve.
6039
\end{quote}
6040

6041
The data are in the repository for this book.
6042
I'll use Pandas to load the data into a \py{DataFrame}:
6043

6044
\begin{code}
6045
import pandas as pd
6046

6047
df = pd.read_csv('drp_scores.csv', skiprows=21, delimiter='\t')
6048
\end{code}
6049

6050
And \py{groupby} to separate the data for the
6051
\py{Treated} and \py{Control}
6052
groups:
6053

6054
\begin{code}
6055
grouped = df.groupby('Treatment')
6056
responses = {}
6057

6058
for name, group in grouped:
6059
    responses[name] = group['Response']
6060
\end{code}
6061

6062
Figure~\ref{fig11-01} shows the cumulative distributions of the scores for the two groups, and here are their summary statistics.
6063

6064
\begin{stdout}
6065
Group      n     mean    std
6066
-----      --    ----    ---
6067
Control    23    41.5    17.1
6068
Treated    21    51.5    11.0
6069
\end{stdout}
6070

6071
\begin{figure}
6072
\centerline{\includegraphics[width=4in]{figs/fig11-01.pdf}}
6073
\caption{CDF of test scores for treated group and control group.}
6074
\label{fig11-01}
6075
\end{figure}
6076

6077
The distribution of scores is not exactly normal for either group, but
6078
it is close enough that the normal model is a reasonable choice.
6079

6080
So I'll assume that in the entire population of students (not just the
6081
ones in the experiment), the distribution of scores is well modeled by a
6082
normal distribution with unknown mean and standard deviation. I'll use
6083
\py{mu} and \py{sigma} to
6084
denote these unknown population parameters.
6085

6086
And we'll do a Bayesian update to estimate what they are.
6087

6088
\section{Estimating parameters}
6089

6090
As always, we need a prior distribution for the parameters.
6091
Since there are two parameters, it will be a joint distribution.
6092
I'll construct it by choosing marginal distributions for each parameter
6093
and computing their outer product.
6094

6095
As a simple starting place, I'll assume that the prior distributions for
6096
\py{mu} and \py{sigma} are
6097
uniform.
6098

6099
\begin{code}
6100
mus = np.linspace(20, 80, 101)
6101
prior_mu = Pmf(1, mus, name='mean')
6102

6103
sigmas = np.linspace(5, 30, 101)
6104
prior_sigma = Pmf(1, sigmas, name='std')
6105
\end{code}
6106

6107
Assuming that the parameters are independent, we can use \py{outer_product} from Section~\ref{outer-operations} to construct the joint prior distribution.
6108

6109
\begin{code}
6110
from utils import outer_product
6111

6112
prior = outer_product(prior_mu, prior_sigma)
6113
\end{code}
6114

6115
Now, we would like to know the probability of each score in the dataset
6116
for each hypothetical pair of values, \py{mu} and
6117
\py{sigma}. I'll do that by making a 3-dimensional
grid with values of \py{mu} along the first axis,
values of \py{sigma} along the second axis, and the
scores from the control group along the third axis.
6121

6122
\begin{code}
6123
data = responses['Control']
6124

6125
sigmas, mus, data_mesh = np.meshgrid(prior.columns,
6126
                                     prior.index,
6127
                                     data)
6128
\end{code}
6129

6130
Now we can use \py{norm.pdf} to compute the
6131
probability density of each score for each hypothetical pair of
6132
parameters.
6133

6134
\begin{code}
6135
from scipy.stats import norm
6136

6137
# norm.pdf takes the mean (loc) before the standard deviation (scale)
densities = norm.pdf(data_mesh, mus, sigmas)
6138
\end{code}
6139

6140
The result is a 3-D array. To compute likelihoods, I'll compute the
6141
product of these densities along the third axis, that is,
6142
\py{axis=2}:
6143

6144
\begin{code}
6145
likelihood = densities.prod(axis=2)
6146
likelihood.shape
6147
\end{code}
6148

6149
The result is a 2-D array that contains the likelihood of the entire
6150
dataset for each hypothetical pair of parameters.
6151

6152
We can use this array as part of a Bayesian update, as in this function:
6153

6154
\begin{code}
6155
from utils import normalize
6156

6157
def update_norm(prior, data):
6158
    X, Y, Z = np.meshgrid(prior.columns, prior.index, data)
6159
    likelihood = norm.pdf(Z, Y, X).prod(axis=2)
6160

6161
    posterior = prior * likelihood
6162
    normalize(posterior)
6163
    return posterior
6164
\end{code}
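
To check that the axes line up, the following sketch builds the same kind of mesh for a small, made-up grid and compares one element of the result with a direct computation. It assumes, as in \py{update_norm}, that the rows of the prior correspond to values of \py{mu} and the columns to values of \py{sigma}; the grid values and scores are arbitrary.

\begin{code}
import numpy as np
from scipy.stats import norm

# made-up grid and data
mus = np.linspace(30, 50, 5)        # rows of the prior
sigmas = np.linspace(10, 20, 4)     # columns of the prior
data = np.array([38.0, 45.0, 52.0])

# with default indexing, the result has shape (len(mus), len(sigmas), len(data))
S, M, D = np.meshgrid(sigmas, mus, data)
likelihood = norm.pdf(D, M, S).prod(axis=2)

# direct computation for one pair of parameters
expected = norm.pdf(data, mus[1], sigmas[2]).prod()
\end{code}

Element \py{[1, 2]} of \py{likelihood} matches \py{expected}, which confirms that rows index \py{mu} and columns index \py{sigma}.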
6165

6166
Here are the updates for the control and treatment groups:
6167

6168
\begin{code}
6169
data = responses['Control']
6170
posterior_control = update_norm(prior, data)
6171

6172
data = responses['Treated']
6173
posterior_treated = update_norm(prior, data)
6174
\end{code}
6175

6176
Figure~\ref{fig11-02} shows what the joint posterior distributions look like.
6177

6178
\begin{figure}
6179
\centerline{\includegraphics[width=4in]{figs/fig11-02.pdf}}
6180
\caption{Joint posterior distributions for the treated and control groups.}
6181
\label{fig11-02}
6182
\end{figure}
6183

6184
Along the vertical axis, it looks like the mean score for the treated
6185
group is higher. Along the horizontal axis, it looks like the standard
6186
deviation for the control group is higher.
6187

6188
If we think the treatment causes these differences, the data suggest
that the treatment increases the mean of the scores and decreases their spread.
6190
We can see these differences more clearly by looking at the marginal
6191
distributions for \py{mu} and
6192
\py{sigma}.
6193

6194
\section{Posterior marginal distributions}
6195

6196
I'll use \py{marginal}, which we saw in Section~\ref{marginals},
6197
to extract the posterior marginal distributions for the population means.
6198

6199
\begin{code}
6200
from utils import marginal
6201

6202
pmf_mean_control = marginal(posterior_control, 1)
6203
pmf_mean_treated = marginal(posterior_treated, 1)
6204
\end{code}
6205

6206
Figure~\ref{fig11-03} shows what they look like.
6207
It seems like we are pretty sure that the population mean in the treated
6208
group is higher.
6209

6210
\begin{figure}
6211
\centerline{\includegraphics[width=4in]{figs/fig11-03.pdf}}
6212
\caption{}
6213
\label{fig11-03}
6214
\end{figure}
6215

6216
We can use \py{prob_gt} to
6217
compute the probability of superiority:
6218

6219
\begin{code}
6220
Pmf.prob_gt(pmf_mean_treated, pmf_mean_control)
6221
\end{code}
6222

6223
There is a 98\% chance that the mean in the treated group is higher.
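
The computation behind the probability of superiority can be sketched with NumPy: enumerate all pairs of values, one from each distribution, and add up the probability of the pairs where the first value exceeds the second. The function below is a hypothetical stand-in that takes arrays of quantities and probabilities; it is not the implementation of \py{Pmf.prob_gt}, and it assumes the two distributions are independent.

\begin{code}
import numpy as np

def prob_greater(qs1, ps1, qs2, ps2):
    """Probability that a value from the first distribution exceeds
    a value from the second, assuming independence."""
    gt = np.greater.outer(qs1, qs2)        # True where qs1[i] > qs2[j]
    joint = np.multiply.outer(ps1, ps2)    # probability of each pair
    return joint[gt].sum()

# made-up example: two fair three-sided dice
qs = np.array([1, 2, 3])
ps = np.full(3, 1/3)
p = prob_greater(qs, ps, qs, ps)
\end{code}

Of the nine equally likely pairs, three have a larger first value, so \py{p} is 1/3.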
6224

6225
We can use \py{sub_dist} to compute the
6226
distribution of the difference.
6227

6228
\begin{code}
6229
diff = Pmf.sub_dist(pmf_mean_treated, pmf_mean_control)
6230
\end{code}
6231

6232
But there are two things to be careful about when we use methods like
6233
\py{sub_dist}.
6234

6235
The first is that the result usually contains more elements than the
6236
original \py{Pmf}.
6237
In this example, the original distributions have the same quantities, so
6238
the size increase is moderate.
6239
But in the worst case, the size of the result can be the product of the
6240
sizes of the originals.
6241

6242
The other thing to be aware of is that plotting a
6243
\py{Pmf} does not always work well. In this
6244
example, if we plot the distribution of differences, the result is
6245
pretty noisy.
6246

6247
There are two ways to work around that limitation. One is to plot the
6248
CDF, which smooths out the noise.
6249

6250
The other option is to use kernel density estimation (KDE) to make a
6251
smooth approximation of the PDF on an equally-spaced grid.
6252
The following function takes a \py{Pmf} and the number of points on the grid, and returns a smooth \py{Pmf}, ready for plotting.
6253

6254
\begin{code}
6255
from scipy.stats import gaussian_kde
6256

6257
def make_kde(pmf, n=101):
6258
    kde = gaussian_kde(pmf.qs, weights=pmf.ps)
6259
    qs = np.linspace(pmf.qs.min(), pmf.qs.max(), n)
6260
    ps = kde.evaluate(qs)
6261
    pmf = Pmf(ps, qs)
6262
    pmf.normalize()
6263
    return pmf
6264
\end{code}
6265

6266
Figure~\ref{fig11-04} shows the smoothed distribution of the difference.
The mean difference is almost 10 points, which is substantial.
6268

6269
Finally, we can use \py{credible_interval} to
6270
compute a 90\% credible interval.
6271

6272
\begin{code}
6273
diff.credible_interval(0.9)
6274
\end{code}
6275

6276
Based on the data, we are pretty sure the treatment improves test scores
6277
by 2.4 to 17.4 points.
6278

6279
\section{Using summary statistics}
6280

6281
In this example the dataset is not very big, so it doesn't take too long
6282
to compute the probability of every score under every hypothesis. But
6283
the result is a 3-D array; for larger datasets, it might be too big to
6284
compute practically.
6285

6286
Also, with larger datasets the likelihoods get very small, sometimes so
6287
small that we can't compute them with normal floating-point arithmetic.
6288
That's because we are computing the probability of a particular dataset;
6289
the number of possible datasets is astronomically big, so the
6290
probability of any of them is very small.
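
The following sketch shows the underflow problem with a simulated dataset of 2000 values; the sample size, seed, and parameters are arbitrary choices. It also shows a common workaround, summing log-densities instead of multiplying densities, which is a different technique from the summary statistics this section is about.

\begin{code}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(17)
data = rng.normal(40, 17, size=2000)

# each density is small, so the product underflows to zero
likelihood = norm.pdf(data, 40, 17).prod()

# the sum of log-densities is a large negative number, but finite
log_likelihood = norm.logpdf(data, 40, 17).sum()
\end{code}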
6291

6292
An alternative is to compute a summary of the dataset and compute the
6293
likelihood of the summary. For example, if we compute the sample mean of
6294
the data and the sample standard deviation, we could compute the
6295
likelihood of those summary statistics under each hypothesis.
6296

6297
As an example, suppose we know that the population mean is 40 and the
6298
standard deviation is 17. We can make a \py{norm}
6299
object that represents a normal distribution with these parameters:
6300

6301
\begin{code}
mu = 40
sigma = 17
dist = norm(mu, sigma)
\end{code}
6306

6307
Now suppose we draw 1000 samples from this distribution with sample size
6308
\py{n=20}. I'll use \py{rvs},
6309
which generates a random sample, to simulate this experiment.
6310

6311
\begin{code}
n = 20
samples = dist.rvs((1000, n))
samples.shape
\end{code}
6316

6317
The result is an array with 1000 rows and 20 columns; each row is
one sample of 20 values.
6319

6320
If we compute the mean of each row, the result is an array that contains
6321
1000 sample means; that is, each value is the mean of a sample with
6322
\py{n=20}.
6323

6324
\begin{code}
sample_means = samples.mean(axis=1)
sample_means.shape
\end{code}
6328

6329
Now, we would like to know what the distribution of these sample means
is. Using the properties of the normal distribution
(\url{https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables}), we
can show that their distribution is normal with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$:
6334

6335
\begin{code}
dist_m = norm(mu, sigma/np.sqrt(n))
\end{code}
6338

6339
\py{dist_m} represents the ``sampling distribution
6340
of the mean''.
6341
In the notebook for this chapter, you'll see that the random sample means follow the theoretical
6342
distribution closely, as expected.
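
If you don't have the notebook handy, the same check can be sketched with the standard library alone. This is not the notebook's code: it uses \py{random.gauss} in place of \py{rvs}, and the seed is arbitrary.

```python
import math
import random
import statistics

random.seed(17)  # arbitrary seed, for reproducibility

mu, sigma, n = 40, 17, 20

# Draw 1000 samples of size n and record the mean of each
sample_means = []
for _ in range(1000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

# Compare the simulated spread to the theoretical value
observed = statistics.pstdev(sample_means)
expected = sigma / math.sqrt(n)

print(observed, expected)
```

The two values should be close, but not identical, because the simulation is based on a finite number of samples.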
6343

6344
We can also compute standard deviations for each row in
6345
\py{samples}.
6346

6347
\begin{code}
sample_stds = samples.std(axis=1)
sample_stds.shape
\end{code}
6351

6352
The result is an array of sample standard deviations. We might wonder
6353
what the distribution of these values is. The derivation
(\url{https://en.wikipedia.org/wiki/Normal_distribution\#Sample_variance})
is not as easy, but if we transform the sample standard deviations like
this:
6357

6358
$t = n s^2 / \sigma^2$
6359

6360
where $n$ is the sample size, $s$ is the sample standard deviation
(computed with divisor $n$, which is the default for NumPy's \py{std}),
and $\sigma$ is the population standard deviation, the transformed
values follow a chi-square distribution
(\url{https://en.wikipedia.org/wiki/Chi-square_distribution})
with $n-1$ degrees of freedom.
6365

6366
Here are the transformed values.
6367

6368
\begin{code}
transformed = n * sample_stds**2 / sigma**2
\end{code}
6371

6372
And I'll create a \py{chi2} object that represents
6373
a chi-square distribution.
6374

6375
\begin{code}
from scipy.stats import chi2

dist_s = chi2(n-1)
\end{code}
6380

6381
In the notebook you'll see that the distribution of transformed sample standard deviations agrees with
6382
the theoretical distribution.
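
Here is a rough standard-library version of that check, again as a sketch rather than the notebook's code. It relies on two facts about the chi-square distribution with $k$ degrees of freedom: its mean is $k$ and its variance is $2k$. I use \py{statistics.pstdev} because it divides by $n$, which matches the transformation above.

```python
import random
import statistics

random.seed(17)  # arbitrary seed, for reproducibility

mu, sigma, n = 40, 17, 20

transformed = []
for _ in range(1000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s = statistics.pstdev(sample)      # divisor n, not n-1
    transformed.append(n * s**2 / sigma**2)

# A chi-square distribution with n-1 degrees of freedom
# has mean n-1 and variance 2(n-1)
mean_t = statistics.mean(transformed)
var_t = statistics.pvariance(transformed)

print(mean_t, n - 1)
print(var_t, 2 * (n - 1))
```

If you replace \py{pstdev} with \py{stdev}, which divides by $n-1$, the mean of the transformed values moves away from $n-1$. That is the kind of specification error this sort of test can catch.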
6383

6384
I think it is useful to check theoretical results like this, for a few
6385
reasons:
6386

6387
\begin{itemize}
6388
\item
6389
  It confirms that my understanding of the theory is correct,
6390

6391
\item
6392
  It confirms that the conditions where I am applying the theory are
6393
  conditions where the theory holds,
6394

6395
\item
6396
  It confirms that the implementation details are correct. For many
6397
  distributions, there is more than one way to specify the parameters.
6398
  If you use the wrong specification, this kind of testing will help you
6399
  catch the error.
6400
\end{itemize}
6401

6402
Before we move on, I'll mention one other theoretical result we will
use: Basu's theorem (\url{https://en.wikipedia.org/wiki/Basu\%27s_theorem}),
which implies that, for samples from a normal distribution, the sample
mean and sample standard deviation are independent.
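
In the spirit of checking theoretical results, here is a small simulation, using only the standard library, that computes the correlation between sample means and sample standard deviations. The helper \py{correlation} is defined here for illustration. A correlation near zero is consistent with independence, but it does not prove it.

```python
import random
import statistics

random.seed(17)  # arbitrary seed, for reproducibility

mu, sigma, n = 40, 17, 20

means, stds = [], []
for _ in range(1000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))
    stds.append(statistics.pstdev(sample))

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

r = correlation(means, stds)
print(r)
```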
6406

6407

6408
\section{Update with summary statistics}
6409

6410
Now we're ready to do an update. I'll compute summary statistics for the
6411
two groups.
6412

6413
\begin{code}
summary = {}
for name, response in responses.items():
    summary[name] = (len(response),
                     response.mean(),
                     response.std())
\end{code}
6420

6421
The result is a dictionary that maps from group name to a tuple that
6422
contains the sample size, \py{n}, the sample mean,
6423
\py{m}, and the sample standard deviation
6424
\py{s}, for each group.
6425

6426
I'll demonstrate the update with the summary statistics from the control
6427
group.
6428

6429
\begin{code}
n, m, s = summary['Control']
\end{code}
6432

6433
I'll make a mesh with hypothetical values of
6434
\py{mu} on the vertical axis and values of
6435
\py{sigma} on the horizontal axis.
6436

6437
\begin{code}
sigmas, mus = np.meshgrid(prior.columns, prior.index)
sigmas.shape
\end{code}
6441

6442
Now we can compute the likelihood of seeing the sample mean,
6443
\py{m}, for each pair of parameters.
6444

6445
\begin{code}
like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
\end{code}
6448

6449
And use it to update the prior.
6450

6451
\begin{code}
posterior1 = prior * like1
normalize(posterior1)
\end{code}
6455

6456
Next we compute the likelihood of seeing the sample standard deviation, \py{s}, for each pair of parameters.
6457

6458
\begin{code}
like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)
\end{code}
6461

6462
And here's the second update:
6463

6464
\begin{code}
posterior2 = posterior1 * like2
normalize(posterior2)
\end{code}
6468

6469
The following function does both updates, using the sample mean and
6470
standard deviation.
6471

6472
\begin{code}
def update_norm_summary(prior, data):
    n, m, s = data
    sigmas, mus = np.meshgrid(prior.columns, prior.index)

    like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
    like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)

    posterior = prior * like1 * like2
    normalize(posterior)

    return posterior
\end{code}
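
To see the same computation without NumPy or Pandas, here is a self-contained sketch that does the update on a small grid stored in a dictionary. The summary statistics, the grids, and the uniform prior are all made up for illustration, and \py{normal_pdf} and \py{chi2_pdf} are hypothetical helpers that stand in for \py{norm.pdf} and \py{chi2.pdf}.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

def chi2_pdf(t, k):
    """Density of a chi-square distribution with k degrees of freedom."""
    log_pdf = ((k / 2 - 1) * math.log(t) - t / 2
               - (k / 2) * math.log(2) - math.lgamma(k / 2))
    return math.exp(log_pdf)

# Hypothetical summary statistics: size, mean, standard deviation
n, m, s = 23, 41.5, 17.1

# Hypothetical grids of parameter values, with a uniform prior
mus = [20 + 0.5 * i for i in range(81)]      # 20 to 60
sigmas = [5 + 0.5 * j for j in range(51)]    # 5 to 30

# Unnormalized posterior: likelihood of m times likelihood of s
posterior = {}
for mu in mus:
    for sigma in sigmas:
        like1 = normal_pdf(m, mu, sigma / math.sqrt(n))
        like2 = chi2_pdf(n * s**2 / sigma**2, n - 1)
        posterior[mu, sigma] = like1 * like2

# Normalize so the probabilities add up to 1
total = sum(posterior.values())
for key in posterior:
    posterior[key] /= total

# Most likely pair of parameters
best_mu, best_sigma = max(posterior, key=posterior.get)
print(best_mu, best_sigma)
```

With a uniform prior, the most likely value of \py{mu} is the grid point closest to the sample mean, which is a useful sanity check.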
6485

6486
Here are the updates for the two groups.
6487

6488
\begin{code}
data = summary['Control']
posterior_control2 = update_norm_summary(prior, data)

data = summary['Treated']
posterior_treated2 = update_norm_summary(prior, data)
\end{code}
6495

6496
You can see the results in the notebook for this chapter.
6497
Visually, these posterior joint distributions are similar to the ones we
6498
computed using the entire datasets, not just the summary statistics.
6499
But they are not exactly the same, as we'll see by comparing the marginal
6500
distributions.
6501

6502
\section{Comparing marginals}
6503

6504
Again, let's extract the marginal posterior distributions.
6505

6506
\begin{code}
pmf_mean_control2 = marginal(posterior_control2, 1)
pmf_mean_treated2 = marginal(posterior_treated2, 1)
\end{code}
6510

6511
And compare them to results we got using the entire dataset.
6512
Figure~\ref{fig11-05} shows the results.
6513

6514
\begin{figure}
6515
\centerline{\includegraphics[width=4in]{figs/fig11-05.pdf}}
6516
\caption{}
6517
\label{fig11-05}
6518
\end{figure}
6519

6520
For both groups, the distribution of \py{mu} is a little wider when we use only the summary statistics; that is, we are a little less certain about the values of the means.
6521

6522
If we compute the posterior distribution of the difference in means,
6523
the mean difference is nearly the same, but the credible interval is a bit wider.
6524

6525
That's because the update we did is based on the implicit assumption
6526
that the distribution of the data is actually normal, but it's not.
6527
As a result, when we replace the dataset with the summary statistics, we lose some information about the true distribution of the data. With less
6528
information, we are less certain about the parameters.
6529

6530
\section{Summary}
6531

6532
In this chapter we used a joint distribution to represent prior
6533
probabilities for the parameters of a normal distribution,
6534
\py{mu} and \py{sigma}.
6535

6536
And we updated that distribution two ways: first using the entire
6537
dataset and the normal PDF; then using summary statistics, the normal
6538
PDF, and the chi-square PDF.
6539

6540
Using summary statistics is computationally more efficient, but it loses
6541
some information in the process.
6542

6543
Normal distributions, and other distributions that are well approximated
by them, appear in many domains. So the
6545
methods in this chapter are broadly applicable. The exercises at the end
6546
of the chapter will give you a chance to apply them.
6547

6548
\section{Exercises}
6549

6550
\begin{exercise}
6551
Looking again at the posterior joint distribution of
6552
\py{mu} and \py{sigma}, it
6553
seems like the standard deviation of the treated group might be lower;
6554
if so, that would suggest that the treatment is more effective for
6555
students with lower scores.
6556

6557
But before we speculate too much, we should estimate the size of the
6558
difference and see whether it might actually be 0.
6559

6560
As we did with the values of \py{mu} in the
6561
previous section, extract the posterior marginal distributions of
6562
\py{sigma} for the two groups. What is the
6563
probability that the standard deviation is higher in the control group?
6564

6565
Compute the distribution of the difference in
6566
\py{sigma} between the two groups. What is the mean
6567
of this difference? What is the 90\% credible interval?
6568

6569
\end{exercise}
6570

6571

6572
\begin{exercise}
6573
An ``effect size'' is a statistic intended to quantify the magnitude of a phenomenon (see \url{http://en.wikipedia.org/wiki/Effect_size}).
6574
If the phenomenon is a difference in means between two groups, a common way to quantify it is Cohen's effect size, denoted $d$.
6575

6576
If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
6577
parameters for Group 2 are $(\mu_2, \sigma_2)$, Cohen's
6578
effect size is
6579
%
6580
\[ d = \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
6581
%
6582
Use the joint posterior distributions for the two groups to compute the posterior distribution for Cohen's effect size.
6583
Then compute the mean and 90\% credible interval.
6584

6585
Hint: if enumerating all pairs from the two distributions takes too
6586
long, consider random sampling.
6587
\end{exercise}
6588

6589

6590
\begin{exercise}
6591
This exercise is inspired by a question that appeared on Reddit
(\url{https://www.reddit.com/r/statistics/comments/hcvl2j/q_reverse_empirical_distribution_rule_question/}).
6594

6595
An instructor announces the results of an exam like this, ``The average
6596
score on this exam was 81. Out of 25 students, 5 got more than 90, and I
6597
am happy to report that no one failed (got less than 60).''
6598

6599
Based on this information, what do you think the standard deviation of
6600
scores was?
6601

6602
You can assume that the distribution of scores is approximately normal.
6603
And let's assume that the sample mean, 81, is actually the population
6604
mean, so we only have to estimate \py{sigma}.
6605

6606
Hint: To compute the probability of a score greater than 90, you can use
6607
\py{norm.sf}, which computes the survival function,
6608
also known as the complementary CDF, or
6609
\py{1 - cdf(x)}.
6610

6611
\end{exercise}
6612

6613

6614
\begin{exercise}
6615
I have a soft spot for crank science, so this
exercise is about the Variability Hypothesis
(\url{http://en.wikipedia.org/wiki/Variability_hypothesis}), which
6619

6620
\begin{quote}
6621
``originated in the early nineteenth century with Johann Meckel, who
6622
argued that males have a greater range of ability than females,
6623
especially in intelligence. In other words, he believed that most
6624
geniuses and most mentally retarded people are men. Because he
6625
considered males to be the 'superior animal,' Meckel concluded that
6626
females' lack of variation was a sign of inferiority.''
6627
\end{quote}
6628

6629
I particularly like that last part because I suspect that if it turned
6630
out that women were \emph{more} variable, Meckel would have taken that
6631
as a sign of inferiority, too.
6632

6633
Nevertheless, the Variability Hypothesis suggests an exercise we can use
6634
to practice the methods in this chapter. Let's look at the distribution
6635
of heights for men and women in the U.S. and see who is more variable.
6636

6637
I used 2018 data from the CDC's Behavioral Risk Factor Surveillance
System (BRFSS, \url{https://www.cdc.gov/brfss/annual_data/annual_2018.html}),
which includes self-reported
heights from 154407 men and 254722 women.
6641

6642
Here's what I found:
6643

6644
\begin{itemize}
6645
\item
6646
  The average height for men is 178 cm; the average height for women is
6647
  163 cm. So men are taller on average; no surprise there.
6648
\item
6649
  For men the standard deviation is 8.27 cm; for women it is 7.75 cm. So
6650
  in absolute terms, men's heights are more variable.
6651
\end{itemize}
6652

6653
But to compare variability between groups, it is more meaningful to use
the coefficient of variation (CV), which is the standard deviation
divided by the mean
(\url{https://en.wikipedia.org/wiki/Coefficient_of_variation}).
6657
It is a dimensionless measure of variability relative to scale.
6658

6659
For men CV is 0.0465; for women it is 0.0475. The coefficient of
6660
variation is higher for women, so this dataset provides evidence against
6661
the Variability Hypothesis. But we can use Bayesian methods to make that
6662
conclusion more precise.
6663

6664
Use these summary statistics to compute the posterior distribution of
6665
\py{mu} and \py{sigma} for the
6666
distributions of male and female height. Use
6667
\py{Pmf.div_dist} to compute posterior
6668
distributions of CV. Based on this dataset and the assumption that the
6669
distribution of height is normal, what is the probability that the
6670
coefficient of variation is higher for men? What is the most likely
6671
ratio of the CVs and what is the 90\% credible interval for that ratio?
6672

6673
Hint: Use different prior distributions for the two groups, and choose
6674
them so they cover all parameters with non-negligible probability.
6675

6676
\end{exercise}
6677

6678

6679
\chapter{Observer Bias}
6680
\label{observer}
6681

6682
\section{The Red Line problem}
6683

6684
In Massachusetts, the Red Line is a subway that connects
6685
Cambridge and Boston.  When I was working in Cambridge I took the Red
6686
Line from Kendall Square to South Station and caught the commuter rail
6687
to Needham.  During rush hour Red Line trains run every 7--8
6688
minutes, on average.
6689
\index{Red Line problem}
6690
\index{Boston}
6691

6692
When I arrived at the station, I could estimate the time until
6693
the next train based on the number of passengers on the platform.
6694
If there were only a few people, I inferred that I just missed
6695
a train and expected to wait about 7 minutes.  If there were
6696
more passengers, I expected the train to arrive sooner.  But if
6697
there were a large number of passengers, I suspected that
6698
trains were not running on schedule, so I would go back to the
6699
street level and get a taxi.
6700

6701
While I was waiting for trains, I thought about how Bayesian
6702
estimation could help predict my wait time and decide when I
6703
should give up and take a taxi.  This chapter presents the
6704
analysis I came up with.
6705

6706
This chapter is based on a project by Brendan Ritter and
6707
Kai Austin, who took a class with me at Olin College.
6708
The code in this chapter is available from
6709
\url{http://thinkbayes.com/redline.py}.  The code I used
6710
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
6711
  For more information
6712
see Section~\ref{download}.
6713
\index{Olin College}
6714

6715

6716
\section{The model}
6717

6718
\begin{figure}
6719
% redline.py
6720
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
6721
\caption{PMF of gaps between trains, based on collected data,
6722
smoothed by KDE.  \py{z} is the actual distribution; \py{zb}
6723
is the biased distribution seen by passengers. }
6724
\label{fig.redline0}
6725
\end{figure}
6726

6727
Before we get to the analysis, we have to make some
6728
modeling decisions.  First, I will treat passenger arrivals as
6729
a Poisson process, which means I assume that passengers are equally
6730
likely to arrive at any time, and that they arrive at an unknown
6731
rate, $\lam$, measured in passengers per minute.  Since I
6732
observe passengers during a short period of time, and at the same
6733
time every day, I assume that $\lam$ is constant.
6734
\index{Poisson process}
6735

6736
On the other hand, the arrival process for trains is not Poisson.
6737
Trains to Boston are supposed to leave from the end of the line
6738
(Alewife station) every 7--8 minutes during peak times, but by the time
6739
they get to Kendall Square, the time between trains varies between 3
6740
and 12 minutes.
6741

6742
To gather data on the time between trains, I wrote a script that
6743
downloads real-time data from
6744
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
6745
trains arriving at Kendall Square, and records their arrival times
6746
in a database.  I ran the script from 4pm to 6pm every weekday
6747
for 5 days, and recorded about 15 arrivals per day.  Then
6748
I computed the time between consecutive arrivals; the distribution
6749
of these gaps is shown in Figure~\ref{fig.redline0}, labeled \py{z}.
6750

6751
If you stood on the platform from 4pm to 6pm and recorded the time
6752
between trains, this is the distribution you would see.  But if you
6753
arrive at some random time (without regard to the train schedule) you
6754
would see a different distribution.  The average time
6755
between trains, as seen by a random passenger, is substantially
6756
higher than the true average.
6757

6758
Why?  Because a passenger is more likely to arrive during a
6759
large interval than a small one.  Consider a simple example:
6760
suppose that the time between trains is either 5 minutes
6761
or 10 minutes with equal probability.  In that case
6762
the average time between
6763
trains is 7.5 minutes.
6764

6765
But a passenger is more likely to arrive during a 10 minute gap
6766
than a 5 minute gap; in fact, twice as likely.  If we surveyed
6767
arriving passengers, we would find that 2/3 of them arrived during
6768
a 10 minute gap, and only 1/3 during a 5 minute gap.  So the
6769
average time between trains, as seen by an arriving passenger,
6770
is 8.33 minutes.
6771

6772
This kind of {\bf observer bias} appears in many contexts.  Students
6773
think that classes are bigger than they are because more of them are
6774
in the big classes.  Airline passengers think that planes are fuller
6775
than they are because more of them are on full flights.
6776
\index{observer bias}
6777

6778
In each case, values from the actual distribution are
6779
oversampled in proportion to their value.  In the Red Line example,
6780
a gap that is twice as big is twice as likely to be observed.
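
In symbols, if $p(x)$ is the actual probability of a gap of length $x$, the probability seen by passengers, which I'll write $p_b(x)$ for this aside, is proportional to $x \, p(x)$:
%
\[ p_b(x) = \frac{x \, p(x)}{\sum_{x'} x' \, p(x')} \]
%
For the example with gaps of 5 and 10 minutes, each with probability $1/2$, the normalizing constant is $5 \cdot 1/2 + 10 \cdot 1/2 = 7.5$, so
%
\[ p_b(5) = \frac{2.5}{7.5} = \frac{1}{3}, \qquad
   p_b(10) = \frac{5}{7.5} = \frac{2}{3} \]
%
and the mean gap seen by passengers is
$5 \cdot 1/3 + 10 \cdot 2/3 = 25/3 \approx 8.33$ minutes.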
6781

6782
So given the actual distribution of gaps, we can compute the
6783
distribution of gaps as seen by passengers.  \py{BiasPmf}
6784
does this computation:
6785

6786
\begin{code}
def BiasPmf(pmf):
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
\end{code}
6796

6797
\py{pmf} is the actual distribution; \verb"new_pmf" is the
6798
biased distribution.  Inside the loop, we multiply the
6799
probability of each value, \py{x}, by the likelihood it will
6800
be observed, which is proportional to \py{x}.  Then we
6801
normalize the result.
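
If you don't have \py{thinkbayes} installed, the same idea can be sketched with a plain dictionary that maps each value to its probability. The names \py{bias_pmf} and \py{pmf_mean} are mine, not part of the library. The example checks the case of 5 and 10 minute gaps discussed above.

```python
def bias_pmf(pmf):
    """Return the length-biased version of a PMF stored as a dict.

    pmf: dict that maps each value to its probability
    """
    # weight each probability by the value itself, then normalize
    biased = {x: x * p for x, p in pmf.items()}
    total = sum(biased.values())
    return {x: p / total for x, p in biased.items()}

def pmf_mean(pmf):
    """Mean of a PMF stored as a dict."""
    return sum(x * p for x, p in pmf.items())

pmf_z = {5: 0.5, 10: 0.5}      # actual distribution of gaps
pmf_zb = bias_pmf(pmf_z)       # distribution seen by passengers

print(pmf_zb)
print(pmf_mean(pmf_z), pmf_mean(pmf_zb))
```

The biased probabilities come out to 1/3 and 2/3, and the biased mean is about 8.33 minutes, compared to 7.5 for the actual distribution.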
6802

6803
Figure~\ref{fig.redline0} shows the actual distribution of gaps,
6804
labeled \py{z}, and the distribution of gaps seen by passengers,
6805
labeled \py{zb} for ``z biased''.
6806

6807

6808
\section{Wait times}
6809

6810
\begin{figure}
6811
% redline.py
6812
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
6813
\caption{CDF of \py{z}, \py{zb}, and the wait time seen
6814
by passengers, \py{y}. }
6815
\label{fig.redline2}
6816
\end{figure}
6817

6818
Wait time, which I call \py{y}, is the time between the arrival
6819
of a passenger and the next arrival of a train.  Elapsed time, which I
6820
call \py{x}, is the time between the arrival of the previous
6821
train and the arrival of a passenger.  I chose these definitions
6822
so that \py{zb = x + y}.
6823

6824
Given the distribution of \py{zb}, we can compute the distribution of
6825
\py{y}.  I'll start with a simple case and then generalize.
6826
Suppose, as in the previous example, that \py{zb} is either 5 minutes
6827
with probability 1/3, or 10 minutes with probability 2/3.
6828

6829
If we arrive at a random time during a 5 minute gap,
6830
\py{y} is uniform from 0 to 5 minutes.  If we arrive during a 10
6831
minute gap, \py{y} is uniform from 0 to 10.  So the overall
6832
distribution is a mixture of uniform distributions weighted
6833
according to the probability of each gap.
6834
\index{uniform distribution}
6835

6836
The following function takes the distribution of \py{zb} and
6837
computes the distribution of \py{y}:
6838

6839
\begin{code}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)

    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{code}
6849

6850
\py{PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
6851
distribution to its probability.  Then it uses \py{MakeMixture},
6852
which we saw in Section~\ref{mixture}, to compute the mixture.
6853
\index{mixture}
6854
\index{MakeMixture}
6855
\index{meta-Pmf}
6856

6857
\py{PmfOfWaitTime} also uses \py{MakeUniformPmf}, defined here:
6858

6859
\begin{code}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{code}
6867

6868
\py{low} and \py{high} are the endpoints of the uniform distribution
6869
(both ends included).  Finally, \py{MakeUniformPmf} uses {\tt
6870
  MakeRange}, defined here:
6871

6872
\begin{code}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{code}
6876

6877
\py{MakeRange} defines a set of possible values for wait time
6878
(expressed in seconds).  By default it divides the range into
6879
10 second intervals.
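
To see the mixture at work on the simple example, here is a dictionary-based sketch that does not depend on \py{thinkbayes}. The function names are mine, and for readability I measure time in minutes with a step of 1, rather than in seconds with a step of 10.

```python
def uniform_pmf(low, high, step=1):
    """Uniform PMF on low, low+step, ..., high (both ends included)."""
    values = range(low, high + step, step)
    return {x: 1 / len(values) for x in values}

def wait_time_pmf(pmf_zb, step=1):
    """Mixture of uniform distributions, weighted by gap probability."""
    pmf_y = {}
    for gap, prob in pmf_zb.items():
        for y, p in uniform_pmf(0, gap, step).items():
            pmf_y[y] = pmf_y.get(y, 0) + prob * p
    return pmf_y

pmf_zb = {5: 1/3, 10: 2/3}     # biased distribution of gaps, in minutes
pmf_y = wait_time_pmf(pmf_zb)

mean_zb = sum(x * p for x, p in pmf_zb.items())
mean_y = sum(y * p for y, p in pmf_y.items())
print(mean_zb, mean_y)
```

The mean wait time is half the mean of the biased gap distribution, as you would expect if passengers arrive at a uniformly random point within a gap.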
6880

6881
To encapsulate the process of computing these distributions, I
6882
created a class called \py{WaitTimeCalculator}:
6883

6884
\begin{code}
class WaitTimeCalculator(object):

    def __init__(self, pmf_z):
        # unbiased distribution of gaps between trains
        self.pmf_z = pmf_z
        # distribution of gaps as seen by passengers
        self.pmf_zb = BiasPmf(pmf_z)

        # wait time and elapsed time have the same distribution
        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{code}
6894

6895
The parameter, \verb"pmf_z", is the unbiased distribution of \py{z}.
6896
\verb"pmf_zb" is the biased distribution of gap time, as seen by
6897
passengers.
6898

6899
\verb"pmf_y" is the distribution of wait time.  \verb"pmf_x" is the
6900
distribution of elapsed time, which is the same as the distribution of
6901
wait time.  To see why, remember that for a particular value of
\py{zb}, the distribution of \py{y} is uniform from 0 to \py{zb}.
Also
%
\begin{code}
x = zb - y
\end{code}
%
So the distribution of \py{x} is also uniform from 0 to \py{zb}.
6910

6911
Figure~\ref{fig.redline2} shows the distribution of \py{z}, \py{zb},
6912
and \py{y} based on the data I collected from the Red Line web site.
6913

6914
To present these distributions, I am switching from Pmfs to Cdfs.
6915
Most people are more familiar with Pmfs, but I think Cdfs are easier
6916
to interpret, once you get used to them.  And if you want to plot
6917
several distributions on the same axes, Cdfs are the way to go.
6918
\index{Cdf}
6919
\index{cumulative distribution function}
6920

6921
The mean of \py{z} is 7.8 minutes.  The mean of \py{zb} is 8.8
6922
minutes, about 13\% higher.  The mean of \py{y} is 4.4, half
6923
the mean of \py{zb}.
6924

6925
As an aside, the Red Line schedule reports that trains run every
6926
9 minutes during peak times.  This is close to the average of
6927
\py{zb}, but higher than the average of \py{z}.  I exchanged email
6928
with a representative of the MBTA, who confirmed that the reported
6929
time between trains is deliberately conservative in order to
6930
account for variability.
6931

6932

6933
\section{Predicting wait times}
6934
\label{elapsed}
6935

6936
\begin{figure}
6937
% redline.py
6938
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
6939
\caption{Prior and posterior of \py{x} and predicted \py{y}. }
6940
\label{fig.redline3}
6941
\end{figure}
6942

6943
Let's get back to the motivating question: suppose that when
6944
I arrive at the platform I see 15 people waiting.
6945
How long should I expect to wait until the next train arrives?
6946

6947
As always, let's start with the easiest version of the problem
6948
and work our way up.  Suppose we are given the actual distribution of
6949
\py{z}, and we know that the passenger arrival rate,
6950
$\lam$, is 2 passengers per minute.
6951

6952
In that case we can:
6953

6954
\begin{enumerate}
6955

6956
\item Use the distribution of \py{z} to compute
the prior distribution of \py{zb}, the time between trains
as seen by a passenger.

\item Use the number of passengers to estimate the distribution
of \py{x}, the elapsed time since the last train.

\item Use the relation \py{y = zb - x} to get the
distribution of \py{y}.
6965

6966
\end{enumerate}
6967

6968
The first step is to create a \py{WaitTimeCalculator} that
6969
encapsulates the distributions of \py{zb}, \py{x},
6970
and \py{y}, prior to taking into account the number of
6971
passengers.
6972

6973
\begin{code}
    wtc = WaitTimeCalculator(pmf_z)
\end{code}
6976

6977
\verb"pmf_z" is the given distribution of gap times.
6978

6979
The next step is to make an \py{ElapsedTimeEstimator} (defined
6980
below), which encapsulates the posterior distribution of \py{x} and
6981
the predictive distribution of \py{y}.
6982
\index{predictive distribution}
6983

6984
\begin{code}
    ete = ElapsedTimeEstimator(wtc,
                               lam=2.0/60,
                               num_passengers=15)
\end{code}
6989

6990
The parameters are the \py{WaitTimeCalculator}, the passenger
6991
arrival rate, \py{lam} (expressed in passengers per second),
6992
and the observed number of passengers, let's say 15.
6993

6994
Here is the definition of \py{ElapsedTimeEstimator}:
6995

6996
\begin{code}
class ElapsedTimeEstimator(object):

    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)

        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))

        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{code}
7007

7008
\verb"prior_x" and \verb"post_x" are the prior and
7009
posterior distributions of elapsed time.  \verb"pmf_y" is
7010
the predictive distribution of wait time.
7011

7012
\py{ElapsedTimeEstimator} uses \py{Elapsed} and \py{PredictWaitTime},
7013
defined below.
7014

7015
\py{Elapsed} is a Suite that represents the hypothetical
7016
distribution of \py{x}.  The prior distribution of \py{x}
7017
comes straight from the \py{WaitTimeCalculator}.  Then we
7018
use the data, which consists of the arrival rate, \py{lam},
7019
and the number of passengers on the platform, to compute
7020
the posterior distribution.
7021

7022
Here's the definition of \py{Elapsed}:
7023

7024
\begin{code}
class Elapsed(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{code}
7033

7034
As always, \py{Likelihood} takes a hypothesis and data, and
7035
computes the likelihood of the data under the hypothesis.
7036
In this case \py{hypo} is the elapsed time since the last train
7037
and \py{data} is a tuple of \py{lam} and the number of
7038
passengers.
7039
\index{likelihood}
7040

7041
The likelihood of the data is the probability of getting
7042
\py{k} arrivals in \py{x} time, given arrival rate
7043
\py{lam}.  We compute that using the PMF of the Poisson
7044
distribution.
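
As a rough illustration of this update, here is a standalone sketch that uses the Poisson PMF from first principles. To keep it short, I assume a uniform prior for \py{x} from 0 to 1200 seconds, which is not the prior the chapter uses (that comes from the \py{WaitTimeCalculator}). The helper \py{poisson_pmf} is defined here and stands in for \py{EvalPoissonPmf}.

```python
import math

def poisson_pmf(k, mu):
    """Probability of k events when the expected number is mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

lam = 2.0 / 60      # arrival rate, in passengers per second
k = 15              # passengers on the platform

# Assumed uniform prior for elapsed time: 0 to 1200 seconds
xs = range(0, 1201, 10)
posterior = {x: poisson_pmf(k, lam * x) for x in xs}

# Normalize so the probabilities add up to 1
total = sum(posterior.values())
posterior = {x: p / total for x, p in posterior.items()}

# Most likely elapsed time, in seconds and in minutes
mode = max(posterior, key=posterior.get)
print(mode, mode / 60)
```

With a uniform prior, the most likely elapsed time is the one where the expected number of arrivals, \py{lam * x}, equals the observed number, \py{k}.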
7045
\index{Poisson distribution}
7046

7047
Finally, here's the definition of \py{PredictWaitTime}:
7048

7049
\begin{code}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{code}
7055

7056
\verb"pmf_zb" is the distribution of gaps between trains;
7057
\verb"pmf_x" is the distribution of elapsed time, based on
7058
the observed number of passengers.  Since \py{y = zb - x},
7059
we can compute
7060

7061
\begin{code}
    pmf_y = pmf_zb - pmf_x
\end{code}
7064

7065
The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
7066
all pairs of \py{zb} and \py{x}, computes the differences, and adds
7067
the results to \verb"pmf_y".
7068

7069
The resulting Pmf includes some negative values, which we know are
7070
impossible.  For example, if you arrive during a gap of 5 minutes, you
7071
can't wait more than 5 minutes.  \py{RemoveNegatives} removes the
7072
impossible values from the distribution and renormalizes.
7073

7074
\begin{code}
def RemoveNegatives(pmf):
    # copy the values so we can remove items while looping
    for val in list(pmf.Values()):
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{code}
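
Here is a small dictionary-based sketch of the same two steps, enumerating all pairs and then removing the negative differences. The function names are mine, and the distribution of \py{x} is made up to keep the arithmetic easy to follow.

```python
def sub_pmf(pmf1, pmf2):
    """Distribution of the difference of two independent quantities."""
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            diff = v1 - v2
            result[diff] = result.get(diff, 0) + p1 * p2
    return result

def remove_negatives(pmf):
    """Drop negative values and renormalize; returns a new dict."""
    kept = {v: p for v, p in pmf.items() if v >= 0}
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}

pmf_zb = {5: 1/3, 10: 2/3}     # gaps seen by passengers, in minutes
pmf_x = {2: 0.5, 7: 0.5}       # made-up distribution of elapsed time

pmf_y = remove_negatives(sub_pmf(pmf_zb, pmf_x))
print(pmf_y)
```

In this example the only impossible combination is a 5 minute gap with 7 minutes elapsed; after it is removed, the remaining probabilities are 0.6 for a wait of 3 minutes and 0.4 for a wait of 8 minutes.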
7081

7082
Figure~\ref{fig.redline3} shows the results.  The prior distribution
7083
of \py{x} is the same as the distribution of \py{y} in
7084
Figure~\ref{fig.redline2}.  The posterior distribution of \py{x}
7085
shows that, after seeing 15 passengers on the platform, we believe
7086
that the time since the last train is probably 5--10 minutes.  The
7087
predictive distribution of \py{y} indicates that we expect the next
7088
train in less than 5 minutes, with about 80\% confidence.
7089
\index{predictive distribution}
7090

7091

7092
\section{Estimating the arrival rate}
7093

7094
\begin{figure}
7095
% redline.py
7096
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
7097
\caption{Prior and posterior distributions of \py{lam} based
7098
on five days of passenger data. }
7099
\label{fig.redline1}
7100
\end{figure}
7101

7102
The analysis so far has been based on the assumption that we know (1)
7103
the distribution of gaps and (2) the passenger arrival rate.  Now we
7104
are ready to relax the second assumption.
7105

7106
Suppose that you just moved to Boston, so you don't know much about
7107
the passenger arrival rate on the Red Line.  After a few days of
7108
commuting, you could make a guess, at least qualitatively.  With
7109
a little more effort, you could estimate $\lam$ quantitatively.
7110
\index{arrival rate}
7111

7112
Each day when you arrive at the platform, you should note the
7113
time and the number of passengers waiting (if the platform is too
7114
big, you could choose a sample area).  Then you should record your
7115
wait time and the
7116
number of new arrivals while you are waiting.
7117

7118
After five days, you might have data like this:
7119
%
7120
\begin{code}
k1      y     k2
--     ---    --
17     4.6     9
22     1.0     0
23     1.4     4
18     5.4    12
4      5.8    11
\end{code}
7129
%
7130
where \py{k1} is the number of passengers waiting when you arrive,
7131
\py{y} is your wait time in minutes, and \py{k2} is the number of
7132
passengers who arrive while you are waiting.
7133

7134
Over the course of one week, you waited 18 minutes and saw 36
7135
passengers arrive, so you would estimate that the arrival rate is
7136
2 passengers per minute.  For practical purposes that estimate is
7137
good enough, but for the sake of completeness I
7138
will compute a posterior distribution for $\lam$ and show how
7139
to use that distribution in the rest of the analysis.
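
As a quick check on the point estimate, the arithmetic from the
\py{y} and \py{k2} columns of the table can be written out as follows
(the hat notation is introduced here only for illustration):

\[ \hat{\lam} = \frac{\sum k_2}{\sum y}
  = \frac{9 + 0 + 4 + 12 + 11}{4.6 + 1.0 + 1.4 + 5.4 + 5.8}
  = \frac{36}{18.2} \approx 1.98 \]

passengers per minute.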

\py{ArrivalRate} is a \py{Suite} that represents hypotheses about
$\lam$.  As always, \py{Likelihood} takes a hypothesis and data,
and computes the likelihood of the data under the hypothesis.

In this case the hypothesis is a value of $\lam$.  The data is a
pair, \py{y, k}, where \py{y} is a wait time and \py{k} is the
number of passengers that arrived.

\begin{code}
class ArrivalRate(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        lam = hypo
        y, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * y)
        return like
\end{code}

This \py{Likelihood} might look familiar; it
is almost identical to \py{Elapsed.Likelihood} in
Section~\ref{elapsed}.  The difference is that in
\py{Elapsed.Likelihood} the hypothesis is \py{x}, the elapsed time; in
\py{ArrivalRate.Likelihood} the hypothesis is \py{lam}, the arrival
rate.  But in both cases the likelihood is the probability of seeing
\py{k} arrivals in some period of time, given \py{lam}.

\py{ArrivalRateEstimator} encapsulates the process of estimating
$\lam$.  The parameter, \verb"passenger_data", is a list
of \py{k1, y, k2} tuples, as in the table above.
\index{numpy}

\begin{code}
class ArrivalRateEstimator(object):

    def __init__(self, passenger_data):
        # 0 to 5 passengers per minute, converted to passengers per second
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60

        self.prior_lam = ArrivalRate(hypos)

        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{code}

\verb"__init__" builds
\py{hypos}, which is a sequence of hypothetical values for \py{lam},
then builds the prior distribution, \verb"prior_lam".
The \py{for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".

Figure~\ref{fig.redline1} shows
the prior and posterior distributions.  As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute.  But the spread of the posterior distribution captures our
uncertainty about $\lam$ based on a small sample.
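
To make the update concrete without depending on \py{thinkbayes},
here is a minimal sketch of the same grid computation using only the
standard library.  It is not the code from \py{redline.py}: the helper
\verb"poisson_pmf" and the plain dictionary used as a posterior are
assumptions made for illustration, and rates are kept in passengers
per minute (rather than per second) so that they match the table.

```python
import math

def poisson_pmf(k, mu):
    """Probability of k events when the expected count is mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

# (k1, y, k2) rows from the table; y is in minutes
passenger_data = [
    (17, 4.6, 9),
    (22, 1.0, 0),
    (23, 1.4, 4),
    (18, 5.4, 12),
    (4, 5.8, 11),
]

# 51 hypothetical rates from 0 to 5 passengers per minute, uniform prior
hypos = [5.0 * i / 50 for i in range(51)]
posterior = {lam: 1.0 for lam in hypos}

# multiply in the likelihood of each day's (y, k2) observation
for k1, y, k2 in passenger_data:
    for lam in hypos:
        posterior[lam] *= poisson_pmf(k2, lam * y)

# normalize so the probabilities add up to 1
total = sum(posterior.values())
for lam in hypos:
    posterior[lam] /= total

post_mean = sum(lam * prob for lam, prob in posterior.items())
```

Under these assumptions \verb"post_mean" comes out close to the
point estimate of 2 passengers per minute, and the spread of
\py{posterior} reflects how little five days of data can tell us.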


\section{Incorporating uncertainty}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of \py{y} for possible values
  of \py{lam}. }
\label{fig.redline4}
\end{figure}

Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}

\begin{enumerate}

\item Implement the analysis based on a deterministic value of the
  uncertain parameter (in this case $\lam$).

\item Compute the distribution of the uncertain parameter.

\item Run the analysis for each value of the parameter, and generate a
  set of predictive distributions.
\index{predictive distribution}

\item Compute a mixture of the predictive distributions, using the
  weights from the distribution of the parameter.
\index{mixture}

\end{enumerate}

We have already done steps (1) and (2).  I wrote a class
called \py{WaitMixtureEstimator} to handle steps (3) and (4).

\begin{code}
class WaitMixtureEstimator(object):

    def __init__(self, wtc, are, num_passengers=15):
        self.metapmf = thinkbayes.Pmf()

        for lam, prob in sorted(are.post_lam.Items()):
            ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
            self.metapmf.Set(ete.pmf_y, prob)

        self.mixture = thinkbayes.MakeMixture(self.metapmf)
\end{code}

\py{wtc} is the \py{WaitTimeCalculator} that contains the
distribution of \py{zb}.  \py{are} is the \py{ArrivalRateEstimator}
that contains the distribution of \py{lam}.

The first line makes a meta-Pmf that maps from each possible
distribution of \py{y} to its probability.  For each value
of \py{lam}, we use \py{ElapsedTimeEstimator} to
compute the corresponding distribution of
\py{y} and store it in the meta-Pmf.  Then
we use \py{MakeMixture} to compute the mixture.
\index{MakeMixture}
\index{meta-Pmf}
\index{mixture}
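
The mixture step can be sketched in a few lines.  This is only an
illustration of the idea, not the implementation of
\py{thinkbayes.MakeMixture}: the function \verb"make_mixture" is
hypothetical, each distribution is a plain dictionary that maps
values to probabilities, and the meta-Pmf is a list of
(distribution, weight) pairs.

```python
from collections import defaultdict

def make_mixture(weighted_pmfs):
    """Combine distributions into one, weighting each by its probability.

    weighted_pmfs: list of (pmf, weight) pairs, where each pmf is a
    dict that maps values to probabilities.
    """
    mix = defaultdict(float)
    for pmf, weight in weighted_pmfs:
        for value, prob in pmf.items():
            mix[value] += weight * prob

    # renormalize in case the weights do not add up to exactly 1
    total = sum(mix.values())
    return {value: prob / total for value, prob in mix.items()}

# two toy predictive distributions with weights 0.25 and 0.75
pmf_a = {1: 0.5, 2: 0.5}
pmf_b = {2: 1.0}
mixture = make_mixture([(pmf_a, 0.25), (pmf_b, 0.75)])
```

In the toy example, the value 1 appears only in the first distribution, so
its probability in the mixture is $0.25 \times 0.5 = 0.125$; the value 2
gets the remaining 0.875.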

%For purposes of comparison, I also compute the distribution of
%\py{y} based on a single point estimate of \py{lam}, which is
%the mean of the posterior distribution.

Figure~\ref{fig.redline4} shows the results.  The shaded lines
in the background are the distributions of \py{y} for each value
of \py{lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.

In this case we could get a very similar result using a single point
estimate of \py{lam}.  So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.

In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output.  In this case, posterior variability
in \py{lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}


\section{Decision analysis}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform. }
\label{fig.redline5}
\end{figure}

At this point we can use the number of passengers on the platform
to predict the distribution of wait times.  Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}

Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail.  Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.

In that case I would like to know the probability that \py{y} exceeds
15 minutes as a function of \verb"num_passengers".  It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".

But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.

I only have data from one week,
and the longest delay I observed was 15 minutes.  So I can't
estimate the frequency of longer delays accurately.

However, I can use previous observations to make at least a coarse
estimate.  When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop.  So I estimate that there are about
3 major delays per year.

But remember that my observations are biased.  I am more likely
to observe long delays because they affect a large number
of passengers.  So we should treat my observations as a sample
of \py{zb} rather than \py{z}.  Here's how we can do that.
\index{observer bias}

During my year of commuting, I took the Red Line home about 220
times.  So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:

\begin{code}
    n = 220
    cdf_z = thinkbayes.MakeCdfFromList(gap_times)
    sample_z = cdf_z.Sample(n)
    pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{code}

Next I bias \verb"pmf_z" to get the distribution of
\py{zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):

\begin{code}
    cdf_zb = BiasPmf(pmf_z).MakeCdf()
    sample_zb = cdf_zb.Sample(n) + [1800, 2400, 3000]
\end{code}

\py{Cdf.Sample} is more efficient than \py{Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.

Next I use the sample of \py{zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:

\begin{code}
    pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
    xs = MakeRange(low=60)
    pmf_zb = pdf_zb.MakePmf(xs)
\end{code}

Finally I unbias the distribution of \py{zb} to get the
distribution of \py{z}, which I use to create the
\py{WaitTimeCalculator}:

\begin{code}
    pmf_z = UnbiasPmf(pmf_zb)
    wtc = WaitTimeCalculator(pmf_z)
\end{code}
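
The biasing and unbiasing steps are inverses of each other, which a
small sketch can show.  The functions below are illustrative stand-ins
and not the definitions of \py{BiasPmf} and \py{UnbiasPmf} from
\py{redline.py}; they assume each distribution is a plain dictionary
from gap length (in seconds, strictly positive) to probability.

```python
def normalize(pmf):
    """Rescale the probabilities so they add up to 1."""
    total = sum(pmf.values())
    return {x: p / total for x, p in pmf.items()}

def bias_pmf(pmf):
    """Distribution of gaps as seen by passengers: each gap is weighted
    by its length, because longer gaps are observed by more people."""
    return normalize({x: p * x for x, p in pmf.items()})

def unbias_pmf(pmf):
    """Inverse of bias_pmf: divide out the length of each gap."""
    return normalize({x: p / x for x, p in pmf.items()})

# toy example: gaps of 5 and 10 minutes are equally common
pmf_z = {300: 0.5, 600: 0.5}
pmf_zb = bias_pmf(pmf_z)        # passengers see the long gap twice as often
recovered = unbias_pmf(pmf_zb)  # back to the actual distribution
```

In the toy example the 10-minute gap has probability 2/3 in the biased
distribution, and unbiasing recovers the original 50/50 split.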

This process is complicated, but
all of the steps are operations we have seen before.
Now we are ready to compute the probability of a long wait.

\begin{code}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{code}

Given the number of passengers on the platform,
\py{ProbLongWait} makes an \py{ElapsedTimeEstimator},
extracts the distribution of wait time, and
computes the probability that the wait time
exceeds \py{minutes}.

Figure~\ref{fig.redline5} shows the result.  When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.

If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.

Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.
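
As a sketch of that last step, suppose we assign a cost to missing
the connection and a cost to the taxi fare.  Everything here is
hypothetical: the function names, the cost values, and the simplifying
assumption that a taxi always makes the connection.  Under those
assumptions, waiting has expected cost equal to the probability of a
long wait times the cost of missing the train, and we should take a
taxi when that exceeds the fare.

```python
def expected_costs(prob_long_wait, cost_miss, cost_taxi):
    """Return (expected cost of waiting, cost of taking a taxi).

    Assumes the taxi always makes the connection, so its cost is
    just the fare.
    """
    return prob_long_wait * cost_miss, cost_taxi

def should_take_taxi(prob_long_wait, cost_miss, cost_taxi):
    """True if the taxi is the cheaper option in expectation."""
    wait_cost, taxi_cost = expected_costs(prob_long_wait,
                                          cost_miss, cost_taxi)
    return taxi_cost < wait_cost

# Hypothetical costs: missing the connection costs 100 and a taxi
# costs 20, so the break-even probability is 20 / 100 = 0.2.
decision_low = should_take_taxi(0.1, 100, 20)   # stay and wait
decision_high = should_take_taxi(0.3, 100, 20)  # take the taxi
```

The probability passed in would come from something like
\py{ProbLongWait}, so the threshold on the number of passengers is the
point where that probability crosses the ratio of the two costs.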

\section{Discussion}

The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day.  For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions.  For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of \py{lam} would be too low, so the
estimates of \py{x} and \py{y} would be too high.

If special events are as common as major delays, it would
be important to include them in the model.  We could do that by
extending the distribution of \py{lam} to include occasional
large values.

We started with the assumption that we know the
distribution of \py{z}.
As an alternative, a passenger could estimate \py{z}, but it would
not be easy.
As a passenger, you
observe only your own wait time, \py{y}.  Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, \py{z}.

However, we could make some inferences about \py{zb}.  If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, \py{x}.  Then we observe
\py{y}.  If we add the posterior distribution of \py{x} to
the observed \py{y}, we get a distribution that represents
our posterior belief about the observed value of \py{zb}.

We can use this distribution to update our beliefs about the
distribution of \py{zb}.  Finally, we can compute the
inverse of \py{BiasPmf} to get from the distribution of \py{zb}
to the distribution of \py{z}.

I leave this analysis as an exercise for the
reader.  One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lam$.  Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm.  $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm.  What is the posterior distribution of $\lam$?

\end{quote}

You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.

\end{exercise}



\chapter{Hypothesis Testing}
\label{hypotest}

\section{Back to the Euro problem}

In Section~\ref{euro} I presented a problem from MacKay's {\it Information
  Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

  \begin{quote}
        When spun on edge 250 times, a Belgian one-euro coin came
        up heads 140 times and tails 110.  `It looks very suspicious
        to me,' said Barry Blight, a statistics lecturer at the London
        School of Economics.  `If the coin were unbiased, the chance of
        getting a result as extreme as that would be less than 7\%.'
  \end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}

In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}

In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}

If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}.  In fact, we already wrote the function
that does it.

\begin{code}
    def Likelihood(self, data, hypo):
        x = hypo / 100.0
        heads, tails = data
        like = x**heads * (1-x)**tails
        return like
\end{code}

To use it we can
create a \py{Euro} suite and invoke
\py{Likelihood}:

\begin{code}
    suite = Euro()
    likelihood = suite.Likelihood(data, 50)
\end{code}

\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.

It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.

One possibility is to cheat and look at the data before we define
the hypothesis.  In that case we would say that ``biased'' means that
the probability of heads is 140/250.

\begin{code}
    actual_percent = 100.0 * 140 / 250
    likelihood = suite.Likelihood(data, actual_percent)
\end{code}

This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1.  So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}

But using the data to formulate the hypothesis
is obviously bogus.  By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}

\section{Making a fair comparison}
\label{suitelike}

To make a legitimate comparison, we have to define $B$ without looking
at the data.  So let's try a different definition.  If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side.  You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely.  So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''

We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses.  We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.

\begin{code}
    like40 = suite.Likelihood(data, 40)
    like60 = suite.Likelihood(data, 60)
    likelihood = 0.5 * like40 + 0.5 * like60
\end{code}

The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}

More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$.  In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.

\begin{code}
    b_uniform = Euro(xrange(0, 101))
    b_uniform.Remove(50)
    b_uniform.Normalize()
\end{code}

I initialize \verb"b_uniform" with values from 0 to 100, then
remove the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair.  Whether you remove it or not
has almost no effect on the result.

To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.

\begin{code}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{code}

The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}

If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update.  To refresh your memory, here's the \py{Update} function:

\begin{code}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        return self.Normalize()
\end{code}

And here's \py{Normalize}:

\begin{code}
    def Normalize(self):
        total = self.Total()

        factor = 1.0 / total
        for x in self.d:
            self.d[x] *= factor

        return total
\end{code}

The return value from \py{Normalize} is the total of the
probabilities in the Suite before they are rescaled, which is the
average of the likelihoods
for the sub-hypotheses, weighted by the prior probabilities.  And
\py{Update} passes this value along, so instead of using
\py{SuiteLikelihood}, we could compute the likelihood of
\verb"b_uniform" like this:

\begin{code}
    likelihood = b_uniform.Update(data)
\end{code}
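
In symbols, writing $H_i$ for the sub-hypotheses that make up $B$
(notation introduced here only for illustration), the quantity
computed by \verb"SuiteLikelihood", and returned by \py{Update} when
the prior is normalized, is

\[ P(D|B) = \sum_i P(H_i) \, P(D|H_i) \]

where $P(H_i)$ is the prior probability of each sub-hypothesis and
$P(D|H_i)$ is its likelihood.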



\section{The triangle prior}

In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%.  If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}

\begin{code}
    b_triangle = TrianglePrior()
    likelihood = b_triangle.Update(data)
\end{code}

The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}

The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}

\begin{tabular}{|l|r|r|}
\hline
Hypothesis        & Likelihood        & Bayes   \\
                  & $\times 10^{-76}$ & Factor  \\
\hline
$F$               & 5.5   & --     \\
\verb"B_cheat"    & 34    &  6.1   \\
\verb"B_two"      & 7.4   &  1.3   \\
\verb"B_uniform"  & 2.6   &  0.47  \\
\verb"B_triangle" & 4.6   &  0.84  \\
\hline
\end{tabular}
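
The first four rows of the table can be checked with a few lines of
plain Python.  This is a sketch that stands in for the \py{Euro}
suite rather than the code in \py{euro3.py}: the function
\verb"likelihood" and the variable names are assumptions made for
illustration, and the triangle prior is left out because its
definition is not repeated in this chapter.

```python
def likelihood(heads, tails, x):
    """Likelihood of the data if the probability of heads is x."""
    return x**heads * (1 - x)**tails

heads, tails = 140, 110

# F: the coin is fair
like_f = likelihood(heads, tails, 0.5)

# B_cheat: the hypothesis is chosen after looking at the data
like_cheat = likelihood(heads, tails, 140.0 / 250)

# B_two: x is either 0.4 or 0.6, with equal prior probability
like_two = (0.5 * likelihood(heads, tails, 0.4) +
            0.5 * likelihood(heads, tails, 0.6))

# B_uniform: 0 to 100 percent, equally likely, with 50 removed
percents = [p for p in range(101) if p != 50]
like_uniform = sum(likelihood(heads, tails, p / 100.0)
                   for p in percents) / len(percents)

# Bayes factors relative to F
bayes_factors = {
    'B_cheat': like_cheat / like_f,
    'B_two': like_two / like_f,
    'B_uniform': like_uniform / like_f,
}
```

The likelihoods are on the order of $10^{-76}$, which is small but
still well within the range of floating-point numbers, so no
logarithms are needed here.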

Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.

In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means.  This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.

My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$.  In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}

Part of the answer is historical.  Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:

\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}

In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.''  Other authors have suggested variations on the
wording.  To avoid arguing about adjectives, we could think about odds
instead.

If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1.  In terms of probability,
the data changed your degree of belief from 50\% to about 67\%.  For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.

On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, which corresponds
to a probability of more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.
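
The conversion from odds to probability is simple enough to write
down.  The function name below is hypothetical; the arithmetic is the
odds form of Bayes's theorem, where posterior odds are prior odds
times the Bayes factor, and odds of $o$ in favor correspond to
probability $o / (o + 1)$.

```python
def posterior_prob(prior_odds, bayes_factor):
    """Posterior probability, given prior odds in favor of a
    hypothesis and the Bayes factor of the evidence."""
    post_odds = prior_odds * bayes_factor
    return post_odds / (post_odds + 1.0)

weak = posterior_prob(1.0, 2.0)      # odds 2:1, probability 2/3
strong = posterior_prob(1.0, 100.0)  # odds 100:1, probability 100/101
```

With even prior odds, a Bayes factor of 2 moves the probability from
1/2 to 2/3, and a Bayes factor of 100 moves it to 100/101.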

%TODO: postpone this section
\section{The beta distribution}
\label{beta}

\index{beta distribution}
There is one more optimization that solves this problem
even faster.

So far we have used a Pmf object to represent a discrete set of
values for \py{x}.  Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}

The beta distribution is defined on the interval from 0 to 1
(including both endpoints), so it is a natural choice for describing
proportions and probabilities.  But wait, it gets better.

%TODO: explain the binomial distribution in the previous section

It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}.  That means that if the prior
distribution for \py{x} is a beta distribution, the posterior is also
a beta distribution.  But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}

The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or \py{alpha} and \py{beta}.  If the prior
is a beta distribution with parameters \py{alpha} and \py{beta}, and
we see data with \py{h} heads and \py{t} tails, the posterior is a
beta distribution with parameters \py{alpha+h} and \py{beta+t}.  In
other words, we can do an update with two additions.
\index{parameter}

So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior.  Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match.  The
beta distribution with \py{alpha=1} and \py{beta=1} is uniform from
0 to 1.

Let's see how we can take advantage of all this.
\py{thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}

\begin{code}
class Beta(object):

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{code}

By default \verb"__init__" makes a uniform distribution.
\py{Update} performs a Bayesian update:

\begin{code}
    def Update(self, data):
        heads, tails = data
        self.alpha += heads
        self.beta += tails
\end{code}

\py{data} is a pair of integers representing the number of
heads and tails.

So we have yet another way to solve the Euro problem:

\begin{code}
    beta = thinkbayes.Beta()
    beta.Update((140, 110))
    print beta.Mean()
\end{code}

\py{Beta} provides \py{Mean}, which
computes a simple function of \py{alpha}
and \py{beta}:

\begin{code}
    def Mean(self):
        return float(self.alpha) / (self.alpha + self.beta)
\end{code}

For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.

\py{Beta} also provides \py{EvalPdf}, which evaluates
the probability density
function (PDF) of the beta distribution, up to a normalizing constant:
\index{probability density function}
\index{PDF}

\begin{code}
    def EvalPdf(self, x):
        # unnormalized: the constant factor is omitted
        return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{code}

Finally, \py{Beta} provides \py{MakePmf}, which
uses \py{EvalPdf} to generate a discrete approximation
of the beta distribution.
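
To see that the conjugate update and the discrete approximation
agree, here is a minimal sketch that does not use \py{thinkbayes}.
The names \verb"eval_pdf", \verb"grid_mean", and \verb"exact_mean"
are made up for this example, and the 101-point grid is an
assumption; the idea is only that evaluating the unnormalized density
on a grid and normalizing gives a Pmf whose mean is close to
\py{alpha / (alpha + beta)}.

```python
# uniform prior, Beta(1, 1), updated with 140 heads and 110 tails
alpha, beta = 1 + 140, 1 + 110

def eval_pdf(x):
    """Unnormalized density of the beta distribution at x."""
    return x**(alpha - 1) * (1 - x)**(beta - 1)

# evaluate the density at 101 equally spaced points and normalize
xs = [i / 100.0 for i in range(101)]
densities = [eval_pdf(x) for x in xs]
total = sum(densities)
pmf = {x: d / total for x, d in zip(xs, densities)}

grid_mean = sum(x * p for x, p in pmf.items())
exact_mean = float(alpha) / (alpha + beta)   # 141 / 252
```

Because the Pmf is normalized after the density is evaluated, the
missing constant factor in the density does not matter.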

%This expression might look familiar.  Here's {\tt
%  thinkbayes.EvalBinomialPmf}

%\begin{code}
%def EvalBinomialPmf(x, yes, no):
%    return x**yes * (1-x)**no
%\end{code}

%It's the same function, but in \py{EvalPdf}, we think of \py{x} as a
%random variable and \py{alpha} and \py{beta} as parameters; in {\tt
%  EvalBinomialPmf}, \py{x} is the parameter, and \py{yes} and {\tt
%  no} are random variables.  Distributions like these that share the
%same PDF are called {\bf conjugate distributions}.
%\index{conjugate distribution}


\section{Exercises}

%TODO: Revisit the Poincare problem; how much evidence would
% Poincare have at the end of the year to distinguish between
% N(1000, sigma) and Max_4 N(950, sigma2)?

\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}

What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not?  Or are you
more skeptical about it?  Write down your prior odds.

Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?

%TODO: figure out where to talk about Cromwell's rule
Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$.  In the Euro problem,
if you are convinced that \py{x} is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}

This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}

Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.''  For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}
\end{exercise}


\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.

Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP.  Would that change your mind?

If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's article, ``Of
Miracles,'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}

\end{exercise}



\chapter{Evidence}
7945
\label{evidence}
7946

7947
%TODO: Make this chapter about dynamic testing; check if it is
7948
% optimal to chose questions where the respondent has a 50/50
7949
% chance.
7950

7951
\section{Interpreting SAT scores}
7952

7953
Suppose you are the Dean of Admission at a small engineering
7954
college in Massachusetts, and you are considering two candidates,
7955
Alice and Bob, whose qualifications are similar in many ways,
7956
with the exception that Alice got a higher score on the Math
7957
portion of the SAT, a standardized test intended to measure
7958
preparation for college-level work in mathematics.
7959
\index{SAT}
7960
\index{standardized test}
7961

If Alice got 780 and Bob got 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}

Now in reality, both scores are very good, and both
candidates are probably well prepared for college math.  So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students.  But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''

To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model.  I'll pretend, temporarily, that
all SAT questions are equally difficult.  Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}

But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly.  This
simplification makes it easy to compute the likelihood of a given
score.

\section{The scale}

In order to understand SAT scores, we have to understand the scoring
and scaling process.  Each test-taker gets a raw score based on the
number of correct and incorrect answers.  The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}

In 2009, there were 54 questions on the math SAT.  The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.
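
As a small sketch of that rule (the function name \verb"raw_score" is
made up here, and I assume that questions left blank neither add nor
subtract points):

\begin{code}
def raw_score(correct, incorrect):
    """One point per correct answer, minus 1/4 point per incorrect answer."""
    return correct - incorrect / 4
\end{code}

For example, 50 correct answers and 4 incorrect ones yield a raw score of 49.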

The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores.  I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}

You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.

\section{The prior}

The College Board also publishes the distribution of scaled scores
for all test-takers.  If we convert each scaled score to a raw score,
and divide by the number of questions, the result is an estimate
of \verb"p_correct".
So we can use the distribution of raw scores to model the
prior distribution of \verb"p_correct".

Here is the code that reads and processes the data:

\begin{code}
class Exam(object):

    def __init__(self):
        # Interpolator between raw and scaled scores
        self.scale = ReadScale()
        # (scaled score, frequency) pairs for all test-takers
        scores = ReadRanks()
        score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
        # convert the Pmf of scaled scores to a Pmf of raw scores
        self.raw = self.ReverseScale(score_pmf)
        self.max_score = max(self.raw.Values())
        # dividing by the maximum raw score estimates p_correct
        self.prior = DivideValues(self.raw, self.max_score)
\end{code}

\py{Exam} encapsulates the information we have about the exam.
\py{ReadScale} and \py{ReadRanks} read files and return
objects that contain the data:
\py{self.scale} is the \py{Interpolator} that converts
from raw to scaled scores and back;  \py{scores} is a list
of (score, frequency) pairs.

\verb"score_pmf" is the Pmf of
scaled scores.   \py{self.raw} is the Pmf of raw scores, and
\py{self.prior} is the Pmf of \verb"p_correct".

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of \py{p_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}

Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct".  This distribution is approximately Gaussian, but it
is compressed at the extremes.  By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}

For each test-taker, I define a Suite called \py{Sat} that
represents the distribution of \verb"p_correct".  Here's the definition:

\begin{code}
class Sat(thinkbayes.Suite):

    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)

        self.exam = exam
        self.score = score

        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)

        # update based on an exam score
        self.Update(score)
\end{code}

\verb"__init__" takes an Exam object and a scaled score.  It makes a
copy of the prior distribution and then updates itself based on the
exam score.

As usual, we inherit \py{Update} from \py{Suite} and provide
\py{Likelihood}:

\begin{code}
    def Likelihood(self, data, hypo):
        p_correct = hypo
        score = data

        k = self.exam.Reverse(score)
        n = self.exam.max_score
        like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
        return like
\end{code}

\py{hypo} is a hypothetical
value of \verb"p_correct", and \py{data} is a scaled score.

To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers.  With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}
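
For reference, here is a sketch of the binomial PMF in plain Python.  It
is meant to illustrate what \py{EvalBinomialPmf} computes, not to
reproduce the implementation in \py{thinkbayes}; the function name is
an assumption, and \py{k} is assumed to be an integer.

\begin{code}
from math import comb

def eval_binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with probability p of success."""
    return comb(n, k) * p**k * (1 - p)**(n - k)
\end{code}

For fixed \py{k} and \py{n}, this function is largest when \py{p} is
\py{k/n}, which is why a high score pulls the posterior toward high
values of \verb"p_correct".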

\section{Posterior}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of \py{p_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}

Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.

Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?''  We can use the
posterior distributions of \verb"p_correct" to answer this question.

To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:

\begin{itemize}

\item $A$: \verb"p_correct" is higher for Alice than for Bob.

\item $B$: \verb"p_correct" is higher for Bob than for Alice.

\end{itemize}

To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.

So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:

\begin{code}
class TopLevel(thinkbayes.Suite):

    def Update(self, data):
        a_sat, b_sat = data

        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)

        a_like += c_like / 2
        b_like += c_like / 2

        self.Mult('A', a_like)
        self.Mult('B', b_like)

        self.Normalize()
\end{code}

Usually when we define a new Suite, we inherit \py{Update}
and provide \py{Likelihood}.  In this case I override \py{Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.

The data passed to \py{Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".

\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.

\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values.  If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero.  So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".
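
Here is a sketch of what the three comparison functions compute,
using dictionaries that map values to probabilities and assuming the
two quantities are independent.  The function names and the two small
distributions are assumptions for illustration; they are not the
posteriors for Alice and Bob.

\begin{code}
def prob_greater(pmf1, pmf2):
    """Probability that a value from pmf1 exceeds a value from pmf2."""
    return sum(p1 * p2
               for v1, p1 in pmf1.items()
               for v2, p2 in pmf2.items()
               if v1 > v2)

def prob_equal(pmf1, pmf2):
    """Probability that values drawn from pmf1 and pmf2 are equal."""
    return sum(p1 * pmf2.get(v1, 0) for v1, p1 in pmf1.items())

# made-up distributions of p_correct, in percent
alice = {80: 0.2, 90: 0.5, 100: 0.3}
bob = {70: 0.3, 80: 0.5, 90: 0.2}

a_like = prob_greater(alice, bob)
b_like = prob_greater(bob, alice)
c_like = prob_equal(alice, bob)
\end{code}

The three results add up to 1, so splitting the probability of a tie
evenly between the other two leaves them normalized.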

Here is the code that creates \py{TopLevel} and updates it:

\begin{code}
    exam = Exam()
    a_sat = Sat(exam, 780)
    b_sat = Sat(exam, 740)

    top = TopLevel('AB')
    top.Update((a_sat, b_sat))
    top.Print()
\end{code}

The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21.  The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions.  If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}
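
In odds form, the arithmetic looks like this, where $D$ stands for the
pair of observed scores (notation introduced here for convenience):
\[
\frac{P(D \mid A)}{P(D \mid B)} = \frac{0.79}{0.21} \approx 3.8
\]
With prior odds of 1, the posterior odds equal the Bayes factor, and
converting the odds back to a probability gives
\[
P(A \mid D) = \frac{3.8}{1 + 3.8} \approx 0.79
\]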

\section{A better model}

Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.

But how big is the modeling error?  If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough.  If it's large,
we need a better model.
\index{modeling error}

In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small.  So if
you are satisfied with the simple model, you can skip to the next
chapter.  If you want to see how the more realistic model works,
read on...

\begin{itemize}

\item Assume that each test-taker has some
  degree of \py{efficacy}, which measures their
  ability to answer SAT questions.
\index{efficacy}

\item Assume that each question has some level of
  \py{difficulty}.

\item Finally, assume that the chance that a test-taker answers a
  question correctly is related to \py{efficacy} and \py{difficulty}
  according to this function:

\begin{code}
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{code}

\end{itemize}

This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}.  {\tt
  efficacy} and \py{difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}

When \py{efficacy} and \py{difficulty} are equal, the
probability of getting the question right is 50\%.  As
\py{efficacy} increases, this probability approaches 100\%.
As it decreases (or as \py{difficulty} increases), the
probability approaches 0\%.

Given the distribution of \py{efficacy} across test-takers
and the distribution of \py{difficulty} across questions, we
can compute the expected distribution of raw scores.  We'll do that
in two steps.  First, for a person with given \py{efficacy},
we'll compute the distribution of raw scores.

\begin{code}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])

    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{code}

\py{difficulties} is a list of difficulties, one for each question.
\py{ps} is a list of probabilities, and \py{pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:

\begin{code}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{code}

\py{dist} is the sum of these Pmfs.  Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums.  In order to use Python's \py{sum} to add up Pmfs,
we have to provide \py{pmf0}, which is the identity for Pmfs,
so \py{pmf + pmf0} is always \py{pmf}.
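
As a sketch of what adding Pmfs means, here is a version that uses
plain dictionaries in place of Pmf objects.  The names \verb"add_pmfs"
and \verb"binary_pmf" and the two probabilities are assumptions for
illustration.

\begin{code}
def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of two independent quantities."""
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            result[v1 + v2] = result.get(v1 + v2, 0) + p1 * p2
    return result

def binary_pmf(p):
    """Two-valued distribution: 1 with probability p, otherwise 0."""
    return {1: p, 0: 1 - p}

# the identity: all of the probability on the value 0
pmf0 = {0: 1.0}

# raw-score distribution for two questions answered correctly
# with probabilities 0.9 and 0.5
dist = pmf0
for p in [0.9, 0.5]:
    dist = add_pmfs(dist, binary_pmf(p))
\end{code}

In this example the chance of getting both questions right is 0.45, the
chance of getting exactly one right is 0.5, and the chance of getting
neither right is 0.05.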

If we know a person's efficacy, we can compute their distribution
of raw scores.  For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture.  Here's the code
that computes the mixture:

\begin{code}
# class Exam:

    def MakeRawScoreDist(self, efficacies):
        pmfs = thinkbayes.Pmf()
        for efficacy, prob in efficacies.Items():
            scores = PmfCorrect(efficacy, self.difficulties)
            pmfs.Set(scores, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return mix
\end{code}

\py{MakeRawScoreDist} takes \py{efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers.  I assume
it is Gaussian with mean 0 and standard deviation 1.5.  This
choice is mostly arbitrary.  The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly.  \index{Gaussian distribution}

\py{pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy and maps each Pmf to the fraction of test-takers at that
level.  \py{MakeMixture} takes the meta-Pmf and computes the
distribution of the mixture (see Section~\ref{mixture}).  \index{meta-Pmf}
\index{MakeMixture}
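
The idea behind the mixture can be sketched with plain dictionaries.
Because a dictionary cannot be used as a key, this sketch represents
the meta-Pmf as a list of (distribution, weight) pairs; that
representation, the function name, and the two example groups are
assumptions, not the way \py{thinkbayes} does it.

\begin{code}
def make_mixture(metapmf):
    """Weighted combination of distributions.

    metapmf: list of (pmf, weight) pairs; each pmf is a dictionary
    that maps values to probabilities, and the weights add up to 1.
    """
    mix = {}
    for pmf, weight in metapmf:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0) + weight * prob
    return mix

# two made-up groups of test-takers and their raw-score distributions
low = {0: 0.5, 1: 0.5}
high = {1: 0.2, 2: 0.8}
mix = make_mixture([(low, 0.25), (high, 0.75)])
\end{code}

Each value in the mixture gets the sum, over groups, of the fraction of
test-takers in the group times the probability of that value within
the group.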

\section{Calibration}

If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}

I assume that the distribution of difficulty is uniform with
parameters \py{center} and \py{width}.  \py{MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}

\begin{code}
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{code}

By trying out a few combinations, I found that
\py{center=-0.05} and \py{width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}

So, assuming that the distribution of difficulty is uniform,
its range is approximately
\py{-1.85} to \py{1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}

The following table shows the range of \py{ProbCorrect} for
test-takers at different levels of efficacy:

\begin{tabular}{|r|r|r|r|}
\hline
           & \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy   & -1.85   &   -0.05   &      1.75  \\
\hline
3.00 &  0.99 &  0.95 &  0.78   \\
1.50 &  0.97 &  0.82 &  0.44   \\
0.00 &  0.86 &  0.51 &  0.15   \\
-1.50 &  0.59 &  0.19 &  0.04   \\
-3.00 &  0.24 &  0.05 &  0.01   \\
\hline
\end{tabular}

Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of correctly answering the easiest
questions on the exam, and a 78\% chance of correctly answering the
hardest.  On the other end of the range, someone two standard
deviations below the mean has only a 24\% chance of correctly
answering the easiest questions.
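
The entries in the table can be reproduced with a few lines of Python.
This sketch restates \py{ProbCorrect} under a lower-case name chosen
here so that it runs on its own:

\begin{code}
import math

def prob_correct(efficacy, difficulty, a=1):
    """Logistic function of the difference between efficacy and difficulty."""
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))

# easiest, middle, and hardest questions in the calibrated range
difficulties = [-1.85, -0.05, 1.75]

for efficacy in [3.0, 1.5, 0.0, -1.5, -3.0]:
    row = [prob_correct(efficacy, diff) for diff in difficulties]
    print(f'{efficacy:5.2f}', ' '.join(f'{p:5.2f}' for p in row))
\end{code}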

\section{Posterior distribution of efficacy}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}

Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob.  Here is a version of the
Sat class that uses the new model:

\begin{code}
class Sat2(thinkbayes.Suite):

    def __init__(self, exam, score):
        self.exam = exam
        self.score = score

        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)

        # update based on an exam score
        self.Update(score)
\end{code}

\verb"Update" invokes
\verb"Likelihood", which computes the likelihood of a given test score
for a hypothetical level of efficacy.

\begin{code}
    def Likelihood(self, data, hypo):
        efficacy = hypo
        score = data
        raw = self.exam.Reverse(score)

        pmf = self.exam.PmfCorrect(efficacy)
        like = pmf.Prob(raw)
        return like
\end{code}

\py{pmf} is the distribution of raw scores for a test-taker
with the given efficacy; \py{like} is the probability of
the observed score.

Figure~\ref{fig.satposterior2} shows the posterior distributions
of efficacy for Alice and Bob.  As expected, the location
of Alice's distribution is farther to the right, but again there
is some overlap.

Using \py{TopLevel} again, we compare $A$, the
hypothesis that Alice's efficacy is higher, and $B$, the
hypothesis that Bob's is higher.  The likelihood ratio is
3.4, a bit smaller than what we got from the simple model (3.8).
So this model indicates that the data are evidence in favor
of $A$, but a little weaker than the previous estimate.

If our prior belief is that $A$ and $B$ are equally likely,
then in light of this evidence we would give $A$ a posterior
probability of 77\%, leaving a 23\% chance that Bob's efficacy
is higher.

\section{Predictive distribution}

The analysis we have done so far generates estimates for
Alice and Bob's efficacy, but since efficacy is not directly
observable, it is hard to validate the results.
\index{predictive distribution}

To give the model predictive power, we can use it to answer
a related question: ``If Alice and Bob take the math SAT
again, what is the chance that Alice will do better again?''

We'll answer this question in two steps:

\begin{itemize}

\item We'll use the posterior distribution of efficacy to
generate a predictive distribution of raw score for each test-taker.

\item We'll compare the two predictive distributions to compute
the probability that Alice gets a higher score again.

\end{itemize}

We already have most of the code we need.  To compute
the predictive distributions, we can use \verb"MakeRawScoreDist" again:

\begin{code}
    exam = Exam()
    # Sat2 holds the posterior distribution of efficacy,
    # which is what MakeRawScoreDist expects
    a_sat = Sat2(exam, 780)
    b_sat = Sat2(exam, 740)

    a_pred = exam.MakeRawScoreDist(a_sat)
    b_pred = exam.MakeRawScoreDist(b_sat)
\end{code}

Then we can find the probability that Alice does better on the second
test, Bob does better, or they tie:

\begin{code}
    a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
    b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
    c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{code}

The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.

Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test.  The posterior odds are about 3:1
that Alice's efficacy is higher, but only about 2:1 that Alice will do
better on the next exam.
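
Using the rounded probabilities reported above, the two sets of odds
work out to
\[
\frac{0.77}{0.23} \approx 3.3
\qquad \mbox{and} \qquad
\frac{0.63}{0.37} \approx 1.7
\]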

\section{Discussion}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of \py{p_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}

We started this chapter with the question,
``How strong is the evidence that Alice is better prepared
than Bob?''  On the face of it, that sounds like we want to
test two hypotheses: either Alice is more prepared or Bob is.

But in order to compute likelihoods for these hypotheses, we
have to solve an estimation problem.  For each test-taker
we have to find the posterior distribution of either
\verb"p_correct" or \verb"efficacy".

Values like this are called {\bf nuisance parameters} because
we don't care what they are, but we have
to estimate them to answer the question we care about.
\index{nuisance parameter}

One way to visualize the analysis we did in this chapter is
to plot the space of these parameters.  \verb"thinkbayes.MakeJoint"
takes two Pmfs and returns a \py{Joint} object that maps
each possible pair of values to its probability.

\begin{code}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{code}

This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}

Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob.  The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob.  To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.

In \py{TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line.  For the
cells that fall on the line, we add up the total mass and split it
between $A$ and $B$.

The process we used in this chapter---estimating nuisance
parameters in order to evaluate the likelihood of competing
hypotheses---is a common Bayesian approach to problems like this.

\chapter{Simulation}

In this chapter I describe my solution to a problem posed
by a patient with a kidney tumor.  I think the problem is
important and relevant to patients with these tumors
and doctors treating them.

And I think the solution is interesting because, although it
is a Bayesian approach to the problem, the use of Bayes's theorem
is implicit.  I present the solution and my code; at the end
of the chapter I will explain the Bayesian part.

If you want more technical detail than I present here, you can
read my paper on this work at \url{http://arxiv.org/abs/1203.6890}.

\section{The Kidney Tumor problem}

\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader of, and occasional contributor to, the online
statistics forum at \url{http://reddit.com/r/statistics}.  In November
2011, I read the following message:

\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
  cancer formed before I retired from the military. ... Given the
  dates of retirement and detection is it possible to determine when
  there was a 50/50 chance that I developed the disease? Is it
  possible to determine the probability on the retirement date?  My
  tumor was 15.5 cm x 15 cm at detection. Grade II.''
\end{quote}

I contacted the author of the message and got more information; I learned
that veterans get different benefits if it is ``more likely than not''
that a tumor formed while they were in military service (among other
considerations).

Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated.  As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times.  Several papers have reported these
growth rates.

I collected data from a paper by Zhang et al\footnote{Zhang et al,
  Distribution of Renal Tumor Growth Rates Determined by Using Serial
  Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
  137--144.}.  I contacted the authors to see if I could get raw data,
but they refused on grounds of medical privacy.  Nevertheless, I was
able to extract the data I needed by printing one of their graphs and
measuring it with a ruler.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}

They report growth rates in reciprocal doubling time (RDT),
which is in units of doublings per year.  So a tumor with $RDT=1$
doubles in volume each year; with $RDT=2$ it quadruples in the same
time, and with $RDT=-1$, it halves.  Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}
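
The relationship between RDT and growth can be sketched in a few lines.
Both function names are made up for this illustration, and the second
function uses a 365-day year.

\begin{code}
def volume_factor(rdt, years=1.0):
    """Factor by which volume changes after the given number of years."""
    return 2 ** (rdt * years)

def rdt_from_doubling_time(days):
    """Convert a volume doubling time in days to doublings per year."""
    return 365 / days
\end{code}

With these definitions, an RDT of 1 gives a factor of 2 after one year,
an RDT of 2 gives a factor of 4, and an RDT of $-1$ gives a factor of
one half.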

The squares are the data points from the paper; the line is a model I
fit to the data.  The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}



\section{A simple model}

It is usually a good idea to start with a simple model before
trying something more challenging.  Sometimes the simple model is
sufficient for the problem at hand, and if not, you can use it
to validate the more complex model.
\index{modeling}

For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied by
eight.

I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median
rate, how big would it have been at the date of discharge?''

The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times as long.

\begin{code}
    # time between discharge and diagnosis, in days
    interval = 3291.0

    # doubling time in linear measure is doubling time in volume * 3
    dt = 811.0 * 3

    # number of doublings since discharge
    doublings = interval / dt

    # how big was the tumor at time of discharge (diameter in cm)
    d1 = 15.5
    d0 = d1 / 2.0 ** doublings
\end{code}

You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}.  For more information
see Section~\ref{download}.

The result, \py{d0}, is about 6 cm.  So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate.  Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.
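
Written as an equation, with $d_1$ the diameter at diagnosis, $t$ the
interval since discharge, and $T$ the linear doubling time (symbols
introduced here for convenience), the calculation is
\[
d_0 = \frac{d_1}{2^{t/T}} = \frac{15.5}{2^{3291/2433}}
\approx \frac{15.5}{2.55} \approx 6.1 \mbox{ cm}
\]
where $T = 3 \times 811 = 2433$ days.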

In addition, I computed the growth rate that would be implied
if this tumor had formed after the date of discharge.  If we
assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:

\begin{code}
    from math import log2

    # assume an initial linear measure of 0.1 cm
    d0 = 0.1
    d1 = 15.5

    # how many doublings would it take to get from d0 to d1
    doublings = log2(d1 / d0)

    # what linear doubling time does that imply?
    dt = interval / doublings

    # compute the volumetric doubling time and RDT
    vdt = dt / 3
    rdt = 365 / vdt
\end{code}

\py{dt} is linear doubling time, so \py{vdt} is volumetric
doubling time, and \py{rdt} is reciprocal doubling
time.

The number of doublings, in linear measure, is 7.3, which implies
an RDT of 2.4.  In the data from Zhang et al, only 20\% of tumors
grew this fast during a period of observation.  So again,
I concluded that it is ``more likely than not'' that the tumor
formed prior to the date of discharge.

These calculations are sufficient to answer the question as
posed, and on behalf of my correspondent, I wrote a letter explaining
my conclusions to the Veterans Benefits Administration.
\index{Veterans Benefits Administration}

Later I told a friend, who is an oncologist, about my results.  He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors.  He suggested that the
results might be interesting to researchers and doctors.

But in order to make them useful, I wanted a more general model
of the relationship between age and size.


\section{A more general model}

Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before
any given date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}

To find it, I run simulations of tumor growth to get the
distribution of size conditioned on age.  Then we can use
a Bayesian approach to get the
distribution of age conditioned on size.
\index{conditional distribution}

The simulation starts with a small tumor and runs these steps:

8756
\begin{enumerate}
8757

8758
\item Choose a growth rate from the distribution of RDT.
8759

8760
\item Compute the size of the tumor at the end of an interval.
8761

8762
\item Record the size of the tumor at each interval.
8763

8764
\item Repeat until the tumor exceeds the maximum relevant size.
8765

8766
\end{enumerate}
8767

8768
For the initial size I chose 0.3 cm, because carcinomas smaller than
8769
that are less likely to be invasive and less likely to have the blood
8770
supply needed for rapid growth (see
8771
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
8772
\index{carcinoma}
8773

8774
I chose an interval of 245 days (about 8 months) because that is the
8775
median time between measurements in the data source.
8776

8777
For the maximum size I chose 20 cm.  In the data source, the range of
8778
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond
8779
the observed range at each end, but not by far, and not in a way
8780
likely to have a strong effect on the results.
8781

8782
\begin{figure}
8783
% kidney.py
8784
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
8785
\caption{Simulations of tumor growth, size vs. time.}
8786
\label{fig.kidney4}
8787
\end{figure}
8788

8789
The simulation is based on one big simplification:
8790
the growth rate is chosen independently during each interval,
8791
so it does not depend on age, size, or growth rate during
8792
previous intervals.
8793
\index{independence}
8794

8795
In Section~\ref{serial} I review these assumptions and
8796
consider more detailed models.  But first let's look at some
8797
examples.
8798

8799
Figure~\ref{fig.kidney4} shows
8800
the size of simulated tumors as a function of
8801
age.  The dashed line at 10 cm shows the range of ages for tumors at
8802
that size: the fastest-growing tumor gets there in 8 years; the
8803
slowest takes more than 35.
8804

8805
I am presenting results in terms of linear measurements, but the
8806
calculations are in terms of volume.  To convert from one to the
8807
other, again, I use the volume of a sphere with the given
8808
diameter.
8809
\index{volume}
8810
\index{sphere}
8811

8812

8813
\section{Implementation}
8814

8815
Here is the kernel of the simulation:
8816
\index{simulation}
8817

\begin{code}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0

    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break

    return seq
\end{code}
8831

8832
\verb"rdt_seq" is an iterator that yields
8833
random values from the CDF of growth rate.
8834
\py{v0} is the initial volume in mL.  \py{interval} is the time step
8835
in years.  \py{vmax} is the final volume corresponding to a linear
8836
measurement of 20 cm.
8837
\index{iterator}
8838

8839
\py{Volume} converts from linear measurement in cm to volume
8840
in mL, based on the simplification that the tumor is a sphere:
8841

\begin{code}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{code}
8846

8847
\py{ExtendSequence} computes the volume of the tumor at the
8848
end of the interval.
8849

\begin{code}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)

    return final, new_seq
\end{code}
8860

8861
\py{age} is the age of the tumor at the end of the interval.
8862
\py{seq} is a tuple that contains the volumes so far.  \py{rdt} is
8863
the growth rate during the interval, in doublings per year.
8864
\py{interval} is the size of the time step in years.
8865

8866
The return values are \py{final}, the volume of the
8867
tumor at the end of the interval, and \verb"new_seq", a new
8868
tuple containing the volumes in \py{seq} plus the new volume
8869
\py{final}.
8870

8871
\py{Cache.Add} records the age and size of each tumor at the end
8872
of each interval, as explained in the next section.
8873
\index{cache}
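
To see the growth loop in isolation, here is a minimal sketch in
Python 3 that leaves out the cache.  It is not the code from
\py{kidney.py}: \verb"make_sequence" is a simplified stand-in for
\py{MakeSequence}, and \verb"constant_rdt" is a hypothetical generator
that yields the same growth rate in every interval, which makes the
result easy to check by hand.

\begin{code}
import math

def volume(diameter):
    """Volume in mL of a sphere with the given diameter in cm."""
    return 4 * math.pi / 3 * (diameter / 2) ** 3

def make_sequence(rdt_seq, v0=0.01, interval=0.67, vmax=volume(20.0)):
    """Grows a tumor from v0 until its volume exceeds vmax.

    rdt_seq: iterator of growth rates in doublings per year
    returns: list of volumes in mL, one per interval
    """
    seq = [v0]
    for rdt in rdt_seq:
        doublings = rdt * interval
        seq.append(seq[-1] * 2**doublings)
        if seq[-1] > vmax:
            break
    return seq

def constant_rdt(rdt):
    """Hypothetical generator: the same growth rate every interval."""
    while True:
        yield rdt

seq = make_sequence(constant_rdt(1.0))
intervals = len(seq) - 1
print(intervals, intervals * 0.67)
\end{code}

With a constant rate of one doubling per year, the tumor needs 28
intervals, or a little less than 19 years, to pass 20 cm.  In the real
simulation the rate changes from one interval to the next, so the
time to reach a given size varies from one simulated tumor to another.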
8874

8875

8876
\section{Caching the joint distribution}
8877

8878
\begin{figure}
8879
% kidney.py
8880
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
8881
\caption{Joint distribution of age and tumor size.}
8882
\label{fig.kidney8}
8883
\end{figure}
8884

8885
Here's how the cache works.
8886

\begin{code}
class Cache(object):

    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{code}
8893

8894
\py{joint} is a joint Pmf that records the
8895
frequency of each age-size pair, so it approximates the
8896
joint distribution of age and size.
8897
\index{joint distribution}
8898

8899
At the end of each simulated interval, \py{ExtendSequence} calls
8900
\py{Add}:
8901

\begin{code}
# class Cache

    def Add(self, age, seq, rdt):
        # rdt is passed by ExtendSequence but is not used here
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{code}
8911

8912
Again, \py{age} is the age of the tumor, and \py{seq} is the
8913
sequence of volumes so far.
8914

8915
\begin{figure}
8916
% kidney.py
8917
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
8918
\caption{Distributions of age, conditioned on size.}
8919
\label{fig.kidney6}
8920
\end{figure}
8921

8922
Before adding the new data to the joint distribution, we use {\tt
8923
  Diameter} to convert from volume to diameter in centimeters:
8924

\begin{code}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{code}
8929

8930
And
8931
\py{CmToBucket} to convert from centimeters to a discrete bucket
8932
number:
8933

\begin{code}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{code}
8938

8939
The buckets are equally spaced on a log scale.  Using \py{factor=10}
8940
yields a reasonable number of buckets; for example,
8941
1 cm maps to bucket 0 and 10 cm maps to bucket 23.
8942
\index{log scale}
8943
\index{bucket}
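
As a quick check on these conversions, the following Python 3 sketch
restates them with lowercase names and maps a few sizes to buckets.
The inverse function \verb"bucket_to_cm" is a hypothetical helper
added for the example; it does not appear in the code above.

\begin{code}
import math

def volume(diameter):
    """Volume in mL of a sphere with the given diameter in cm."""
    return 4 * math.pi / 3 * (diameter / 2) ** 3

def diameter(vol):
    """Diameter in cm of a sphere with the given volume in mL."""
    return 2 * (3 * vol / (4 * math.pi)) ** (1 / 3)

def cm_to_bucket(x, factor=10):
    """Maps a size in cm to a position on a log scale."""
    return factor * math.log(x)

def bucket_to_cm(y, factor=10):
    """Hypothetical inverse of cm_to_bucket."""
    return math.exp(y / factor)

sizes = [1, 6, 10, 15.5]
buckets = [round(cm_to_bucket(cm)) for cm in sizes]
centers = [bucket_to_cm(b) for b in buckets]
print(buckets)
print(centers)
\end{code}

With \py{factor=10}, adjacent buckets differ in size by a factor of
$e^{0.1}$, or about 10.5\%, so rounding to the nearest bucket changes
the recorded size by at most about 5\%.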
8944

8945
After running the simulations, we can plot the joint distribution
8946
as a pseudocolor plot, where each cell represents the number of
8947
tumors observed at a given size-age pair.
8948
Figure~\ref{fig.kidney8} shows the joint distribution after 1000
8949
simulations.
8950
\index{pseudocolor plot}
8951

8952

8953

8954
\section{Conditional distributions}
8955

8956
\begin{figure}
8957
% kidney.py
8958
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
8959
\caption{Percentiles of tumor age as a function of size.}
8960
\label{fig.kidney7}
8961
\end{figure}
8962

8963
By taking a vertical slice from the joint distribution, we can get the
8964
distribution of sizes for any given age.  By taking a horizontal
8965
slice, we can get the distribution of ages conditioned on size.
8966
\index{conditional distribution}
8967

8968
Here's the code that reads the joint distribution and builds
8969
the conditional distribution for a given size.
8970
\index{joint distribution}
8971

\begin{code}
# class Cache

    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{code}
8980

8981
\verb"bucket" is the integer bucket number corresponding to
8982
tumor size.  \py{Joint.Conditional} computes the
8983
PMF of age conditioned on \py{bucket}.
8984
The result is the CDF of age conditioned on \py{bucket}.
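
To make the slicing operation concrete, here is a small sketch in
Python 3 that does not depend on \py{thinkbayes}.  It uses a
\py{Counter} as a stand-in for the joint Pmf, and the age-bucket pairs
are made up for the example; \verb"conditional_pmf" is a hypothetical
function that plays the role of \py{Joint.Conditional}.

\begin{code}
from collections import Counter

# made-up (age, bucket) pairs standing in for simulation output
pairs = [(1, 0), (2, 0), (2, 5), (3, 5), (3, 5), (4, 5)]

joint = Counter()
for age, bucket in pairs:
    joint[age, bucket] += 1

def conditional_pmf(joint, bucket):
    """Distribution of age among the pairs with the given bucket."""
    counts = {age: n for (age, b), n in joint.items() if b == bucket}
    total = sum(counts.values())
    return {age: n / total for age, n in sorted(counts.items())}

def make_cdf(pmf):
    """Cumulative probabilities, in order of increasing age."""
    cdf, running = {}, 0.0
    for age, p in pmf.items():
        running += p
        cdf[age] = running
    return cdf

pmf = conditional_pmf(joint, 5)
cdf = make_cdf(pmf)
print(pmf)
print(cdf)
\end{code}

Selecting the pairs that match the bucket and renormalizing is all
that conditioning requires once the joint distribution is available.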
8985

8986
Figure~\ref{fig.kidney6} shows several of these CDFs, for
8987
a range of sizes.  To summarize these distributions, we can
8988
compute percentiles as a function of size.
8989
\index{percentile}
8990

\begin{code}
    percentiles = [95, 75, 50, 25, 5]

    for bucket in cache.GetBuckets():
        cdf = cache.ConditionalCdf(bucket)
        ps = [cdf.Percentile(p) for p in percentiles]
\end{code}
8998

8999
Figure~\ref{fig.kidney7} shows these percentiles for each
9000
size bucket.  The data points are computed from the estimated
9001
joint distribution.  In the model, size and time are discrete,
9002
which contributes numerical errors, so I also show a least
9003
squares fit for each sequence of percentiles.
9004
\index{least squares fit}
9005

9006

9007
\section{Serial Correlation}
9008
\label{serial}
9009

9010
The results so far are based on a number of modeling decisions;
9011
let's review them and consider which ones are the most
9012
likely sources of error:
9013
\index{modeling error}
9014

9015
\begin{itemize}
9016

9017
\item To convert from linear measure to volume, we assume that
9018
  tumors are approximately spherical.  This assumption is probably
9019
  fine for tumors up to a few centimeters, but not for very
9020
  large tumors.
9021
  \index{sphere}
9022

\item The distribution of growth rates in the simulations is based on
  a continuous model we chose to fit the data reported by Zhang et al,
  which is based on 53 patients.  The fit is only approximate and, more
  importantly, a larger sample would yield a
  different distribution.
9028
  \index{growth rate}
9029

9030
\item The growth model does not take into account tumor subtype or
9031
  grade; this assumption is consistent with the conclusion of Zhang et al:
9032
  ``Growth rates in renal tumors of different sizes, subtypes and
9033
  grades represent a wide range and overlap substantially.''
9034
  But with a larger sample, a difference might become apparent.
9035
  \index{tumor type}
9036

9037
\item The distribution of growth rate does not depend on the size of
9038
  the tumor.  This assumption would not be realistic for very
9039
  small and very large tumors, whose growth is limited by blood supply.
9040

9041
  But tumors observed by Zhang et al ranged from 1 to 12 cm, and they
9042
  found no statistically significant relationship between
9043
  size and growth rate.  So if there is a relationship, it is
9044
  likely to be weak, at least in this size range.
9045

9046
\item In the simulations, growth rate during each interval is
9047
  independent of previous growth rates.  In reality it is plausible
9048
  that tumors that have grown quickly in the past are more likely
9049
  to grow quickly.  In other words, there is probably
9050
  a serial correlation in growth rate.
9051
  \index{serial correlation}
9052

9053
\end{itemize}
9054

9055
Of these, the first and last seem the most problematic.  I'll
9056
investigate serial correlation first, then come back to
9057
spherical geometry.
9058

9059
To simulate correlated growth, I wrote a generator\footnote{If you are
9060
  not familiar with Python generators, see
9061
  \url{http://wiki.python.org/moin/Generators}.} that yields a
9062
correlated series from a given Cdf.  Here's how the algorithm works:
9063
\index{generator}
9064

9065
\begin{enumerate}
9066

9067
\item Generate correlated values from a Gaussian distribution.
9068
  This is easy to do because we can compute the distribution
9069
  of the next value conditioned on the previous value.
9070
  \index{Gaussian distribution}
9071

9072
\item Transform each value to its cumulative probability using
9073
  the Gaussian CDF.
9074
  \index{cumulative probability}
9075

9076
\item Transform each cumulative probability to the corresponding value
9077
  using the given Cdf.
9078

9079
\end{enumerate}
9080

9081
Here's what that looks like in code:
9082

\begin{code}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{code}
9093

9094
\py{cdf} is the desired Cdf; \py{rho} is the desired correlation.
9095
The values of \py{x} are Gaussian; \py{Transform} converts them
9096
to the desired distribution.
9097

9098
The first value of \py{x} is Gaussian with mean 0 and standard
9099
deviation 1.  For subsequent values, the mean and standard deviation
9100
depend on the previous value.  Given the previous \py{x}, the mean of the
9101
next value is \py{x * rho}, and the variance is \py{1 - rho**2}.
9102
\index{correlated random value}
9103

9104
\py{Transform} maps from each
9105
Gaussian value, \py{x}, to a value from the given Cdf, \py{y}.
9106

\begin{code}
    def Transform(x):
        p = thinkbayes.GaussianCdf(x)
        y = cdf.Value(p)
        return y
\end{code}
9113

9114
\py{GaussianCdf} computes the CDF of the standard Gaussian
9115
distribution at \py{x}, returning a cumulative probability.
9116
\py{Cdf.Value} maps from a cumulative probability to the
9117
corresponding value in \py{cdf}.
9118

9119
Depending on the shape of \py{cdf}, information can
9120
be lost in transformation, so the actual correlation might be
9121
lower than \py{rho}.  For example, when I generate
9122
10000 values from the distribution of growth rates with
9123
\py{rho=0.4}, the actual correlation is 0.37.
9124
But since we are guessing at the right correlation anyway,
9125
that's close enough.
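
If you want to experiment with this effect, the following sketch
reproduces the algorithm in Python 3 using only the standard library.
It makes two assumptions that are not part of \py{kidney.py}: the
target distribution is exponential, chosen because its inverse CDF is
easy to write, and \py{NormalDist} from the \py{statistics} module
stands in for \py{thinkbayes.GaussianCdf}.  The function names are
hypothetical.

\begin{code}
import itertools
import math
import random
from statistics import NormalDist

def correlated_generator(inverse_cdf, rho, rng):
    """Yields values from a target distribution with serial correlation."""
    std_normal = NormalDist()

    def transform(x):
        p = std_normal.cdf(x)      # Gaussian value -> cumulative probability
        return inverse_cdf(p)      # cumulative probability -> target value

    x = rng.gauss(0, 1)
    yield transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = rng.gauss(x * rho, sigma)
        yield transform(x)

def exponential_inverse_cdf(p, lam=1.0):
    """Inverse CDF of an exponential distribution (assumed target)."""
    p = min(p, 1 - 1e-12)          # guard against log(0)
    return -math.log(1 - p) / lam

def serial_correlation(xs):
    """Pearson correlation between successive values."""
    a, b = xs[:-1], xs[1:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(3)
gen = correlated_generator(exponential_inverse_cdf, 0.4, rng)
values = list(itertools.islice(gen, 10000))
actual = serial_correlation(values)
print(actual)
\end{code}

The measured correlation depends on the target distribution and on
the random seed, so it will not match the value reported above for the
distribution of growth rates.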
9126

9127
Remember that \py{MakeSequence} takes an iterator as an argument.
9128
That interface allows it to work with different generators:
9129
\index{generator}
9130

\begin{code}
    iterator = UncorrelatedGenerator(cdf)
    seq1 = MakeSequence(iterator)

    iterator = CorrelatedGenerator(cdf, rho)
    seq2 = MakeSequence(iterator)
\end{code}
9138

9139
In this example, \py{seq1} and \py{seq2} are
9140
drawn from the same distribution, but the values in \py{seq1}
9141
are uncorrelated and the values in \py{seq2} are correlated
9142
with a coefficient of approximately \py{rho}.
9143
\index{serial correlation}
9144

9145
Now we can see what effect serial correlation has on the results;
9146
the following table shows percentiles of age for a 6 cm tumor,
9147
using the uncorrelated generator and a correlated generator
9148
with target $\rho = 0.4$.
9149
\index{percentile}
9150

9151
\begin{table}
9152
\input{tables/kidney_table2}
9153
\caption{Percentiles of tumor age conditioned on size.}
9154
\end{table}
9155

9156
Correlation makes the fastest growing tumors faster and the slowest
9157
slower, so the range of ages is wider.  The difference is modest for
9158
low percentiles, but for the 95th percentile it is more than 6 years.
9159
To compute these percentiles precisely, we would need a better
9160
estimate of the actual serial correlation.
9161

9162
However, this model is sufficient to answer the question
9163
we started with: given a tumor with a linear dimension of
9164
15.5 cm, what is the probability that it formed more than
9165
8 years ago?
9166

9167
Here's the code:
9168

\begin{code}
# class Cache

    def ProbOlder(self, cm, age):
        # round to an integer bucket, as in Add
        bucket = round(CmToBucket(cm))
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{code}
9178

9179
\py{cm} is the size of the tumor; \py{age} is the age threshold
9180
in years.  \py{ProbOlder} converts size to a bucket number,
9181
gets the Cdf of age conditioned on bucket, and computes the
9182
probability that age exceeds the given value.
9183

9184
With no serial correlation, the probability that a
9185
15.5 cm tumor is older than 8 years is 0.999, or almost certain.
9186
With correlation 0.4, faster-growing tumors are more likely, but
9187
the probability is still 0.995.  Even with correlation 0.8, the
9188
probability is 0.978.
9189

9190
Another likely source of error is the assumption that tumors are
9191
approximately spherical.  For a tumor with linear dimensions 15.5 x 15
9192
cm, this assumption is probably not valid.  If, as seems likely, a
9193
tumor this size
9194
is relatively flat, it might have the same volume as a 6 cm sphere.
9195
With this smaller volume and correlation 0.8, the probability of age
9196
greater than 8 is still 95\%.
9197

9198
So even taking into account modeling errors, it is unlikely that such
9199
a large tumor could have formed less than 8 years prior to the date of
9200
diagnosis.
9201
\index{modeling error}
9202

9203

9204
\section{Discussion}
9205

9206
Well, we got through a whole chapter without using Bayes's theorem or
9207
the \py{Suite} class that encapsulates Bayesian updates.  What
9208
happened?
9209

9210
One way to think about Bayes's theorem is as an algorithm for
9211
inverting conditional probabilities.  Given \p{B|A}, we can compute
9212
\p{A|B}, provided we know \p{A} and \p{B}.  Of course this algorithm
9213
is only useful if, for some reason, it is easier to compute \p{B|A}
9214
than \p{A|B}.
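
For reference, the inversion can be written out as follows, using the
same $A$ and $B$ as above:

\[ \mathrm{p}(A|B) = \frac{\mathrm{p}(A)~\mathrm{p}(B|A)}{\mathrm{p}(B)} \]

Everything on the right side has to be known or computable, so the
theorem helps when $\mathrm{p}(B|A)$ is the easier of the two
conditional probabilities to compute.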
9215

9216
In this example, it is.  By running simulations, we can estimate the
9217
distribution of size conditioned on age, or \p{size|age}.  But it is
9218
harder to get the distribution of age conditioned on size, or
9219
\p{age|size}.  So this seems like a perfect opportunity to use Bayes's
9220
theorem.
9221

9222
The reason I didn't is computational efficiency.  To estimate
9223
\p{size|age} for any given size, you have to run a lot of simulations.
9224
Along the way, you end up computing \p{size|age} for a lot of sizes.
9225
In fact, you end up computing the entire joint distribution of size
9226
and age, \p{size, age}.
9227
\index{joint distribution}
9228

And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in \py{ConditionalCdf}.
9232
\index{conditional distribution}
9233

9234
So we side-stepped Bayes, but he was with us in spirit.
9235

9236

9237
\chapter{A Hierarchical Model}
9238
\label{hierarchical}
9239

9240

9241
\section{The Geiger counter problem}
9242

9243
I got the idea for the following problem from Tom Campbell-Ricketts,
9244
author of the Maximum Entropy blog at
9245
\url{http://maximum-entropy-blog.blogspot.com}.  And he got the idea
9246
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
9247
  Logic of Science}:
9248
\index{Jaynes, E.~T.}
9249
\index{Campbell-Ricketts, Tom}
9250
\index{Geiger counter problem}
9251

9252
\begin{quote}
9253
Suppose that a radioactive source emits particles toward
9254
a Geiger counter at an average rate of $r$ particles per second,
9255
but the counter only registers a fraction, $f$, of the particles
9256
that hit it.  If $f$ is 10\% and
9257
the counter registers 15 particles in a one second
9258
interval, what is the posterior distribution of $n$, the actual
9259
number of particles that hit the counter, and $r$, the average
9260
rate particles are emitted?
9261
\end{quote}
9262

9263
To get started on a problem like this, think about the chain of
9264
causation that starts with the parameters of the system and ends
9265
with the observed data:
9266
\index{causation}
9267

9268
\begin{enumerate}
9269

9270
\item The source emits particles at an average rate, $r$.
9271

9272
\item During any given second, the source emits $n$ particles
9273
toward the counter.
9274

9275
\item Out of those $n$ particles, some number, $k$, get counted.
9276

9277
\end{enumerate}
9278

9279
The probability that an atom decays is the same at any point in time,
9280
so radioactive decay is well modeled by a Poisson process.  Given $r$,
the distribution of $n$ is the Poisson distribution with parameter $r$.
9282
\index{radioactive decay}
9283
\index{Poisson process}
9284

9285
And if we assume that the probability of detection for each particle
9286
is independent of the others, the distribution of $k$ is the binomial
9287
distribution with parameters $n$ and $f$.
9288
\index{binomial distribution}
9289

9290
Given the parameters of the system, we can find the distribution of
9291
the data.  So we can solve what is called the {\bf forward problem}.
9292
\index{forward problem}
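
One way to get a feel for the forward problem is to simulate it.  The
sketch below, in Python 3 with only the standard library, draws $n$
from a Poisson distribution and then $k$ from a binomial distribution.
The function names are hypothetical, and the value $r = 150$ is an
assumption chosen for the example; the problem statement does not
give $r$.

\begin{code}
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold = math.exp(-lam)
    n, product = 0, rng.random()
    while product > threshold:
        n += 1
        product *= rng.random()
    return n

def sample_binomial(n, p, rng):
    """Counts successes in n independent trials."""
    return sum(rng.random() < p for _ in range(n))

def simulate_second(r, f, rng):
    """Returns (n, k): particles that hit the counter, particles counted."""
    n = sample_poisson(r, rng)
    k = sample_binomial(n, f, rng)
    return n, k

rng = random.Random(17)
counts = [simulate_second(150, 0.1, rng)[1] for _ in range(2000)]
mean_k = sum(counts) / len(counts)
print(mean_k)
\end{code}

On average the counter registers about $r f$ particles per second, but
any single second can differ from that by several particles, which is
why one observation of $k$ leaves substantial uncertainty about $n$
and $r$.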
9293

9294
Now we want to go the other way: given the data, we
9295
want the distribution of the parameters.  This is called
9296
the {\bf inverse problem}.  And if you can solve the forward
9297
problem, you can use Bayesian methods to solve the inverse problem.
9298
\index{inverse problem}
9299

9300

9301
\section{Start simple}
9302

9303
\begin{figure}
9304
% jaynes.py
9305
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
9306
\caption{Posterior distribution of $n$ for three values of $r$.}
9307
\label{fig.jaynes1}
9308
\end{figure}
9309

9310
Let's start with a simple version of the problem where we know
9311
the value of $r$.  We are given the value of $f$, so all we
9312
have to do is estimate $n$.
9313

9314
I define a Suite called \py{Detector} that models the behavior
9315
of the detector and estimates $n$.
9316

\begin{code}
class Detector(thinkbayes.Suite):

    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{code}
9326

9327
If the average emission rate is $r$ particles per second, the
9328
distribution of $n$ is Poisson with parameter $r$.
9329
\py{high} and \py{step} determine the upper bound for $n$
9330
and the step size between hypothetical values.
9331
\index{Poisson distribution}
9332

9333
Now we need a likelihood function:
9334
\index{likelihood}
9335

\begin{code}
# class Detector

    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f

        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{code}
9346

9347
\py{data} is the number of particles detected, and \py{hypo} is
9348
the hypothetical number of particles emitted, $n$.
9349

9350
If there are actually $n$ particles, and the probability of detecting
9351
any one of them is $f$, the probability of detecting $k$ particles is
9352
given by the binomial distribution.
9353
\index{binomial distribution}
9354

9355
That's it for the Detector.  We can try it out for a range
9356
of values of $r$:
9357

\begin{code}
    f = 0.1
    k = 15

    for r in [100, 250, 400]:
        suite = Detector(r, f, step=1)
        suite.Update(k)
        print(suite.MaximumLikelihood())
\end{code}
9367

9368
Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
9369
several given values of $r$.
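
The same update can be sketched without \py{thinkbayes}, which makes
the arithmetic explicit.  The following Python 3 code is an
illustration under stated assumptions, not the implementation in
\py{jaynes.py}: the function names are hypothetical, and the
hypotheses for $n$ are taken to be the integers from 0 to \py{high}.

\begin{code}
import math

def poisson_pmf(n, lam):
    """Poisson probability of n, computed in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 if k > n)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_n(r, f, k, high=500):
    """Posterior distribution of n for known r and f, given count k."""
    unnorm = [poisson_pmf(n, r) * binomial_pmf(k, n, f)
              for n in range(high + 1)]
    total = sum(unnorm)
    return {n: u / total for n, u in enumerate(unnorm)}

def pmf_mean(pmf):
    return sum(n * p for n, p in pmf.items())

mean_100 = pmf_mean(posterior_n(100, 0.1, 15))
mean_150 = pmf_mean(posterior_n(150, 0.1, 15))
print(mean_100, mean_150)
\end{code}

A standard property of the Poisson distribution is that the particles
that go uncounted, $n - k$, are themselves Poisson with parameter
$r (1 - f)$, so the posterior mean of $n$ should be close to
$k + r (1 - f)$.  The sketch gives a way to check that numerically.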
9370

9371

9372
\section{Make it hierarchical}
9373

9374
In the previous section, we assume $r$ is known.  Now let's
9375
relax that assumption.  I define another Suite, called \py{Emitter},
9376
that models the behavior of the emitter and estimates $r$:
9377

\begin{code}
class Emitter(thinkbayes.Suite):

    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{code}
9385

\py{rs} is a sequence of hypothetical values for $r$.  \py{detectors}
9387
is a sequence of Detector objects, one for each value of $r$.  The
9388
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
9389
that is, a Suite that contains other Suites as values.
9390
\index{meta-Suite}
9391

9392
To update the Emitter, we have to compute the likelihood of the data
9393
under each hypothetical value of $r$.  But each value of $r$ is
9394
represented by a Detector that contains a range of values for $n$.
9395

9396
To compute the likelihood of the data for a given Detector, we loop
9397
through the values of $n$ and add up the total probability of $k$.
9398
That's what \py{SuiteLikelihood} does:
9399

\begin{code}
# class Detector

    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{code}
9410

9411
Now we can write the Likelihood function for the Emitter:
9412

\begin{code}
# class Emitter

    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{code}
9421

9422
Each \py{hypo} is a Detector, so we can invoke
9423
\py{SuiteLikelihood} to get the likelihood of the data under
9424
the hypothesis.
9425

9426
After we update the Emitter, we have to update each of the
9427
Detectors, too.
9428

\begin{code}
# class Emitter

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)

        for detector in self.Values():
            detector.Update(data)
\end{code}
9438

9439
A model like this, with multiple levels of Suites, is called {\bf
9440
  hierarchical}.  \index{hierarchical model}
9441

9442

9443
\section{A little optimization}
9444

9445
You might recognize \py{SuiteLikelihood}; we saw it
9446
in Section~\ref{suitelike}.  At the time, I pointed out that
9447
we didn't really need it, because the total probability
9448
computed by \py{SuiteLikelihood} is exactly the normalizing
9449
constant computed and returned by \py{Update}.
9450
\index{normalizing constant}
9451

9452
So instead of updating the Emitter and then updating the
9453
Detectors, we can do both steps at the same time, using
9454
the result from \py{Detector.Update} as the likelihood
9455
of Emitter.
9456

9457
Here's the streamlined version of \py{Emitter.Likelihood}:
9458

\begin{code}
# class Emitter

    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{code}
9465

9466
And with this version of \py{Likelihood} we can use the
9467
default version of \py{Update}.  So this version has fewer
9468
lines of code, and it runs faster because it does not compute
9469
the normalizing constant twice.
9470
\index{optimization}
9471

9472

9473
\section{Extracting the posteriors}
9474

9475
\begin{figure}
9476
% jaynes.py
9477
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
9478
\caption{Posterior distributions of $n$ and $r$.}
9479
\label{fig.jaynes2}
9480
\end{figure}
9481

9482
After we update the Emitter, we can get the posterior distribution
9483
of $r$ by looping through the Detectors and their probabilities:
9484

\begin{code}
# class Emitter

    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{code}
9492

9493
\py{items} is a list of values of $r$ and their probabilities.
9494
The result is the Pmf of $r$.
9495

9496
To get the posterior distribution of $n$, we have to compute
9497
the mixture of the Detectors.  We can use
9498
\py{thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
9499
from each distribution to its probability.  And that's exactly
9500
what the Emitter is:
9501

\begin{code}
# class Emitter

    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{code}
9508

9509
Figure~\ref{fig.jaynes2} shows the results.  Not surprisingly, the
9510
most likely value for $n$ is 150.  Given $f$ and $n$, the expected
9511
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
9512
$k / f$, which is 150.
9513

9514
And if 150 particles are emitted in one second, the most likely value
9515
of $r$ is 150 particles per second.  So the posterior distribution of
9516
$r$ is also centered on 150.
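
To check the location of the peak, here is a sketch in Python 3 that
computes the posterior of $r$ directly, without \py{thinkbayes}.  It
assumes a uniform prior over integer values of $r$ from 1 to 350, a
range chosen for the example, and the function names are
hypothetical.  For each $r$, the likelihood of the data is the total
probability of $k$, summed over $n$.

\begin{code}
import math

def poisson_pmf(n, lam):
    """Poisson probability of n, computed in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 if k > n)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def likelihood_of_r(k, r, f, high=500):
    """Total probability of k under r, summing over n."""
    return sum(poisson_pmf(n, r) * binomial_pmf(k, n, f)
               for n in range(high + 1))

def posterior_r(k, f, rs):
    """Posterior of r under a uniform prior over rs."""
    likes = [likelihood_of_r(k, r, f) for r in rs]
    total = sum(likes)
    return {r: like / total for r, like in zip(rs, likes)}

post = posterior_r(15, 0.1, range(1, 351))
mode = max(post, key=post.get)
print(mode)
\end{code}

The sum over $n$ is also a way to confirm a known shortcut: when each
particle is counted independently with probability $f$, the count $k$
is Poisson with parameter $r f$, so \verb"likelihood_of_r" should
agree with \verb"poisson_pmf(k, r * f)" up to truncation error.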
9517

9518
The posterior distributions of $r$ and $n$ are similar;
9519
the only difference is that we are slightly less certain about $n$.
9520
In general, we can be more certain about the long-range emission rate,
9521
$r$, than about the number of particles emitted in any particular second,
9522
$n$.
9523

9524
You can download the code in this chapter from
9525
\url{http://thinkbayes.com/jaynes.py}.  For more information see
9526
Section~\ref{download}.
9527

9528

9529
\section{Discussion}
9530

9531
The Geiger counter problem demonstrates the connection between
9532
causation and hierarchical modeling.  In the example, the
9533
emission rate $r$ has a causal effect on the number of particles,
9534
$n$, which has a causal effect on the particle count, $k$.
9535
\index{Geiger counter problem}
9536
\index{causation}
9537

9538
The hierarchical model reflects the structure of the
9539
system, with causes at the top and effects at the bottom.
9540
\index{hierarchical model}
9541

9542
\begin{enumerate}
9543

9544
\item At the top level, we start with a range of hypothetical
9545
values for $r$.
9546

9547
\item For each value of $r$, we have a range of values for $n$,
9548
and the prior distribution of $n$ depends on $r$.
9549

9550
\item When we update the model, we go bottom-up.  We compute
9551
a posterior distribution of $n$ for each value of $r$, then
9552
compute the posterior distribution of $r$.
9553

9554
\end{enumerate}
9555

9556
So causal information flows down the hierarchy, and inference flows
9557
up.
9558

9559

9560
\section{Exercises}
9561

9562
\begin{exercise}
9563
This exercise is also inspired by an example in Jaynes, {\em
9564
Probability Theory}.
9565

9566
Suppose you buy a mosquito trap that is supposed to reduce the
9567
population of mosquitoes near your house.  Each
9568
week, you empty the trap and count the number of mosquitoes
9569
captured.  After the first week, you count 30 mosquitoes.
9570
After the second week, you count 20 mosquitoes.  Estimate the
9571
percentage change in the number of mosquitoes in your yard.
9572

9573
To answer this question, you have to make some modeling
9574
decisions.  Here are some suggestions:
9575

9576
\begin{itemize}
9577

9578
\item Suppose that each week a large number of mosquitoes, $N$, is bred
9579
in a wetland near your home.
9580

9581
\item During the week, some fraction of
9582
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
9583
are caught in the trap.
9584

9585
\item Your solution should take into account your prior belief
9586
about how much $N$ is likely to change from one week to the next.
9587
You can do that by adding a level to the hierarchy to
9588
model the percent change in $N$.
9589

9590
\end{itemize}
9591

9592
\end{exercise}
9593

9594

9595
\chapter{Dealing with Dimensions}
9596
\label{species}
9597

9598
\section{Belly button bacteria}
9599

9600
Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
9601
science project with the goal of identifying bacterial species that
9602
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
9603
The project might seem whimsical, but it is part of an increasing
9604
interest in the human microbiome, the set of microorganisms that live
9605
on human skin and parts of the body.
9606
\index{biodiversity}
9607
\index{belly button}
9608
\index{bacteria}
9609
\index{microbiome}
9610

9611
In their pilot study, BBB2 researchers collected swabs from the navels
9612
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
9613
fragments of 16S rDNA, then identified the species or genus the
9614
fragments came from.  Each identified fragment is called a ``read.''
9615
\index{navel}
9616
\index{rDNA}
9617
\index{pyrosequencing}
9618

9619
We can use these data to answer several related questions:
9620

9621
\begin{itemize}
9622

9623
\item Based on the number of species observed, can we estimate
9624
  the total number of species in the environment?
9625
\index{species}
9626

9627
\item Can we estimate the prevalence of each species; that is, the
9628
  fraction of the total population belonging to each species?
9629
\index{prevalence}
9630

9631
\item If we are planning to collect additional samples, can we predict
9632
  how many new species we are likely to discover?
9633

9634
\item How many additional reads are needed to increase the
9635
  fraction of observed species to a given threshold?
9636

9637
\end{itemize}
9638

9639
These questions make up what is called the {\bf Unseen Species problem}.
9640
\index{Unseen Species problem}
9641

9642

9643
\section{Lions and tigers and bears}
9644

9645
I'll start with a simplified version of the problem where we know that
9646
there are exactly three species.  Let's call them lions, tigers and
9647
bears.  Suppose we visit a wild animal preserve and see 3 lions, 2
9648
tigers and one bear.
9649
\index{lions and tigers and bears}
9650

9651
If we have an equal chance of observing any animal in the preserve,
9652
the number of each species we see is governed by the multinomial
9653
distribution.  If the prevalence of lions and tigers and bears is
9654
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
9655
seeing 3 lions, 2 tigers and one bear is proportional to
9656
\index{multinomial distribution}
9657

9658
\begin{code}
9659
p_lion**3 * p_tiger**2 * p_bear**1
9660
\end{code}
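
For reference, here is a worked version of the full multinomial
probability for these data, including the multinomial coefficient
(the subscripts are just labels for the three prevalences):
%
\[ \frac{6!}{3! \, 2! \, 1!} \, p_{\mathrm{lion}}^3 \, p_{\mathrm{tiger}}^2 \, p_{\mathrm{bear}}
   = 60 \, p_{\mathrm{lion}}^3 \, p_{\mathrm{tiger}}^2 \, p_{\mathrm{bear}} \]
%
The factor of 60 depends only on the counts, not on the prevalences,
so dropping it changes the likelihood only by a constant.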
9661

9662
An approach that is tempting, but not correct, is to use beta
9663
distributions, as in Section~\ref{beta}, to describe the prevalence of
9664
each species separately.  For example, we saw 3 lions and 3 non-lions;
9665
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
9666
distribution of \verb"p_lion" is:
9667
\index{beta distribution}
9668

9669
\begin{code}
9670
    beta = thinkbayes.Beta()
    beta.Update((3, 3))
    print(beta.MaximumLikelihood())
9673
\end{code}
9674

9675
The maximum likelihood estimate for \verb"p_lion" is the observed
9676
rate, 50\%.  Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
9677
are 33\% and 17\%.
9678
\index{maximum likelihood}
9679

9680
But there are two problems:
9681

9682
\begin{enumerate}
9683

9684
\item We have implicitly used a prior for each species that is uniform
9685
  from 0 to 1, but since we know that there are three species, that
9686
  prior is not correct.  The right prior should have a mean of 1/3,
9687
  and the probability density should be zero at a
  prevalence of 100\%.
9689

9690
\item The distributions for each species are not independent, because
9691
  the prevalences have to add up to 1.  To capture this dependence, we
9692
  need a joint distribution for the three prevalences.
9693
\index{independence}
9694
\index{joint distribution}
9695

9696
\end{enumerate}
9697

9698
We can use a Dirichlet distribution to solve both of these problems
9699
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}).  In
9700
the same way we used the beta distribution to describe the
9701
distribution of bias for a coin, we can use a Dirichlet
9702
distribution to describe the joint distribution of \verb"p_lion",
9703
\verb"p_tiger" and \verb"p_bear".
9704
\index{beta distribution}
9705
\index{Dirichlet distribution}
9706

9707
The Dirichlet distribution is the multi-dimensional generalization
9708
of the beta distribution.  Instead of two possible outcomes, like
9709
heads and tails, the Dirichlet distribution handles any number of
9710
outcomes: in this example, three species.
9711

9712
If there are \py{n} outcomes, the Dirichlet distribution is
9713
described by \py{n} parameters, written $\alpha_1$ through $\alpha_n$.
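
For reference, the standard form of the Dirichlet density, which
the code in this chapter never needs to evaluate directly, is
%
\[ f(p_1, \ldots, p_n) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_n)}
   \, p_1^{\alpha_1 - 1} \cdots p_n^{\alpha_n - 1} \]
%
where $\alpha_0 = \alpha_1 + \cdots + \alpha_n$ and the prevalences
are nonnegative and add up to 1.  When every $\alpha_i$ is 1, the
density is constant, so all valid combinations of prevalences are
equally likely.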
9714

9715
Here's the definition, from \py{thinkbayes.py}, of a class that
9716
represents a Dirichlet distribution:
9717
\index{numpy}
9718

9719
\begin{code}
9720
class Dirichlet(object):
9721

9722
    def __init__(self, n):
9723
        self.n = n
9724
        self.params = numpy.ones(n, dtype=int)
9725
\end{code}
9726

9727
\py{n} is the number of dimensions; initially the parameters
9728
are all 1.  I use a \py{numpy} array to store the parameters
9729
so I can take advantage of array operations.
9730

9731
Given a Dirichlet distribution, the marginal distribution
9732
for each prevalence is a beta distribution, which we can
9733
compute like this:
9734

9735
\begin{code}
9736
    def MarginalBeta(self, i):
9737
        alpha0 = self.params.sum()
9738
        alpha = self.params[i]
9739
        return Beta(alpha, alpha0-alpha)
9740
\end{code}
9741

9742
\py{i} is the index of the marginal distribution we want.
9743
\py{alpha0} is the sum of the parameters; \py{alpha} is the
9744
parameter for the given species.
9745
\index{marginal distribution}
9746

9747
In the example, the prior marginal distribution for each species
9748
is \py{Beta(1, 2)}.  We can compute the prior means like
9749
this:
9750

9751
\begin{code}
9752
    dirichlet = thinkbayes.Dirichlet(3)
9753
    for i in range(3):
9754
        beta = dirichlet.MarginalBeta(i)
9755
        print(beta.Mean())
9756
\end{code}
9757

9758
As expected, the prior mean prevalence for each species is 1/3.
9759

9760
To update the Dirichlet distribution, we add the
9761
observations to the parameters like this:
9762

9763
\begin{code}
9764
    def Update(self, data):
9765
        m = len(data)
9766
        self.params[:m] += data
9767
\end{code}
9768

9769
Here \py{data} is a sequence of counts in the same order as
\py{params}, so in this example, it should be the number of lions,
9771
tigers and bears.
9772

9773
\py{data} can be shorter than \py{params}; in that
9774
case there are some species that have not been
9775
observed.
9776

9777
Here's code that updates \py{dirichlet} with the observed data and
9778
computes the posterior marginal distributions.
9779

9780
\begin{code}
9781
    data = [3, 2, 1]
9782
    dirichlet.Update(data)
9783

9784
    for i in range(3):
9785
        beta = dirichlet.MarginalBeta(i)
9786
        pmf = beta.MakePmf()
9787
        print(i, pmf.Mean())
9788
\end{code}
9789

9790
\begin{figure}
9791
% species.py
9792
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
9793
\caption{Distribution of prevalences for three species.}
9794
\label{fig.species1}
9795
\end{figure}
9796

9797
Figure~\ref{fig.species1} shows the results.  The posterior
9798
mean prevalences are 44\%, 33\%, and 22\%.
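
As a quick check of these numbers, here is a minimal sketch that
performs the same update directly on a NumPy array instead of using
the \py{Dirichlet} class.  It assumes Python 3, and the variable
names are illustrative; they are not part of \py{thinkbayes.py}.

```python
import numpy as np

# Prior Dirichlet parameters for three species (all 1).
prior_params = np.ones(3)

# Observed counts: lions, tigers, bears.
counts = np.array([3, 2, 1])

# Conjugate update: add the counts to the parameters.
posterior_params = prior_params + counts

# The marginal for species i is Beta(alpha_i, alpha_0 - alpha_i),
# so its mean is alpha_i / alpha_0.
posterior_means = posterior_params / posterior_params.sum()
print(posterior_means)
```

The posterior parameters are 4, 3, and 2, so the means are 4/9, 3/9,
and 2/9, which round to the percentages above.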
9799

9800

9801
\section{The hierarchical version}
9802

9803
We have solved a simplified version of the problem: if we
9804
know how many species there are, we can estimate the prevalence
9805
of each.
9806
\index{prevalence}
9807

9808
Now let's get back to the original problem, estimating the total
9809
number of species.  To solve this problem I'll define a meta-Suite,
9810
which is a Suite that contains other Suites as hypotheses.  In this
9811
case, the top-level Suite contains hypotheses about the number of
9812
species; the bottom level contains hypotheses about prevalences.
9813
\index{hierarchical model}
9814
\index{meta-Suite}
9815

9816
Here's the class definition:
9817

9818
\begin{code}
9819
class Species(thinkbayes.Suite):
9820

9821
    def __init__(self, ns):
9822
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
9823
        thinkbayes.Suite.__init__(self, hypos)
9824
\end{code}
9825

9826
\verb"__init__" takes a list of possible values for \py{n} and
9827
makes a list of Dirichlet objects.
9828

9829
Here's the code that creates the top-level suite:
9830

9831
\begin{code}
9832
    ns = range(3, 30)
9833
    suite = Species(ns)
9834
\end{code}
9835

9836
\py{ns} is the list of possible values for \py{n}.  We have seen 3
9837
species, so there have to be at least that many.  I chose an upper
9838
bound that seems reasonable, but we will check later that the
9839
probability of exceeding this bound is low.  And at least initially
9840
we assume that any value in this range is equally likely.
9841

9842
To update a hierarchical model, you have to update all levels.
9843
Usually you have to update the bottom
9844
level first and work up, but in this case we can
9845
update the top level first:
9846

9847
\begin{code}
9848
# class Species
9849

9850
    def Update(self, data):
9851
        thinkbayes.Suite.Update(self, data)
9852
        for hypo in self.Values():
9853
            hypo.Update(data)
9854
\end{code}
9855

9856
\py{Species.Update} invokes \py{Update} in the parent class,
9857
then loops through the sub-hypotheses and updates them.
9858

9859
Now all we need is a likelihood function:
9860

9861
\begin{code}
9862
# class Species
9863

9864
    def Likelihood(self, data, hypo):
9865
        dirichlet = hypo
9866
        like = 0
9867
        for i in range(1000):
9868
            like += dirichlet.Likelihood(data)
9869

9870
        return like
9871
\end{code}
9872

9873
\py{data} is a sequence of
9874
observed counts; \py{hypo} is a Dirichlet object.
9875
\py{Species.Likelihood} calls
9876
\py{Dirichlet.Likelihood} 1000 times and returns the total.
9877

9878
Why call it 1000 times?  Because \py{Dirichlet.Likelihood}
doesn't actually compute the likelihood of the
9880
data under the whole Dirichlet distribution.  Instead, it draws one
9881
sample from the hypothetical distribution and computes the likelihood
9882
of the data under the sampled set of prevalences.
9883

9884
Here's what it looks like:
9885

9886
\begin{code}
9887
# class Dirichlet
9888

9889
    def Likelihood(self, data):
9890
        m = len(data)
9891
        if self.n < m:
9892
            return 0
9893

9894
        x = data
9895
        p = self.Random()
9896
        q = p[:m]**x
9897
        return q.prod()
9898
\end{code}
9899

9900
The length of \py{data} is the number of species observed.  If
9901
we see more species than we thought existed, the likelihood is 0.
9902

9903
\index{multinomial distribution}
9904
Otherwise we select a random set of prevalences, \py{p}, and
9905
compute the multinomial PMF, which is
9906
%
9907
\[ c_x  p_1^{x_1} \cdots p_n^{x_n} \]
9908
%
9909
$p_i$ is the prevalence of the $i$th species, and $x_i$ is the
9910
observed number.  The first term, $c_x$, is the multinomial
9911
coefficient; I leave it out of the computation because it is
9912
a multiplicative factor that depends only
9913
on the data, not the hypothesis, so it gets normalized away
9914
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
9915
\index{multinomial coefficient}
9916

9917
\py{m} is the number of observed species.
9918
We only need the first \py{m} elements of \py{p};
9919
for the others, $x_i$ is 0, so
9920
$p_i^{x_i}$ is 1, and we can leave them out of the product.
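
The sampling idea can be sketched in a self-contained way as follows.
This is not the code from \py{thinkbayes.py}: it assumes Python 3,
uses NumPy's built-in Dirichlet sampler, and defines a hypothetical
helper, \verb"estimate_likelihood".  It also averages the sampled
likelihoods rather than summing them, which differs only by a
constant factor.

```python
import numpy as np

def estimate_likelihood(n, data, iters=1000, rng=None):
    """Monte Carlo estimate of the likelihood of `data` given `n` species.

    Hypothetical helper: draws prevalences from a uniform Dirichlet
    prior and averages p_1**x_1 * ... * p_m**x_m over the draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    m = len(data)
    if n < m:
        return 0.0  # more species observed than hypothesized

    ps = rng.dirichlet(np.ones(n), size=iters)   # shape (iters, n)
    likes = np.prod(ps[:, :m] ** data, axis=1)   # unseen species contribute 1
    return likes.mean()

rng = np.random.default_rng(17)
like3 = estimate_likelihood(3, [3, 2, 1], rng=rng)
like2 = estimate_likelihood(2, [3, 2, 1], rng=rng)
print(like3, like2)
```

With only two hypothetical species, three observed species are
impossible, so the second estimate is exactly 0.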
9921

9922

9923
\section{Random sampling}
9924
\label{randomdir}
9925

9926
There are two ways to generate a random sample from a Dirichlet
9927
distribution.  One is to use the marginal beta distributions, but in
9928
that case you have to select one at a time and scale the rest so they
9929
add up to 1 (see
9930
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
9931
\index{random sample}
9932

9933
A less obvious, but faster, way is to select values from \py{n} gamma
9934
distributions, then normalize by dividing through by the total.
9935
Here's the code:
9936
\index{numpy}
9937
\index{gamma distribution}
9938

9939
\begin{code}
9940
# class Dirichlet
9941

9942
    def Random(self):
9943
        p = numpy.random.gamma(self.params)
9944
        return p / p.sum()
9945
\end{code}
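
To illustrate why the gamma approach works, here is a small
stand-alone check, assuming Python 3 and NumPy's \py{Generator}
interface; the parameter values are just the posterior from the
running example.  Each normalized row adds up to 1, and the sample
means should be close to $\alpha_i / \alpha_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
params = np.array([4.0, 3.0, 2.0])

# Draw many gamma vectors (shape alpha_i, scale 1) and normalize each row.
gammas = rng.gamma(params, size=(10000, 3))
samples = gammas / gammas.sum(axis=1, keepdims=True)

# Compare the sample means with the theoretical means alpha_i / alpha_0.
print(samples.mean(axis=0))
print(params / params.sum())
```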
9946

9947
Now we're ready to look at some results.  Here is the code that extracts
9948
the posterior distribution of \py{n}:
9949

9950
\begin{code}
9951
    def DistOfN(self):
9952
        pmf = thinkbayes.Pmf()
9953
        for hypo, prob in self.Items():
9954
            pmf.Set(hypo.n, prob)
9955
        return pmf
9956
\end{code}
9957

9958
\py{DistOfN} iterates
9959
through the top-level hypotheses and accumulates the probability
9960
of each \py{n}.
9961

9962
\begin{figure}
9963
% species.py
9964
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
9965
\caption{Posterior distribution of \py{n}.}
9966
\label{fig.species2}
9967
\end{figure}
9968

9969
Figure~\ref{fig.species2} shows the result.  The most likely value is 4.
9970
Values from 3 to 7 are reasonably likely; after that the probabilities
9971
drop off quickly.  The probability that there are 29 species is
9972
low enough to be negligible; if we chose a higher bound,
9973
we would get nearly the same result.
9974

9975
Remember that this result is based on a uniform prior for \py{n}.  If
9976
we have background information about the number of species in the
9977
environment, we might choose a different prior.
\index{uniform distribution}
9979

9980

9981
\section{Optimization}
9982

9983
I have to admit that I am proud of this example.  The Unseen Species
9984
problem is not easy, and I think this solution is simple and clear,
9985
and takes surprisingly few lines of code (about 50 so far).
9986

9987
The only problem is that it is slow.  It's good enough for the example
9988
with only 3 observed species, but not good enough for the belly button
9989
data, with more than 100 species in some samples.
9990

9991
The next few sections present a series of optimizations we need to
9992
make this solution scale.  Before we get into the details, here's
9993
a road map.
9994
\index{optimization}
9995

9996
\begin{itemize}
9997

9998
\item The first step is to recognize that if we update the Dirichlet
9999
  distributions with the same data, the first \py{m} parameters are
10000
  the same for all of them.  The only difference is the number of
10001
  hypothetical unseen species.  So we don't really need a separate
  Dirichlet object for each value of \py{n}; we can store the parameters
  in the top level of the hierarchy.  \py{Species2} implements this optimization.
10004

10005
\item \py{Species2} also uses the same set of random values for all
10006
  of the hypotheses.  This saves time generating random values, but it
10007
  has a second benefit that turns out to be more important: by giving
10008
  all hypotheses the same selection from the sample space, we make
10009
  the comparison between the hypotheses more fair, so it takes
10010
  fewer iterations to converge.
10011

10012
\item Even with these changes there is a major performance problem.
10013
  As the number of observed species increases, the array of random
10014
  prevalences gets bigger, and the chance of choosing one that is
10015
  approximately right becomes small.  So the vast majority of
10016
  iterations yield small likelihoods that don't contribute much to the
10017
  total, and don't discriminate between hypotheses.
10018

10019
  The solution is to do the updates one species at a time.
  \py{Species4} is a simple implementation of this strategy using
10021
  Dirichlet objects to represent the sub-hypotheses.
10022

10023
\item Finally, \py{Species5} combines the sub-hypotheses into the top
10024
  level and uses \py{numpy} array operations to speed things up.
10025
\index{numpy}
10026

10027
\end{itemize}
10028

10029
If you are not interested in the details, feel free to skip to
10030
Section~\ref{belly} where we look at results from the belly
10031
button data.
10032

10033

10034
\section{Collapsing the hierarchy}
10035
\label{collapsing}
10036

10037
All of the bottom-level Dirichlet distributions are updated
10038
with the same data, so the first \py{m} parameters are the same for
10039
all of them.
10040
We can eliminate them and merge the parameters into
10041
the top-level suite.  \py{Species2} implements this optimization:
10042
\index{numpy}
10043

10044
\begin{code}
10045
class Species2(object):
10046

10047
    def __init__(self, ns):
10048
        self.ns = ns
10049
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        # one parameter per species, up to the largest hypothetical n
        self.params = numpy.ones(self.ns[-1], dtype=int)
10051
\end{code}
10052

10053
\py{ns} is the list of hypothetical values for \py{n};
10054
\py{probs} is the list of corresponding probabilities.  And
10055
\py{params} is the sequence of Dirichlet parameters, initially
10056
all 1.
10057

10058
\py{Species2.Update} updates both levels of
10059
the hierarchy: first the probability for each value of \py{n},
10060
then the Dirichlet parameters:
10061
\index{numpy}
10062

10063
\begin{code}
10064
# class Species2
10065

10066
    def Update(self, data):
10067
        like = numpy.zeros(len(self.ns), dtype=numpy.double)
10068
        for i in range(1000):
10069
            like += self.SampleLikelihood(data)
10070

10071
        self.probs *= like
10072
        self.probs /= self.probs.sum()
10073

10074
        m = len(data)
10075
        self.params[:m] += data
10076
\end{code}
10077

10078
\py{SampleLikelihood} returns an array of likelihoods, one for each
10079
value of \py{n}.  \py{like} accumulates the total likelihood for
10080
1000 samples.  \py{self.probs} is multiplied by the total likelihood,
10081
then normalized.  The last two lines, which update the parameters,
10082
are the same as in \py{Dirichlet.Update}.
10083

10084
Now let's look at \py{SampleLikelihood}.  There are two
10085
opportunities for optimization here:
10086

10087
\begin{itemize}
10088

10089
\item When the hypothetical number of species, \py{n},
10090
exceeds the observed number, \py{m}, we only need the first \py{m}
10091
terms of the multinomial PMF; the rest are 1.
10092

10093
\item If the number of species is large, the likelihood of the data
10094
  might be too small for floating-point (see Section~\ref{underflow}).  So it
10095
  is safer to compute log-likelihoods.
10096
  \index{log-likelihood} \index{underflow}
10097

10098
\end{itemize}
10099

10100
\index{multinomial distribution}
10101
Again, the multinomial PMF is
10102
%
10103
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
10104
%
10105
So the log-likelihood is
10106
%
10107
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
10108
%
10109
which is fast and easy to compute.  Again, $c_x$
10110
is the same for all hypotheses, so we can drop it.
10111
Here's the code:
10112
\index{numpy}
10113

10114
\begin{code}
10115
# class Species2
10116

10117
    def SampleLikelihood(self, data):
10118
        gammas = numpy.random.gamma(self.params)
10119

10120
        m = len(data)
10121
        row = gammas[:m]
10122
        col = numpy.cumsum(gammas)
10123

10124
        log_likes = []
10125
        for n in self.ns:
10126
            ps = row / col[n-1]
10127
            terms = data * numpy.log(ps)
10128
            log_like = terms.sum()
10129
            log_likes.append(log_like)
10130

10131
        log_likes = numpy.array(log_likes) - numpy.max(log_likes)
10132
        likes = numpy.exp(log_likes)
10133

10134
        coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
10135
        likes *= coefs
10136

10137
        return likes
10138
\end{code}
10139

10140
\py{gammas} is an array of values from a gamma distribution; its
10141
length is the largest hypothetical value of \py{n}.  \py{row} is
10142
just the first \py{m} elements of \py{gammas}; since these are the
10143
only elements that depend on the data, they are the only ones we need.
10144
\index{gamma distribution}
10145

10146
For each value of \py{n} we need to divide \py{row} by the
10147
total of the first \py{n} values from \py{gammas}.  \py{cumsum}
10148
computes these cumulative sums and stores them in \py{col}.
10149
\index{cumulative sum}
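
Here is a small sketch of the cumulative-sum trick on its own,
assuming Python 3; the range of \py{ns} and the value of \py{m} are
made up for illustration.  For each \py{n}, dividing by \py{col[n-1]}
gives the same prevalences as normalizing the first \py{n} gamma
values directly.

```python
import numpy as np

rng = np.random.default_rng(2)
ns = range(3, 8)                      # hypothetical values of n
m = 3                                 # number of observed species

gammas = rng.gamma(np.ones(ns[-1]))   # one draw per possible species
row = gammas[:m]
col = np.cumsum(gammas)               # col[n-1] is the sum of the first n gammas

for n in ns:
    ps = row / col[n-1]               # prevalences of observed species under n
    # Same result as normalizing the first n gammas directly.
    direct = (gammas[:n] / gammas[:n].sum())[:m]
    assert np.allclose(ps, direct)
```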
10150

10151
The loop iterates through the values of \py{n} and accumulates
10152
a list of log-likelihoods.
10153
\index{log-likelihood}
10154

10155
Inside the loop, \py{ps} contains the row of probabilities, normalized
10156
with the appropriate cumulative sum.  \py{terms} contains the
10157
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
10158
their sum.
10159

10160
After the loop, we want to convert the log-likelihoods to linear
10161
likelihoods, but first it's a good idea to shift them so the largest
10162
log-likelihood is 0; that way the linear likelihoods are not too
10163
small (see Section~\ref{underflow}).
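
The effect of the shift is easy to see with a few made-up
log-likelihoods (the values below are hypothetical, chosen only to
be too negative to exponentiate directly):

```python
import numpy as np

# Hypothetical log-likelihoods, far too negative to exponentiate directly.
log_likes = np.array([-1200.0, -1201.0, -1205.0])

naive = np.exp(log_likes)                      # underflows to all zeros
shifted = np.exp(log_likes - log_likes.max())  # largest becomes exp(0) = 1

print(naive)
print(shifted)
```

The shifted likelihoods keep the right ratios, which is all that
matters because the probabilities are normalized afterward.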
10164

10165
Finally, before we return the likelihood, we have to apply a correction
10166
factor, which is the number of ways we could have observed these \py{m}
10167
species, if the total number of species is \py{n}.
10168
\py{BinomialCoef} computes ``n choose m'', which is written
10169
$\binom{n}{m}$.
10170
\index{binomial coefficient}
10171

10172
As often happens, the optimized version is less readable and more
10173
error-prone than the original.  But that's one reason I think it is
10174
a good idea to start with the simple version; we can use it for
10175
regression testing.  I plotted results from both versions and confirmed
10176
that they are approximately equal, and that they converge as the
10177
number of iterations increases.
10178
\index{regression testing}
10179

10180

10181
\section{One more problem}
10182

10183
There's more we could do to optimize this code, but there's another
10184
problem we need to fix first.  As the number of observed
10185
species increases, this version gets noisier and takes more
10186
iterations to converge on a good answer.
10187

10188
The problem is that if the prevalences we choose from the Dirichlet
10189
distribution, the \py{ps}, are not at least approximately right,
10190
the likelihood of the observed data is close to zero and almost
10191
equally bad for all values of \py{n}.  So most iterations don't
10192
provide any useful contribution to the total likelihood.  And as the
10193
number of observed species, \py{m}, gets large, the probability of
10194
choosing \py{ps} with non-negligible likelihood gets small.  Really
10195
small.
10196

10197
Fortunately, there is a solution.  Remember that if you observe
10198
a set of data, you can update the prior distribution with the
10199
entire dataset, or you can break it up into a series of updates
10200
with subsets of the data, and the result is the same either way.
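
As a reminder of why this works, here is a minimal sketch with a
grid of hypotheses about a single prevalence, assuming Python 3.
The grid, the data, and the helper \py{update} are all illustrative.
One update with the whole dataset and a series of one-observation
updates produce the same posterior.

```python
import numpy as np

ps = np.linspace(0.01, 0.99, 99)      # hypothetical prevalence of one species
prior = np.ones_like(ps) / len(ps)

def update(dist, like):
    """Multiply by the likelihood and renormalize."""
    post = dist * like
    return post / post.sum()

# One update with the whole dataset: 3 sightings and 2 non-sightings.
batch = update(prior, ps**3 * (1 - ps)**2)

# The same data, one observation at a time.
seq = prior
for seen in [True, True, True, False, False]:
    seq = update(seq, ps if seen else 1 - ps)

print(np.allclose(batch, seq))
```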
10201

10202
For this example, the key is to perform the updates one species at
10203
a time.  That way when we generate a random set of \py{ps}, only
10204
one of them affects the computed likelihood, so the chance of choosing
10205
a good one is much better.
10206

10207
Here's a new version that updates one species at a time:
10208
\index{numpy}
10209

10210
\begin{code}
10211
class Species4(Species):
10212

10213
    def Update(self, data):
10214
        m = len(data)
10215

10216
        for i in range(m):
10217
            one = numpy.zeros(i+1)
10218
            one[i] = data[i]
10219
            Species.Update(self, one)
10220
\end{code}
10221

10222
This version inherits \verb"__init__" from \py{Species}, so it
10223
represents the hypotheses as a list of Dirichlet objects (unlike
10224
\py{Species2}).
10225

10226
\py{Update} loops through the observed species and makes an
10227
array, \py{one}, with all zeros and one species count.  Then
10228
it calls \py{Update} in the parent class, which computes
10229
the likelihoods and updates the sub-hypotheses.
10230

10231
So in the running example, we do three updates.  The first
10232
is something like ``I have seen three lions.''  The second is
10233
``I have seen two tigers and no additional lions.''  And the third
10234
is ``I have seen one bear and no more lions and tigers.''
10235

10236
Here's the new version of \py{Likelihood}:
10237

10238
\begin{code}
10239
# class Species4
10240

10241
    def Likelihood(self, data, hypo):
10242
        dirichlet = hypo
10243
        like = 0
10244
        for i in range(1000):
10245
            like += dirichlet.Likelihood(data)
10246

10247
        # correct for the number of unseen species the new one
10248
        # could have been
10249
        m = len(data)
10250
        num_unseen = dirichlet.n - m + 1
10251
        like *= num_unseen
10252

10253
        return like
10254
\end{code}
10255

10256
This is almost the same as \py{Species.Likelihood}.  The difference
10257
is the factor, \verb"num_unseen".  This correction is necessary
10258
because each time we see a species for the first time, we have to
10259
consider that there were some number of other unseen species that
10260
we might have seen.  For larger values of \py{n} there are more
10261
unseen species that we could have seen, which increases the likelihood
10262
of the data.
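
To see how this factor relates to the binomial coefficient in
\py{Species2.SampleLikelihood}, here is a short derivation; it
assumes all $m$ observed species are processed, one per update.
For the $j$th species, the data has length $j$, so the factor is
$n - j + 1$.  Multiplying the factors for all $m$ updates gives
%
\[ \prod_{j=1}^{m} (n - j + 1) = \frac{n!}{(n-m)!} = m! \binom{n}{m} \]
%
Since $m!$ is the same for every hypothesis, it is normalized away,
and the net correction is proportional to $\binom{n}{m}$.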
10263

10264
This is a subtle point and I have to admit that I did not get it right
10265
the first time.  But again I was able to validate this version
10266
by comparing it to the previous versions.
10267
\index{regression testing}
10268

10269

10270
\section{We're not done yet}
10271

10272
\newcommand{\BigO}[1]{\mathcal{O}(#1)}
10273

10274
Performing the updates one species at a time solves one problem, but
10275
it creates another.  Each update takes time proportional to $k m$,
10276
where $k$ is the number of hypotheses and $m$ is the number of observed
10277
species.  So if we do $m$ updates, the total run time is
10278
proportional to $k m^2$.
10279

10280
But we can speed things up using the same trick we used in
10281
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
10282
collapse the two levels of the hierarchy into a single object.  So
10283
here's yet another version of \py{Species}:
10284

10285
\begin{code}
10286
class Species5(Species2):
10287

10288
    def Update(self, data):
10289
        m = len(data)
10290
        for i in range(m):
10291
            self.UpdateOne(i+1, data[i])
10292
            self.params[i] += data[i]
10293
\end{code}
10294

10295
This version inherits \verb"__init__" from \py{Species2}, so
10296
it uses \py{ns} and \py{probs} to represent the distribution
10297
of \py{n}, and \py{params} to represent the parameters of
10298
the Dirichlet distribution.
10299

10300
\py{Update} is similar to what we saw in the previous section.
10301
It loops through the observed species and calls \py{UpdateOne}:
10302
\index{numpy}
10303

10304
\begin{code}
10305
# class Species5
10306

10307
    def UpdateOne(self, i, count):
10308
        likes = numpy.zeros(len(self.ns), dtype=numpy.double)
10309
        for _ in range(1000):
10310
            likes += self.SampleLikelihood(i, count)
10311

10312
        unseen_species = [n-i+1 for n in self.ns]
10313
        likes *= unseen_species
10314

10315
        self.probs *= likes
10316
        self.probs /= self.probs.sum()
10317
\end{code}
10318

10319
This function is similar to \py{Species2.Update}, with two changes:
10320

10321
\begin{itemize}
10322

10323
\item The interface is different.  Instead of the whole dataset, we
10324
  get \py{i}, the index of the observed species, and \py{count},
10325
  how many of that species we've seen.
10326

10327
\item We have to apply a correction factor for the number of unseen
10328
  species, as in \py{Species4.Likelihood}.  The difference here is
10329
  that we update all of the likelihoods at once with array
10330
  multiplication.
10331

10332
\end{itemize}
10333

10334
Finally, here's \py{SampleLikelihood}:
10335
\index{numpy}
10336

10337
\begin{code}
10338
# class Species5
10339

10340
    def SampleLikelihood(self, i, count):
10341
        gammas = numpy.random.gamma(self.params)
10342

10343
        sums = numpy.cumsum(gammas)[self.ns[0]-1:]
10344

10345
        ps = gammas[i-1] / sums
10346
        log_likes = numpy.log(ps) * count
10347

10348
        log_likes -= numpy.max(log_likes)
10349
        likes = numpy.exp(log_likes)
10350

10351
        return likes
10352
\end{code}
10353

10354
This is similar to \py{Species2.SampleLikelihood}; the
10355
difference is that each update only includes a single species,
10356
so we don't need a loop.
10357

10358
The run time of this function is proportional to the number
10359
of hypotheses, $k$.  It runs $m$ times, so the run time of
10360
the update is proportional to $k m$.
10361
And the number of iterations we
10362
need to get an accurate result is usually small.
10363

10364

10365
\section{The belly button data}
10366
\label{belly}
10367

10368
That's enough about lions and tigers and bears.
10369
Let's get back to belly buttons.  To get a sense of what the
10370
data look like, consider subject B1242,
10371
whose sample of 400 reads yielded 61 species with the following
10372
counts:
10373

10374
\begin{code}
10375
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
10376
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
10377
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
10378
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
10379
\end{code}
10380

10381
There are a few dominant species that make up a large
10382
fraction of the whole, but many species that yielded only
10383
a single read.  The number of these ``singletons'' suggests
10384
that there are likely to be at least a few unseen species.
10385
\index{species}
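
The summary numbers quoted here can be checked directly from the
counts.  This is a small sketch in Python 3; the variable names are
illustrative.

```python
# Read counts for subject B1242, copied from the listing above.
counts = [92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
          4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

num_reads = sum(counts)                 # total number of reads
num_species = len(counts)               # number of observed species
num_singletons = counts.count(1)        # species seen exactly once
top_fraction = counts[0] / num_reads    # share of the most common species

print(num_reads, num_species, num_singletons, top_fraction)
```

About half of the observed species are singletons, and the most
common species accounts for 23\% of the reads.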
10386

10387
In the example with lions and tigers, we assume that each
10388
animal in the preserve is equally likely to be observed.
10389
Similarly, for the belly button data, we assume that each
10390
bacterium is equally likely to yield a read.
10391

10392
In reality, each step in the data-collection
10393
process might introduce biases.  Some species might
10394
be more likely to be picked up by a swab, or to yield identifiable
10395
amplicons.  So when we talk about the prevalence of each species,
10396
we should remember this source of error.
10397
\index{sample bias}
10398

10399
I should also acknowledge that I am using the term ``species''
10400
loosely.  First, bacterial species are not well defined.  Second,
10401
some reads identify a particular species, others only identify
10402
a genus.  To be more precise, I should say ``operational
10403
taxonomic unit'', or OTU.
10404
\index{operational taxonomic unit}
10405
\index{OTU}
10406

10407
Now let's process some of the belly button data.  I define
10408
a class called \py{Subject} to represent information about
10409
each subject in the study:
10410

10411
\begin{code}
10412
class Subject(object):
10413

10414
    def __init__(self, code):
10415
        self.code = code
10416
        self.species = []
10417
\end{code}
10418

10419
Each subject has a string code, like ``B1242'', and a list of
10420
(count, species name) pairs, sorted in increasing order by count.
10421
\py{Subject} provides several methods to make it
10422
easy to access these counts and species names.  You can see the details
10423
in \url{http://thinkbayes.com/species.py}.
10424
  For more information
10425
see Section~\ref{download}.
10426

10427
\begin{figure}
10428
% species.py
10429
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
10430
\caption{Distribution of \py{n} for subject B1242.}
10431
\label{species-ndist}
10432
\end{figure}
10433

10434
\py{Subject} provides a method named \py{Process} that creates and
10435
updates a \py{Species5} suite,
10436
which represents the distributions of \py{n} and the prevalences.
10437
\index{prevalence}
10438

10439
And \py{Species2} provides \py{DistOfN}, which returns the posterior
distribution of \py{n}.

\begin{code}
# class Species2

    def DistOfN(self):
10446
        items = zip(self.ns, self.probs)
10447
        pmf = thinkbayes.MakePmfFromItems(items)
10448
        return pmf
10449
\end{code}
10450

10451
Figure~\ref{species-ndist} shows the distribution of \py{n} for
10452
subject B1242.  The probability that there are exactly 61 species, and
10453
no unseen species, is nearly zero.  The most likely value is 72, with
10454
90\% credible interval 66 to 79.  At the high end, it is unlikely that
10455
there are as many as 87 species.
10456

Next we compute the posterior distribution of prevalence for
each species.  \py{Species2} provides \py{DistOfPrevalence}:

\begin{code}
# class Species2

    def DistOfPrevalence(self, index):
        metapmf = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            metapmf.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(metapmf)
        return metapmf, mix
\end{code}

\py{index} indicates which species we want.  For each
\py{n}, we have a different posterior distribution
of prevalence.

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
\caption{Distribution of prevalences for subject B1242.}
\label{species-prev}
\end{figure}

The loop iterates through the possible values of \py{n}
and their probabilities.  For each value of \py{n} it gets
a Beta object representing the marginal distribution for the
indicated species.  Remember that Beta objects contain the
parameters \py{alpha} and \py{beta}; they don't have
values and probabilities like a Pmf, but they provide \py{MakePmf},
which generates a discrete approximation to the continuous
beta distribution.
\index{Beta object}
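
To make the idea of a discrete approximation concrete, here is a
minimal sketch that evaluates the unnormalized beta density on an
evenly spaced grid and normalizes the result.  The function
\verb"beta_pmf" is hypothetical; it is not the implementation of
\py{MakePmf}, and it assumes both parameters are at least 1, so the
density is finite at 0 and 1.

\begin{code}
def beta_pmf(alpha, beta, steps=101):
    """Approximate Beta(alpha, beta) on an evenly spaced grid.

    Assumes alpha >= 1 and beta >= 1.
    Returns a dict that maps each grid point to its probability.
    """
    xs = [i / (steps - 1.0) for i in range(steps)]
    # unnormalized density: x**(alpha-1) * (1-x)**(beta-1)
    weights = [x**(alpha-1) * (1-x)**(beta-1) for x in xs]
    total = sum(weights)
    return dict((x, w / total) for x, w in zip(xs, weights))

pmf = beta_pmf(2, 3)
mean = sum(x * p for x, p in pmf.items())
\end{code}

The mean of the approximation should be close to 0.4, which is the
mean of a beta distribution with parameters 2 and 3.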

\py{metapmf} is a meta-Pmf that contains the distributions
of prevalence, conditioned on \py{n}.  \py{MakeMixture}
collapses the meta-Pmf into \py{mix}, a single distribution
of prevalence that weights each conditional distribution by
the probability of the corresponding \py{n}.
\index{meta-Pmf}
\index{mixture}
\index{MakeMixture}
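
The weighted sum that a mixture represents can be sketched with plain
dictionaries.  In this example \verb"make_mixture" is a hypothetical
stand-in, not the \py{thinkbayes} implementation of \py{MakeMixture},
and the two conditional distributions and their weights are invented.

\begin{code}
def make_mixture(metapmf):
    """Combine weighted distributions into one distribution.

    metapmf: list of (pmf, weight) pairs, where each pmf is a
             dict from value to probability and the weights
             add up to 1
    """
    mix = {}
    for pmf, weight in metapmf:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0.0) + weight * prob
    return mix

# Invented distributions of prevalence for two values of n.
pmf_a = {0.1: 0.5, 0.2: 0.5}
pmf_b = {0.2: 0.25, 0.3: 0.75}

mix = make_mixture([(pmf_a, 0.4), (pmf_b, 0.6)])
\end{code}

The value 0.2 appears in both conditional distributions, so its
probability in the mixture is $0.4 \times 0.5 + 0.6 \times 0.25 = 0.35$.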

Figure~\ref{species-prev} shows results for the five
species with the most reads.  The most prevalent species accounts for
23\% of the 400 reads, but since there are almost certainly unseen
species, the most likely estimate for its prevalence is 20\%,
with 90\% credible interval between 17\% and 23\%.


\section{Predictive distributions}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
\caption{Simulated rarefaction curves for subject B1242.}
\label{species-rare}
\end{figure}

I introduced the unseen species problem in the form of four related
questions.  We have answered the first two by computing the posterior
distribution for \py{n} and the prevalence of each species.
\index{predictive distribution}

The other two questions are:

\begin{itemize}

\item If we are planning to collect additional reads, can we predict
  how many new species we are likely to discover?

\item How many additional reads are needed to increase the
  fraction of observed species to a given threshold?

\end{itemize}

To answer predictive questions like these, we can use the posterior
distributions to simulate possible future events and compute
predictive distributions for the number of species, and the fraction of
the total, we are likely to see.

The kernel of these simulations looks like this:
\index{simulation}

\begin{enumerate}

\item Choose \py{n} from its posterior distribution.

\item Choose a prevalence for each species, including possible unseen
  species, using the Dirichlet distribution.
\index{Dirichlet distribution}

\item Generate a random sequence of future observations.

\item Compute the number of new species, \verb"num_new", as a function
  of the number of additional reads, \py{k}.

\item Repeat the previous steps and accumulate the joint distribution
  of \verb"num_new" and \py{k}.
\index{joint distribution}

\end{enumerate}

And here's the code.  \py{RunSimulation} runs a single simulation:

\begin{code}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
\end{code}

\verb"num_reads" is the number of additional reads to simulate.
\py{m} is the number of seen species, and \py{seen} is a set of
strings with a unique name for each species.
\py{n} is a random value from the posterior distribution, and
\py{observations} is a random sequence of species names.

Each time through the loop, we add the new observation to
\py{seen} and record the number of reads and the number of
new species so far.

The result of \py{RunSimulation} is a {\bf rarefaction curve},
represented as a list of pairs with the number of reads and
the number of new species.
\index{rarefaction curve}
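
The following self-contained sketch mimics this loop on a toy
community.  It assumes the prevalences are already known and uses
\py{random.choices} from the standard library in place of the book's
sampling code; the species names, the prevalences, and the function
\verb"simulate_rarefaction" are all hypothetical.

\begin{code}
import random

def simulate_rarefaction(seen_names, prevalences, num_reads, rng):
    """Return a rarefaction curve as a list of (k, num_new) pairs.

    seen_names: names of the species observed so far
    prevalences: dict from every species name, seen or unseen,
                 to its prevalence
    rng: random.Random instance
    """
    m = len(seen_names)
    seen = set(seen_names)

    names = list(prevalences)
    weights = [prevalences[name] for name in names]
    observations = rng.choices(names, weights=weights, k=num_reads)

    curve = []
    for k, obs in enumerate(observations, start=1):
        seen.add(obs)
        curve.append((k, len(seen) - m))
    return curve

# Invented community: two seen species and two unseen ones.
prevalences = {'a-0': 0.6, 'b-1': 0.3,
               'unseen-2': 0.07, 'unseen-3': 0.03}

rng = random.Random(17)
curve = simulate_rarefaction(['a-0', 'b-1'], prevalences, 50, rng)
\end{code}

Because \py{seen} only grows, the number of new species in the curve
never decreases, and in this toy example it cannot exceed 2.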

Before we see the results, let's look at \py{GetSeenSpecies} and
\py{GenerateObservations}.

\begin{code}
# class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
\end{code}

\py{GetNames} returns the list of species names that appear in
the data files, but for many subjects these names are not unique.
So I use \py{SpeciesGenerator} to extend each name with a serial
number:
\index{generator}

\begin{code}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{code}

Given a name like \py{Corynebacterium}, \py{SpeciesGenerator} yields
a name like \py{Corynebacterium-1}, where the suffix is the position
of the name in the list, counting from 0.  When the list of names is
exhausted, it yields names like \py{unseen-62}.
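
A quick check with a made-up list of names shows the labels the
generator produces.  The definition is repeated so the example runs on
its own; the names are hypothetical, and a repeated genus is included
to show why the serial number is needed.

\begin{code}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1

names = ['Corynebacterium', 'Staphylococcus', 'Corynebacterium']
labels = list(SpeciesGenerator(names, 5))

# labels is now
# ['Corynebacterium-0', 'Staphylococcus-1',
#  'Corynebacterium-2', 'unseen-3', 'unseen-4']
\end{code}

The three names from the list come first, each with its own serial
number, followed by two placeholders for unseen species.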

Here is \py{GenerateObservations}:

\begin{code}
# class Subject

    def GenerateObservations(self, num_reads):
        n, prevalences = self.suite.SamplePosterior()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_reads)

        return n, observations
\end{code}

Again, \verb"num_reads" is the number of additional reads
to generate.  \py{n} and \py{prevalences} are samples from
the posterior distribution.

\py{cdf} is a Cdf object that maps species names, including the
unseen, to cumulative probabilities.  Using a Cdf makes it efficient
to generate a random sequence of species names.
\index{Cdf}
\index{cumulative probability}
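
One way to see why a cumulative representation makes sampling
efficient: once the cumulative probabilities are stored in a sorted
list, each random draw is a binary search.  The sketch below uses
\py{bisect} and \py{itertools.accumulate} from the standard library;
\verb"make_cdf" and \verb"sample" are hypothetical helpers, not the
methods of the \py{thinkbayes} Cdf class, and the prevalences are
invented.

\begin{code}
import bisect
import itertools
import random

def make_cdf(d):
    """Return (names, cumulative) for a dict of name -> probability."""
    names = sorted(d)
    total = float(sum(d.values()))
    probs = [d[name] / total for name in names]
    cumulative = list(itertools.accumulate(probs))
    return names, cumulative

def sample(names, cumulative, num, rng):
    """Draw num names using binary search on the cumulative list."""
    result = []
    for _ in range(num):
        i = bisect.bisect_left(cumulative, rng.random())
        # guard against round-off in the last cumulative value
        i = min(i, len(names) - 1)
        result.append(names[i])
    return result

d = {'a-0': 0.6, 'b-1': 0.3, 'unseen-2': 0.1}
names, cumulative = make_cdf(d)
observations = sample(names, cumulative, 1000, random.Random(17))
\end{code}

Building the cumulative list takes time proportional to the number of
species, but it is done once; each draw after that takes logarithmic
time.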

Finally, here is \py{Species2.SamplePosterior}:

\begin{code}
# class Species2

    def SamplePosterior(self):
        pmf = self.DistOfN()
        n = pmf.Random()
        prevalences = self.SamplePrevalences(n)
        return n, prevalences
\end{code}

And \py{SamplePrevalences}, which generates a sample of
prevalences conditioned on \py{n}:
\index{numpy}
\index{random sample}

\begin{code}
# class Species2

    def SamplePrevalences(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
\end{code}

We saw this algorithm for generating random values from a Dirichlet
distribution in Section~\ref{randomdir}.
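
For readers who want to run the algorithm on its own, here is a
minimal sketch that uses \py{random.gammavariate} from the standard
library instead of NumPy.  The function \verb"sample_prevalences" and
the parameter values are invented for this example; as in
\py{SamplePrevalences}, each gamma variate is drawn with scale 1 and
the results are normalized so they add up to 1.

\begin{code}
import random

def sample_prevalences(params, rng):
    """Draw one sample from a Dirichlet distribution.

    params: sequence of positive Dirichlet parameters
    rng: random.Random instance
    """
    # one gamma variate per species, with shape alpha and scale 1
    gammas = [rng.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(17)
prevalences = sample_prevalences([93, 54, 12, 1, 1], rng)
\end{code}

The result has one prevalence per parameter, and species with larger
parameters tend to get larger prevalences.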

Figure~\ref{species-rare} shows 100 simulated rarefaction curves
for subject B1242.  The curves are ``jittered''; that is, I shifted
each curve by a random offset so they
would not all overlap.  By inspection we can estimate that after
400 more reads we are likely to find 2--6 new species.


\section{Joint posterior}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
\caption{Distributions of the number of new species conditioned on
the number of additional reads.}
\label{species-cond}
\end{figure}

We can use these simulations to estimate the
joint distribution of \verb"num_new" and \py{k}, and from that
we can get the distribution of \verb"num_new" conditioned on any
value of \py{k}.
\index{joint distribution}

\begin{code}
def MakeJointPredictive(curves):
    joint = thinkbayes.Joint()
    for curve in curves:
        for k, num_new in curve:
            joint.Incr((k, num_new))
    joint.Normalize()
    return joint
\end{code}

\py{MakeJointPredictive} makes a Joint object, which is a
Pmf whose values are tuples.
\index{Joint object}

\py{curves} is a list of rarefaction curves created by
\py{RunSimulation}.  Each curve contains a list of pairs of
\py{k} and \verb"num_new".
\index{rarefaction curve}

The resulting joint distribution is a map from each pair to
its probability of occurring.  Given the joint distribution, we
can use \py{Joint.Conditional}
to get the distribution of \verb"num_new" conditioned on \py{k}
(see Section~\ref{conditional}).
\index{conditional distribution}

\py{MakeConditionals} takes a list of rarefaction curves and a
list of \py{ks}, and computes the conditional distribution of
\verb"num_new" for each \py{k}.  The result is a list of Cdf objects.

\begin{code}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
\end{code}

Figure~\ref{species-cond} shows the results.  After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5.  After 800 reads, we expect to see 3 to 12 new
species.


\section{Coverage}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}

The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}

To answer this question, we need a version of \py{RunSimulation}
that computes the fraction of observed species rather than the
number of new species.

\begin{code}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
\end{code}

Next we loop through each curve and make a dictionary, \py{d},
that maps from the number of additional reads, \py{k}, to
a list of \py{fracs}; that is, a list of values for the
coverage achieved after \py{k} reads.

\begin{code}
# class Subject

    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.items():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
\end{code}

Then for each value of \py{k} we make a Cdf of \py{fracs}; this Cdf
represents the distribution of coverage after \py{k} reads.

Remember that the CDF tells you the probability of falling below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it.  Figure~\ref{species-frac} shows
complementary CDFs for a range of values of \py{k}.
\index{complementary CDF}
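
As a small sketch of the calculation behind each curve, the function
below estimates the complementary CDF directly from a list of
simulated coverage values.  The name \verb"prob_exceeds" and the
sample values are hypothetical; this is one simple way to estimate the
probability, not the plotting code in \py{species.py}.

\begin{code}
def prob_exceeds(fracs, threshold):
    """Fraction of simulated coverage values above the threshold."""
    count = sum(1 for frac in fracs if frac > threshold)
    return count / float(len(fracs))

# Invented coverage values after some fixed number of reads.
fracs = [0.85, 0.88, 0.91, 0.92, 0.95]

p = prob_exceeds(fracs, 0.9)
\end{code}

Three of the five invented values exceed 0.9, so the estimated
probability of reaching 90\% coverage is 0.6.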

To read this figure, select the level of coverage you want to achieve
along the $x$-axis.  As an example, choose 90\%.
\index{coverage}

Now you can read up the chart to find the probability of achieving
90\% coverage after \py{k} reads.  For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage.  With 1000 reads, you
have a 90\% chance of getting 90\% coverage.

With that, we have answered the four questions that make up the unseen
species problem.  To validate the algorithms in this chapter with
real data, I had to deal with a few more details.  But
this chapter is already too long, so I won't discuss them here.

You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.

You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.  For more information
see Section~\ref{download}.


\section{Discussion}

The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution.  So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier.  I'm very happy about that.

My goal for this book is to present three related ideas:

\begin{itemize}

\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
  the idea of using probability distributions to represent uncertain
  beliefs, using data to update those distributions, and using the
  results to make predictions and inform decisions.

\item {\bf A computational approach}: The premise of this book is that
  it is easier to understand Bayesian analysis using computation
  rather than math, and easier to implement Bayesian methods with
  reusable building blocks that can be rearranged to solve real-world
  problems quickly.

\item {\bf Iterative modeling}: Most real-world problems involve
  modeling decisions and trade-offs between realism and complexity.
  It is often impossible to know ahead of time what factors should be
  included in the model and which can be abstracted away.  The best
  approach is to iterate, starting with simple models and adding
  complexity gradually, using each model to validate the others.

\end{itemize}

These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.

If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work.  I hope you find
them useful; let me know how it goes!


%\chapter{Future chapters}

%Bayesian regression (hybrid version with resampling?)
%\url{http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/}

%Change point detection:

%Deconvolution: Estimating round trip times

%Bayesian search

%Extension of the Euro problem: evaluating reddit items and redditors
%\url{http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/}

%Charles Darwin problem (capture-tag-recapture)
%\url{http://maximum-entropy-blog.blogspot.com/2012/04/capture-recapture-and-charles-darwin.html}

% http://camdp.com/blogs/how-solve-price-rights-showdown

% https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

% http://blog.yhathq.com/posts/estimating-user-lifetimes-with-pymc.html

\printindex

\end{document}
