% GitHub repository: allendowney/thinkbayes2
% Path: blob/master/book/book.tex
% LaTeX source for ``Think Bayes: Bayesian Statistics Made Simple''
% Second edition
% Copyright 2020 Allen B. Downey.

% License: Creative Commons
% Attribution-NonCommercial-ShareAlike 4.0 International
% http://creativecommons.org/licenses/by-nc-sa/4.0/
%

\documentclass[12pt]{book}

\title{Think Bayes}
\author{Allen B. Downey}

\newcommand{\thetitle}{Think Bayes}
\newcommand{\thesubtitle}{Bayesian Statistics Made Simple}
\newcommand{\theauthor}{Allen B. Downey}
\newcommand{\theversion}{Version 2.1.0}

%%%% Both LATEX and PLASTEX

\usepackage{booktabs}

\usepackage{graphicx}
\usepackage{setspace}

\usepackage{amsmath}
\usepackage{amsthm}

% format end of chapter exercises
\newtheoremstyle{exercise}
{12pt} % space above
{12pt} % space below
{} % body font
{} % indent amount
{\bfseries} % head font
{} % punctuation
{12pt} % head space
{} % custom head
\theoremstyle{exercise}
\newtheorem{exercise}{Exercise}[chapter]

\newif\ifplastex
\plastexfalse

%%%% PLASTEX ONLY
\ifplastex

\makeindex

\usepackage{localdef}

\usepackage{url}
\renewcommand{\href}[2]{\url{#1}}

\makeatletter
\newcount\anchorcnt
\newcommand*{\Anchor}[1]{%
\@bsphack%
\Hy@GlobalStepCount\anchorcnt%
\edef\@currentHref{anchor.\the\anchorcnt}%
\Hy@raisedlink{\hyper@anchorstart{\@currentHref}\hyper@anchorend}%
\M@gettitle{}\label{#1}%
\@esphack%
}
\makeatother

% code listing environments:
% we don't need these for plastex because they get replaced
% by preprocess.py
%\newenvironment{code}{\begin{verbatim}}{\end{verbatim}}
%\newenvironment{stdout}{\begin{verbatim}}{\end{verbatim}}

% inline syntax formatting
%\newcommand{\py}{\verb}%}
%\newcommand{\py}{\texttt}%}
\newcommand{\py}[1]{{\tt #1}}%{
\newcommand{\textcolor}[1]{\relax}

%%%% LATEX/HTML ONLY
\else

%BEGIN LATEX
\usepackage{comment}
\excludecomment{htmlonly}
\includecomment{latexonly}
%END LATEX

\input{latexonly.tex}

\fi

%%%% END OF PREAMBLE
\begin{document}

\frontmatter

%%%% PLASTEX ONLY
\ifplastex

\maketitle

%%%% LATEX/HTML ONLY
\else

\begin{latexonly}

%--half title-------------------------------------------------
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

\theversion

\vfill
\end{flushright}

%--verso------------------------------------------------------
\newpage
\thispagestyle{empty}

\quad

%--title page-------------------------------------------------
\newpage
\thispagestyle{empty}

\begin{flushright}
\vspace*{2.0in}

\begin{spacing}{3}
{\huge \thetitle} \\
{\Large \thesubtitle}
\end{spacing}

\vspace{0.25in}

\theversion

\vspace{1in}

{\Large \theauthor}

\vspace{0.5in}

{\Large Green Tea Press}

{\small Needham, Massachusetts}

\vfill
\end{flushright}
160
161
%--copyright--------------------------------------------------
162
\newpage
163
\thispagestyle{empty}
164
165
Copyright \copyright ~2020 \theauthor.
166
167
\vspace{0.2in}
168
169
\begin{flushleft}
170
Green Tea Press \\
171
9 Washburn Ave \\
172
Needham, MA 02492
173
\end{flushleft}
174
175
Permission is granted to copy, distribute, and/or modify this work under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which is available at \url{https://creativecommons.org/licenses/by-nc-sa/4.0/}.
176
177
178
The \LaTeX\ source for this book is available from
179
\url{http://greenteapress.com/thinkbayes2}.
180
181
%--table of contents------------------------------------------
182
183
\cleardoublepage
184
\setcounter{tocdepth}{1}
185
\tableofcontents
186
187
\end{latexonly}

%--HTML title page--------------------------------------------

\begin{htmlonly}

\vspace{1em}

{\Large \thetitle: \thesubtitle}

{\large \theauthor}

\theversion

\vspace{1em}

Copyright \copyright ~2020 \theauthor.

Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License, which is available at
\url{http://creativecommons.org/licenses/by-nc-sa/4.0/}.

\setcounter{chapter}{-1}

\end{htmlonly}

%-------------------------------------------------------------

%%%% END OF THE PART WE SKIP FOR PLASTEX
\fi


\chapter{Preface}
\label{preface}

\section{My theory, which is mine}

The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.

Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.
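
As a minimal sketch of what that looks like (the density, the grid size, and the names here are illustrative assumptions, not code from the book), the following loop approximates the mean of a continuous distribution with density $f(x) = 2x$ on the interval from 0 to 1. The exact answer, from the integral $\int_0^1 x \cdot 2x \, dx$, is 2/3.

```python
# Illustrative sketch: approximate an integral with a summation.
# The density f(x) = 2x on [0, 1] and the grid size n are assumptions
# chosen for this example.

def approximate_mean(n=1000):
    """Approximate the mean of the density f(x) = 2x on [0, 1]."""
    # Midpoints of n equal-width bins
    xs = [(i + 0.5) / n for i in range(n)]

    # Unnormalized probability for each grid point
    weights = [2 * x for x in xs]
    total = sum(weights)

    # The integral of x f(x) becomes a weighted sum
    mean = 0
    for x, w in zip(xs, weights):
        mean += x * w / total
    return mean

mean = approximate_mean()
print(mean)  # close to 2/3
```

Increasing \py{n} makes the approximation finer, at the cost of a longer loop.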
236
237
I think this presentation is easier to understand, at least for people with
238
programming skills. It is also more general, because when we make
239
modeling decisions, we can choose the most appropriate model without
240
worrying too much about whether the model lends itself to conventional
241
analysis.
242
243
Also, it provides a smooth development path from simple examples to
244
real-world problems. Chapter~\ref{estimation} is a good example. It
245
starts with a simple example involving dice, one of the staples of
246
basic probability. From there it proceeds in small steps to the
247
locomotive problem, which I borrowed from Mosteller's {\it
248
Fifty Challenging Problems in Probability with Solutions}, and from
249
there to the German tank problem, a famously successful application of
250
Bayesian methods during World War II.


\section{Modeling and approximation}

Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}

For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}

In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.

I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).

Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors. But for
real-world problems, numerical errors are almost always
smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.

On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.

So I recommend a general process with these steps:

\begin{enumerate}

\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.

\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.

\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider. You can review your code
and look for optimizations; for example, if you cache previously
computed results you might be able to avoid redundant computation.
Or you can look for analytic methods that yield computational
shortcuts.

\end{enumerate}

One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.

Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}
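
Here is a small sketch of that kind of regression test. The example (the expected value of a fair $n$-sided die) and the function names are assumptions made for illustration; the reference version is a plain loop, and the optimized version uses the closed-form result $(n+1)/2$.

```python
import math

def mean_reference(n):
    """Reference implementation: sum over all outcomes of a fair n-sided die."""
    total = 0
    for outcome in range(1, n + 1):
        total += outcome * (1 / n)
    return total

def mean_optimized(n):
    """Optimized implementation: analytic shortcut, constant time."""
    return (n + 1) / 2

# Regression test: the two versions should agree, at least approximately
for n in [4, 6, 8, 12, 20]:
    assert math.isclose(mean_reference(n), mean_optimized(n), rel_tol=1e-9)
```

The comparison uses a tolerance rather than exact equality because the loop accumulates floating-point rounding error that the analytic version does not.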


\section{Working with the code}
\label{codeinfo}

There are several ways you can work with the code in this book:

\begin{itemize}

\item If you don't have a programming environment where you can run Jupyter notebooks, and you don't want to create one, you can run the notebooks on Colab, which is an online service provided by Google. Colab lets you run Jupyter notebooks in a browser without installing anything.

\item If you have Python and Jupyter installed, you can download the code and run it on your computer.

\end{itemize}

To run the notebooks on Colab, you can follow the links at the end of each chapter, or you can start from \url{}, which has links to all of the notebooks.

If you already have Python and Jupyter, you can download the code from
my Git repository, at \url{https://github.com/AllenDowney/ThinkBayes}. Git is a version control system that allows you to keep track of the files that make up a project.
A collection of files under Git's control is
called a ``repository''.
GitHub is a hosting service that provides storage for Git repositories and a convenient web interface.

\index{repository}
\index{Git}
\index{GitHub}

The GitHub homepage for my repository provides several ways to download the code:

\begin{itemize}

\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}

\item Or you could clone
my repository. You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}

\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page. Or you can download the Zip file from \url{}.

\end{itemize}

If you don't have Python and Jupyter installed already, I recommend you install Anaconda, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it installs files in your home directory, so you don't need administrator privileges. You can download Anaconda from \url{https://www.anaconda.com/products/individual}.
\index{Anaconda}

If you install Anaconda, you will have most of the packages you need to run the code in this book.
To make sure you have everything you need (and the right versions), the best option is to create a Conda environment. And the best way to do that is to use the command line.
If you are not familiar with the command line, you might want to run the notebooks on Colab.

\begin{enumerate}

\item After downloading my repository, you should have a directory named \py{ThinkBayes2}. Use \py{cd} to move into that directory.

\item Use \py{ls} to confirm that you have a file named \py{environment.yml}. It lists the packages you need.

\item Run the following command to create an environment:

\begin{verbatim}
conda env create -f environment.yml
\end{verbatim}

\item Run the following command to activate the environment you just created:

\begin{verbatim}
conda activate ThinkBayes2
\end{verbatim}

\item To test your environment and make sure it has everything we need, run the following command:

\begin{verbatim}
python test_env.py
\end{verbatim}

\end{enumerate}

If you don't want to create an environment just for this book, you can install what you need using Conda.
The following commands should get everything you need:

\begin{verbatim}
conda install python jupyter pandas scipy matplotlib
pip install empiricaldist
\end{verbatim}

If you don't want to use Anaconda, you will need the following
packages:

\begin{itemize}

\item Jupyter to run the notebooks, \url{https://jupyter.org/};
\index{Jupyter}

\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}

\item SciPy for scientific computation, \url{http://www.scipy.org/};
\index{SciPy}

\item Pandas for working with data, \url{https://pandas.pydata.org/};
\index{Pandas}

\item matplotlib for visualization, \url{http://matplotlib.org/};
\index{matplotlib}

\item empiricaldist for representing distributions, \url{};
\index{empiricaldist}.
%TODO: add this URL

\end{itemize}

Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}
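
To check that the packages are visible to Python, a short script like the following can help. It is only a sketch, and it is not the \py{test\_env.py} script from the repository; the list of module names is an assumption based on the packages named in this section.

```python
import importlib.util

# Import names for the packages listed in this section (assumed to match
# the package names)
MODULES = ['jupyter', 'numpy', 'scipy', 'pandas', 'matplotlib', 'empiricaldist']

def check_modules(names):
    """Map each module name to True if Python can find it, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in check_modules(MODULES).items():
    print(name, 'found' if found else 'MISSING')
```

This only checks that each module can be located; it does not check versions.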



\section{Code style}

Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}

Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".

I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.

Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.

And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.


\section{Prerequisites}

There are several excellent modules for doing Bayesian statistics in
Python, including \py{pymc} and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.

Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code. Chapter~\ref{compstat} introduces \py{Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces \py{Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.

In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).



\section*{Contributor List}

If you have a suggestion or correction, please send email to
{\it downey@allendowney.com}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}

If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!

\small

\begin{itemize}

\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods. With his
permission, I use several problems from
his book as examples.

\item This book also benefited from my interactions with Sanjoy
Mahajan, especially in fall 2012, when I audited his class on
Bayesian Inference at Olin College.

\item I wrote parts of this book during project nights with the Boston
Python User Group, so I would like to thank them for their
company and pizza.

\item Olivier Yiptong sent several helpful suggestions.

\item Yuriy Pasichnyk found several errors.

\item Kristopher Overholt sent a long list of corrections and suggestions.

\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.

\item Markus Dobler pointed out that drawing cookies from a bowl
with replacement is an unrealistic scenario.

\item In spring 2013, students in my class, Computational Bayesian
Statistics, made many helpful corrections and suggestions: Kai
Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun
Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,
Brendan Ritter, and Evan Simpson.

\item Greg Marra and Matt Aasted helped me clarify the discussion of
{\it The Price is Right} problem.

\item Marcus Ogren pointed out that the original statement of the
locomotive problem was ambiguous.

\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the
book and found many opportunities for improvement.

\item Linda Pescatore found a typo and made some helpful suggestions.

\item Tomasz Miasko sent many excellent corrections and suggestions.

% ENDCONTRIB

\end{itemize}

Other people who spotted typos and small errors include
Tom Pollard,
Paul A. Giannaros,
Jonathan Edwards,
George Purkins,
Robert Marcus,
Ram Limbu,
James Lawry,
Ben Kahle,
Jeffrey Law, and
Alvaro Sanchez.

\normalsize

\newpage

% TABLE OF CONTENTS
\begin{latexonly}

\tableofcontents

\newpage

\end{latexonly}

% START THE BOOK
\mainmatter

\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}

\newcommand{\p}[1]{\ensuremath{\mathrm{p}(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}


\chapter{Bayes's Theorem}
\label{intro}

\section{Conditional probability}

The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive, provided that you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}

A probability is a number between 0 and 1 (including both) that
represents a degree of belief in a fact or prediction. The value
1 represents certainty that a fact is true, or that a prediction
will come true. The value 0 represents certainty
that the fact is false.
\index{degree of belief}

Intermediate values represent degrees of certainty. The value 0.5,
often written as 50\%, means that a predicted outcome is
as likely to happen as not.
For example, the probability that a tossed coin lands ``heads'' is close to 50\%.
\index{coin toss}

A conditional probability is a probability based on some relevant information. For example, suppose I toss two coins.
The probability that both coins land heads is 25\%.

But suppose I toss two coins and, without showing you the result, tell you that at least one of the coins is heads.
What is the probability that both are heads?
The answer is 1/3.

Here's how I got that: when I toss the coins, there are four equally likely outcomes: heads-heads, heads-tails, tails-heads, and tails-tails.
When I tell you that at least one coin is heads, that eliminates one outcome, tails-tails.

The remaining outcomes are heads-heads, heads-tails, and tails-heads, and they are still equally likely.
So the probability of heads-heads is 1/3.

That argument is correct, but if you don't find it entirely convincing, we'll come back to this problem and solve it more carefully using Bayes's theorem.

In this example, we computed the conditional probability of two heads, given the information that at least one coin is heads.

The usual notation for conditional probability is $\p{A|B}$, which
is the probability of $A$ given that $B$ is true. In this
example, $A$ represents the two heads, and $B$ is the condition that at least one coin is heads.
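
The counting argument for the two coins can also be checked by brute force. The following sketch (with variable names chosen for illustration) enumerates the four outcomes, keeps the ones with at least one head, and computes the fraction of those with two heads.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two coin tosses: HH, HT, TH, TT
outcomes = list(product('HT', repeat=2))

# Condition: at least one coin is heads
at_least_one = [o for o in outcomes if 'H' in o]

# Among those, count the outcomes where both coins are heads
both = [o for o in at_least_one if o == ('H', 'H')]

prob = Fraction(len(both), len(at_least_one))
print(prob)  # 1/3
```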


\section{Conjoint probability}

{\bf Conjoint probability} is a fancy way to say the probability that
two things are true. I'll use the notation $\p{A \AND B}$ to mean the
probability that $A$ and $B$ are both true.

\index{conjoint probability}
\index{probability!conjoint}

If you learned about probability in the context of coin tosses and
dice, you might have learned the following formula:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.

But this formula only works because in this case $A$ and $B$ are
independent; that is, knowing the first outcome does
not change the probability of the second. Or, more formally,
$\p{B|A} = \p{B}$.
\index{independence}
\index{dependence}

Here is a different example where the outcomes are not independent.
Suppose that $A$ means that it rains today and $B$ means that it
rains tomorrow. If I know that it rained today, it is more likely
that it will rain tomorrow, so $\p{B|A} > \p{B}$.

In general, the probability of a conjunction is
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
for any $A$ and $B$. So if the chance of rain on any given day
is 0.5, the chance of rain on two consecutive days is not
0.25, but probably a bit higher.
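
To make the rain example concrete, suppose (purely as an illustration; the number is an assumption, not data) that the probability of rain tomorrow, given rain today, is 0.6. Then
%
\[ \p{A \AND B} = \p{A}~\p{B|A} = (0.5)~(0.6) = 0.3 \]
%
which is higher than the 0.25 we would get by treating the two days as independent.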


\section{The cookie problem}
\label{cookie}

\index{Bayes's theorem}
\index{cookie problem}

We'll get to Bayes's theorem soon, but I want to motivate it with an
example called the cookie problem.\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.}

\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random.
The cookie is vanilla.
What is the probability that it came from Bowl 1?
\end{quote}

This is a conditional probability; we want $\p{\T{Bowl 1} |
\T{vanilla}}$, but it is not obvious how to compute it. If I asked a
different question---the probability of a vanilla cookie given Bowl
1---it would be easy:
%
\[ \p{\T{vanilla} | \T{Bowl 1}} = 3/4 \]
%
Sadly, $\p{A|B}$ is {\em not} the same as $\p{B|A}$, but there
is a way to get from one to the other: Bayes's theorem.


\section{Bayes's theorem}

\index{Bayes's theorem!derivation}
\index{conjunction}

Here's how we derive Bayes's theorem.
We'll start with the probability of a conjunction:
%
\[ \p{A \AND B} = \p{A}~\p{B|A} \]
%
Since we have not said anything about what $A$ and $B$ mean, they
are interchangeable.
Interchanging them yields
%
\[ \p{B \AND A} = \p{B}~\p{A|B} \]
%
Also, conjunction is commutative; that is
%
\[ \p{A \AND B} = \p{B \AND A} \]
%
That's all we need. Pulling those pieces together, we get
%
\[ \p{B}~\p{A|B} = \p{A}~\p{B|A} \]
%
Which means there are two ways to compute the conjunction.
If you have $\p{A}$, you multiply by the conditional
probability $\p{B|A}$.
Or you can do it the other way around; if you
know $\p{B}$, you multiply by $\p{A|B}$.

Finally we divide through by $\p{B}$:
%
\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} \]
%
And that's Bayes's theorem! It might not look like much, but
it turns out to be surprisingly powerful.

For example, we can use it to solve the cookie problem. I'll write
$B_1$ for the hypothesis that the cookie came from Bowl 1
and $V$ for the vanilla cookie. Plugging in Bayes's theorem
we get
%
\[ \p{B_1|V} = \frac{\p{B_1}~\p{V|B_1}}{\p{V}} \]
%
The term on the left is what we want: the probability of Bowl 1, given
that we chose a vanilla cookie. The terms on the right are:

\begin{itemize}

\item $\p{B_1}$: This is the probability that we chose Bowl 1, unconditioned by what kind of cookie we got. Since the problem says we chose a bowl at random, we can assume $\p{B_1} = 1/2$.

\item $\p{V|B_1}$: This is the probability of getting a vanilla cookie
from Bowl 1, which is 3/4.

\item $\p{V}$: This is the probability of drawing a vanilla cookie from
either bowl. Since we had an equal chance of choosing either bowl
and the bowls contain the same number of cookies, we had the same
chance of choosing any cookie. Between the two bowls there are
50 vanilla and 30 chocolate cookies, so $\p{V} = 5/8$.

\end{itemize}

Putting it together, we have
%
\[ \p{B_1|V} = \frac{(1/2)~(3/4)}{5/8} \]
%
which reduces to 3/5. So the vanilla cookie is evidence in favor of
the hypothesis that we chose Bowl 1, because vanilla cookies are more
likely to come from Bowl 1.
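
The arithmetic for the cookie problem is easy to check in Python. This sketch uses \py{Fraction} from the standard library so the results are exact; the variable names are chosen for illustration.

```python
from fractions import Fraction

# Prior: we chose a bowl at random
prior_bowl1 = Fraction(1, 2)

# Likelihood: Bowl 1 contains 30 vanilla cookies out of 40
like_vanilla_bowl1 = Fraction(30, 40)

# Total probability of vanilla: 50 vanilla cookies out of 80
prob_vanilla = Fraction(50, 80)

# Bayes's theorem
posterior_bowl1 = prior_bowl1 * like_vanilla_bowl1 / prob_vanilla
print(posterior_bowl1)  # 3/5
```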

\index{evidence}

This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from $\p{B|A}$ to $\p{A|B}$. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.
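
The same strategy settles the two-coin problem from the beginning of the chapter. As a sketch, let $A$ mean that both coins are heads and $B$ mean that at least one coin is heads. Then $\p{A} = 1/4$; $\p{B|A} = 1$, because two heads guarantee at least one; and $\p{B} = 3/4$, because three of the four equally likely outcomes include a head. Plugging in,
%
\[ \p{A|B} = \frac{\p{A}~\p{B|A}}{\p{B}} = \frac{(1/4)~(1)}{3/4} = \frac{1}{3} \]
%
which agrees with the answer we got by counting.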
816
817
818
\section{The diachronic interpretation}
819
820
There is another way to think of Bayes's theorem: it gives us a
821
way to update the probability of a hypothesis, $H$, in light of
822
some body of data, $D$.
823
824
\index{diachronic interpretation}
825
826
This way of thinking about Bayes's theorem is called the
827
{\bf diachronic interpretation}. ``Diachronic'' means that something
828
is happening over time; in this case, the probability of the hypotheses changes over time as we see new data.
829
830
Rewriting Bayes's theorem with $H$ and $D$ yields:
831
%
832
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
833
%
834
In this interpretation, each term has a name:

\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}

\begin{itemize}

\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.

\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.

\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.

\item \p{D} is the {\bf total probability of the data}, under any hypothesis.

\end{itemize}

Sometimes we can compute the prior based on background information. For example, the cookie problem specifies that we choose a bowl at random with equal probability.

In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.

\index{subjective prior}

The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.

Computing the total probability of the data can be tricky. It is supposed to be the probability of seeing the data under any hypothesis at all, but in the most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses
that are:

\index{mutually exclusive}
\index{collectively exhaustive}

\begin{description}

\item[Mutually exclusive:] At most one hypothesis in
the set can be true, and

\item[Collectively exhaustive:] There are no other
possibilities; at least one of the hypotheses has to be true.

\end{description}

In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive.

\index{total probability}

In that case we can compute \p{D} using the law of total probability,
which says that if there are two exclusive ways that something
might happen, you can add up the probabilities like this:
%
\[ \p{D} = \p{B_1}~\p{D|B_1} + \p{B_2}~\p{D|B_2} \]
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier by mentally combining the two
bowls.
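
The same rule extends to any number of hypotheses. As a sketch, writing $H_1, \ldots, H_n$ for a set of mutually exclusive and collectively exhaustive hypotheses (the subscripted names are notation introduced here for illustration, not part of the cookie problem), the total probability of the data is
%
\[ \p{D} = \sum_{i=1}^{n} \p{H_i}~\p{D|H_i} \]
%
With $n=2$, $H_1 = B_1$, and $H_2 = B_2$, this reduces to the two-term sum for the cookie problem.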

\section{Bayes Tables}

In the cookie problem we can compute the probability of the data directly, but that's not always the case. In fact, computing the total probability of the data is often the hardest part of the problem.

Fortunately, there is another way to solve problems like this that makes it easier: the Bayes table.

You can write a Bayes table on paper or use a spreadsheet, but for this example I'll use a Pandas DataFrame.

First I'll make an empty DataFrame with one row for each hypothesis:

\begin{code}
import pandas as pd
table = pd.DataFrame(index=['Bowl 1', 'Bowl 2'])
\end{code}

Then I'll add columns for the prior probabilities and likelihoods.

\begin{code}
table['prior'] = 1/2, 1/2
table['likelihood'] = 3/4, 1/2
\end{code}

This table shows the results so far:

\input{tables/table01-01}

If we multiply the priors by the likelihoods, the results are {\bf unnormalized posteriors}; they are proportional to the posterior probabilities, but they don't add up to 1.

We can normalize them by computing the total probability of the data and dividing through.

\begin{code}
table['unnorm'] = table['prior'] * table['likelihood']
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
\end{code}

The following table shows the result:

\input{tables/table01-02}

The posterior probability for Bowl 1 is 0.6, which is what we got using Bayes's Theorem. As a bonus, we also get the posterior probability for Bowl 2, which is 0.4.

\section{The Dice Problem}
\label{dice}

Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?

In this example, there are three hypotheses with equal prior probabilities.
The data is my report that the outcome is a 1.
Under the hypothesis that I chose the 6-sided die, the probability of the data is 1/6.
If I chose the 8-sided die, the probability is 1/8, and if I chose the 12-sided die, it's 1/12.

Plugging the priors and likelihoods into a Bayes table, I get these results:

\input{tables/table01-03}

The posterior probability that I chose the 6-sided die is $4/9$.

As this example demonstrates, the table method works with more than two hypotheses.
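
The table for this problem is loaded from a separate file, so as a rough sketch of the arithmetic behind it, the following standalone Python uses exact fractions from the standard library rather than a Pandas DataFrame. The function name \py{bayes_table} and the dictionary layout are assumptions made for this illustration; they are not part of the book's code.

```python
from fractions import Fraction

def bayes_table(priors, likelihoods):
    """Return (posteriors, prob_data) for dicts keyed by hypothesis."""
    # Unnormalized posteriors: prior times likelihood, one per hypothesis.
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    # The total probability of the data is the sum of the unnormalized values.
    prob_data = sum(unnorm.values())
    posteriors = {h: u / prob_data for h, u in unnorm.items()}
    return posteriors, prob_data

# Hypotheses are the number of sides; each die is equally likely a priori.
priors = {sides: Fraction(1, 3) for sides in (6, 8, 12)}
# Probability of rolling a 1 on a die with the given number of sides.
likelihoods = {sides: Fraction(1, sides) for sides in priors}

posteriors, prob_data = bayes_table(priors, likelihoods)
for sides, prob in posteriors.items():
    print(sides, prob)
```

With exact fractions, the posterior for the 6-sided die comes out to $4/9$, and the same function works unchanged for any number of hypotheses.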

\section{The Monty Hall Problem}

\index{Monty Hall problem}

Monty Hall was the original host of the game show {\em Let's Make a
Deal}.
The Monty Hall problem is based on one of the regular
games on the show.
If you are a contestant, here's how the game works:

\begin{itemize}

\item Monty shows you three closed doors numbered 1, 2, and 3.
He tells you that there is a prize behind each door.

\item One prize is valuable (traditionally a car); the other two are less valuable (traditionally goats).

\item The object of the game is to guess which door has the car.
If you guess right, you get to keep the car.

\end{itemize}

Suppose you pick Door 1.
Before opening the door you chose, Monty opens Door 3 and reveals a
goat.
Then Monty offers you the option to stick with your original
choice or switch to the remaining unopened door.

To maximize your chance of winning the car, should you stick with Door 1 or switch to Door 2?

To answer this question, we have to make some assumptions about the behavior of the host:

\begin{enumerate}

\item Monty always opens a door and offers you the option to switch.

\item He never opens the door you picked or the door with the car.

\item If you choose the door with the car, he chooses one of the other doors at random.

\end{enumerate}

Under these assumptions, you are better off switching.
If you stick, you win $1/3$ of the time.
If you switch, you win $2/3$ of the time.

If you have not encountered this problem before, you might find the answer surprising.
You would not be alone; many people have the strong intuition that it doesn't matter whether you stick or switch.
There are two doors left, they reason, so the chance that the car
is behind Door 1 is 50\%.
But that is wrong.

To see why, it might help to use a Bayes table.
We start with three hypotheses: the car might be behind Door 1, 2, or 3.
According to the statement of the problem, the prior probability for each door is 1/3.

The data is that Monty opened Door 3 and revealed a goat.
So let's consider the probability of the data under each hypothesis:

\begin{itemize}

\item If the car were behind Door 3, Monty would not have opened it, so the probability of the data under this hypothesis is 0.

\item If the car were behind Door 2, Monty would have to open Door 3, so the probability of the data under this hypothesis is 1.

\item If the car were behind Door 1, Monty would choose Door 2 or 3 at random; the probability he would open Door 3 is $1/2$.

\end{itemize}

Once we figure out prior probabilities and likelihoods, the Bayes table does the rest. Here is the result:

\input{tables/table01-04}

After Monty opens Door 3, the posterior probability of Door 1 is $1/3$; the posterior probability of Door 2 is $2/3$.
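
If the result still seems hard to believe, one way to check it is by simulation. This is a different technique from the Bayes table: the sketch below plays the game many times under the three assumptions about the host, keeps only the games where Monty opens Door 3, and counts how often the car is behind Door 2. The function name \py{simulate_monty}, the number of trials, and the random seed are all choices made for this illustration.

```python
import random

def simulate_monty(trials=100_000, seed=1):
    """Estimate P(car behind Door 2 | you pick Door 1, Monty opens Door 3)."""
    rng = random.Random(seed)
    opened_3 = 0       # games in which Monty opened Door 3
    car_behind_2 = 0   # of those, games in which the car was behind Door 2
    for _ in range(trials):
        car = rng.choice([1, 2, 3])
        # You always pick Door 1; Monty never opens Door 1 or the car's door.
        if car == 1:
            opened = rng.choice([2, 3])   # free choice, made at random
        elif car == 2:
            opened = 3                    # forced: Door 2 hides the car
        else:
            opened = 2                    # forced: Door 3 hides the car
        if opened == 3:
            opened_3 += 1
            if car == 2:
                car_behind_2 += 1
    return car_behind_2 / opened_3

switch_wins = simulate_monty()
print(switch_wins)
```

The estimate varies a little with the seed and the number of trials, but it should land close to $2/3$, not $1/2$.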

\index{divide-and-conquer}

As this example shows, our intuition for probability is not always reliable.
Bayes's Theorem provides a divide-and-conquer strategy that can help:

\begin{enumerate}

\item First, write down the hypotheses and the data.

\item Next, figure out the prior probabilities.

\item Finally, compute the likelihood of the data under each hypothesis.

\end{enumerate}

The Bayes table does the rest.

\section{Summary}

In this chapter...

In the next chapter

But first you might want to work on these exercises.

\section{Exercises}

The code for this chapter is in \py{chap01.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap01.ipynb}.

The notebook provides space where you can work on the following problems.

\begin{exercise}

Suppose you have two coins in a box.
One is a normal coin with heads on one side and tails on the other, and one is a trick coin with heads on both sides.

You choose a coin at random and see that one of the sides is heads.
What is the probability that you chose the trick coin?

\end{exercise}


\begin{exercise}

Suppose you meet someone and learn that they have two children.
You ask if either child is a girl and they say yes.
What is the probability that both children are girls?

Hint: Start with four equally likely hypotheses.

\end{exercise}


\begin{exercise}

There are many variations of the Monty Hall problem (see \url{https://en.wikipedia.org/wiki/Monty_Hall_problem}).

For example, suppose that Monty always chooses Door 2 if he can and
only chooses Door 3 if he has to (because the car is behind Door 2).

If you choose Door 1 and Monty opens Door 2, what is the probability the car is behind Door 3?

If you choose Door 1 and Monty opens Door 3, what is the probability the car is behind Door 2?

\end{exercise}

\newcommand{\MM}{M\&M}

\begin{exercise}

\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}

In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.

Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag. One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?

\end{exercise}

\chapter{Computational Statistics}
\label{compstat}

\section{Distributions}
\label{distributions}

In statistics a {\bf distribution} is a set of values and their
corresponding probabilities.
\index{distribution}

For example, if you toss a coin, there are two possible outcomes with approximately equal probabilities.

If you roll a six-sided die, the set of possible
values is the numbers 1 to 6, and the probability associated
with each value is 1/6.
\index{dice}

To represent distributions, we'll use a library called \py{empiricaldist}.
An ``empirical'' distribution is based on data, as opposed to a theoretical distribution.

This library provides a class called \py{Pmf}, which represents
a {\bf probability mass function}.

\index{probability mass function}
\index{Pmf class}

\py{empiricaldist} is available from the Python Package Index (PyPI).
You can download it from \url{https://pypi.org/project/empiricaldist/} or install it with \py{pip}.
For more details, see Section~\ref{codeinfo}.

To use \py{Pmf} you can import it like this:

\begin{code}
from empiricaldist import Pmf
\end{code}

The following example makes a \py{Pmf} that represents the outcome of a coin toss.

\begin{code}
coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
\end{code}

The two outcomes have the same probability, $1/2$.

This example makes a \py{Pmf} that represents the distribution
of outcomes of a six-sided die:

\begin{code}
die = Pmf()
for x in [1,2,3,4,5,6]:
    die[x] = 1
\end{code}

\py{Pmf()} creates an empty \py{Pmf} with no values.
The \py{for} loop adds the values $1$ through $6$, each with ``probability'' $1$.

In this \py{Pmf}, the probabilities don't add up to 1, so they are not really probabilities.
We can use \py{normalize} to make them add up to 1.

\begin{code}
die.normalize()
\end{code}

Another way to make a \py{Pmf} is to provide a sequence of values.

\begin{code}
die = Pmf.from_seq([1,2,3,4,5,6])
\end{code}

In this example, every value appears once, so they all have the same probability.
More generally, values can appear more than once, as in this example:

\begin{code}
letters = Pmf.from_seq(list('Mississippi'))
\end{code}

The following table shows the results.

\input{tables/table02-01}

The \py{qs} are the values or ``quantities'' in the distribution; the \py{ps} are the corresponding probabilities. In the word ``Mississippi'', about 36\% of the letters are ``s''.
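
To make the counting explicit, here is a minimal sketch of the same idea using only the standard library. It is not how \py{empiricaldist} is implemented; the function name \py{pmf_from_seq} and the use of a plain dictionary are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

def pmf_from_seq(seq):
    """Map each distinct value in seq to its relative frequency."""
    counts = Counter(seq)
    total = sum(counts.values())
    return {q: Fraction(count, total) for q, count in counts.items()}

letters = pmf_from_seq('Mississippi')
for q, prob in sorted(letters.items()):
    # Show each quantity with its exact and approximate probability.
    print(q, prob, round(float(prob), 3))
```

The letter ``s'' appears 4 times out of 11, which is where the figure of about 36\% comes from.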

The \py{Pmf} class inherits from a Pandas \py{Series}, so anything you can do with a \py{Series}, you can also do with a \py{Pmf}.

For example, you can use the bracket operator to look up a value and return the corresponding probability.

\begin{code}
letters['s']
\end{code}

However, if you ask for the probability of a value that's not in the distribution, you get a \py{KeyError}.

You can also call a \py{Pmf} as if it were a function, with a value in parentheses.

\begin{code}
letters('s')
\end{code}

If the value is in the distribution the results are the same.
But if the value is not in the distribution, the result is $0$, not an error.

As these examples show, the values in a \py{Pmf} can be integers or strings.
In general, they can be any type that can be stored in the index of a Pandas Series.

If you are familiar with Pandas, that will help you work with \py{Pmf} objects.
But I will explain what you need to know as we go along.

\section{The Cookie Problem}

In this section I'll use a \py{Pmf} to solve the cookie problem from Section~\ref{cookie}.
Here's the statement of the problem again:

\begin{quote}
Suppose there are two bowls of cookies.
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
Bowl 2 contains 20 of each.

Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla. What is
the probability that it came from Bowl 1?
\end{quote}


Here's a \py{Pmf} that represents the two hypotheses and their prior probabilities:
\index{cookie problem}

\begin{code}
prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
\end{code}

This distribution, which contains the prior probability for each hypothesis,
is called (wait for it) the {\bf prior distribution}.
\index{prior distribution}

To update the distribution based on new data (the vanilla cookie),
we multiply the priors by the likelihoods. The likelihood
of drawing a vanilla cookie from Bowl 1 is 3/4. The likelihood
for Bowl 2 is 1/2.

\begin{code}
likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
\end{code}

The result is the unnormalized posteriors.
We can use \py{normalize} to compute the posterior probabilities:

\begin{code}
posterior.normalize()
\end{code}

The return value from \py{normalize} is the total probability of the data, which is $5/8$.

Finally, we can get the posterior probability for Bowl 1:

\begin{code}
posterior('Bowl 1')
\end{code}

And the answer is 0.6.
This distribution, which contains the posterior probability for each hypothesis, is called (wait now) the {\bf posterior distribution}.
\index{posterior distribution}

One benefit of using \py{Pmf} objects is that it is easy to do successive updates with more data.
For example, suppose you put the first cookie back (so the contents of the bowls don't change) and draw again from the same bowl.
If the second cookie is also vanilla, we can do a second update like this:

\begin{code}
posterior *= likelihood_vanilla
posterior.normalize()
\end{code}

Now the posterior probability for Bowl 1 is almost 70\%.
But suppose we do the same thing again and get a chocolate cookie.
Here's the update.

\begin{code}
likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
\end{code}

Now the posterior probability for Bowl 1 is about 53\%.
After two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.
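
As a cross-check on those three numbers, the sequence of updates can be replayed with exact fractions. The sketch below uses plain dictionaries from the standard library in place of \py{Pmf} objects; the helper name \py{update} and the \py{history} list are assumptions introduced for this illustration.

```python
from fractions import Fraction

def update(dist, likelihood):
    """Multiply each probability by its likelihood, then normalize."""
    unnorm = {h: p * likelihood[h] for h, p in dist.items()}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}

likelihood_vanilla = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}
likelihood_chocolate = {'Bowl 1': Fraction(1, 4), 'Bowl 2': Fraction(1, 2)}

# Start from the uniform prior and apply vanilla, vanilla, chocolate.
posterior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
history = []
for likelihood in (likelihood_vanilla, likelihood_vanilla, likelihood_chocolate):
    posterior = update(posterior, likelihood)
    history.append(posterior['Bowl 1'])

print([float(p) for p in history])
```

The posterior probabilities for Bowl 1 after each cookie are $3/5$, $9/13$, and $9/17$, or about 60\%, 69\%, and 53\%.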

\section{More Bowls}
\label{morebowls}

Next let's solve a cookie problem with 101 bowls:

\begin{itemize}

\item Bowl 0 contains no vanilla cookies,

\item Bowl 1 contains 1\% vanilla cookies,

\item Bowl 2 contains 2\% vanilla cookies,

\end{itemize}

and so on, up to

\begin{itemize}

\item Bowl 99 contains 99\% vanilla cookies, and

\item Bowl 100 contains all vanilla cookies.

\end{itemize}

As in the previous version, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99\% chocolate, and so on.

\begin{figure}
% chap02soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig02-01.pdf}}
\caption{Prior and posterior distributions for the 101 Bowls problem.}
\label{fig02-01}
\end{figure}

Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl \py{x}, for each value of \py{x}?

To solve this problem, I'll use \py{np.arange} to represent 101 hypotheses, numbered from 0 to 100.

\begin{code}
import numpy as np
hypos = np.arange(101)
\end{code}

The result is a NumPy array, which we can use to make the prior distribution:

\begin{code}
prior = Pmf(1, hypos)
prior.normalize()
\end{code}

As this example shows, we can initialize a \py{Pmf} with two parameters.
The first parameter is the prior probability; the second parameter is a sequence of values.
Because the probabilities are all the same, we only have to provide one of them.
It gets ``broadcast'' across the hypotheses.

Since all hypotheses have the same prior probability, this distribution is {\bf uniform}.

The likelihood of the data is the fraction of vanilla cookies in each bowl, which we can calculate using \py{hypos}:

\begin{code}
likelihood_vanilla = hypos/100
\end{code}

Now we can compute the posterior distribution in the usual way:

\begin{code}
posterior1 = prior * likelihood_vanilla
posterior1.normalize()
\end{code}

Figure~\ref{fig02-01} (top) shows the prior distribution and the posterior distribution after one vanilla cookie.
Bowl 0 has been eliminated, because it contains no vanilla cookies, and Bowl 100 is the most likely.
The posterior distribution is a line because the likelihoods are proportional to the bowl numbers.

Now suppose we put the cookie back, draw again from the same bowl, and get another vanilla cookie.
Here's the update after the second cookie:

\begin{code}
posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()
\end{code}

Figure~\ref{fig02-01} (middle) shows the result.
Because the likelihood function is a line, the posterior after two cookies is a parabola.

At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.

But suppose we draw again and get a chocolate cookie.
Here's the update:

\begin{code}
likelihood_chocolate = 1 - hypos/100
posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()
\end{code}

Figure~\ref{fig02-01} (bottom) shows the result.
Now Bowl 100 has been eliminated because it contains no chocolate cookies.
But the high-numbered bowls are still more likely than the low-numbered bowls, because we have seen more vanilla cookies than chocolate.

In fact, the peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.
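
To see where Bowl 67 comes from, here is a short derivation that treats the fraction of vanilla cookies as a continuous quantity $x$ (an approximation introduced here; the computation itself uses only the 101 discrete bowls). With a uniform prior, the posterior after two vanilla cookies and one chocolate is proportional to the product of the three likelihoods:
%
\[ f(x) = x^2 (1 - x) \]
%
Setting the derivative to zero,
%
\[ f'(x) = 2x - 3x^2 = x (2 - 3x) = 0 \]
%
gives $x = 2/3$; the other root, $x = 0$, is where the posterior is zero. Of the bowls on either side of this value, Bowl 67 has the slightly higher posterior probability, since $67^2 \cdot 33 = 148137$ exceeds $66^2 \cdot 34 = 148104$.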

The quantity with the highest posterior probability is called the {\bf MAP}, which stands for ``maximum a posteriori probability'', where ``a posteriori'' is unnecessary Latin for ``posterior''.

To compute the MAP, we can use the \py{Series} method \py{idxmax}:

\begin{code}
posterior3.idxmax()
\end{code}

Or \py{Pmf} provides a more memorable name for the same thing:

\begin{code}
posterior3.max_prob()
\end{code}

As you might suspect, this example isn't really about bowls; it's about estimating proportions.
Imagine that you have one bowl of cookies.
You don't know what fraction of cookies are vanilla, but you think it is equally likely to be any fraction from 0 to 1.
If you draw three cookies and two are vanilla, what proportion of cookies in the bowl do you think are vanilla?
The posterior distribution we just computed is the answer to that question.

We'll come back to estimating proportions in the next chapter.
But first let's use a \py{Pmf} to solve the dice problem.

\section{The Dice Problem}

In Section~\ref{dice} we solved the dice problem using a Bayes table.
Here's the statement of the problem again:

\begin{quote}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
What is the probability that I chose the 6-sided die?
\end{quote}

Let's solve it again using a \py{Pmf}.
I'll use integers to represent the hypotheses:

\begin{code}
hypos = [6, 8, 12]
\end{code}

And I can make the prior distribution like this:

\begin{code}
prior = Pmf(1/3, hypos)
\end{code}

As in the previous example, the prior probability gets broadcast across the hypotheses.

Now we can compute the likelihood of the data:

\begin{code}
likelihood1 = 1/6, 1/8, 1/12
\end{code}

And use it to compute the posterior distribution.

\begin{code}
posterior = prior * likelihood1
posterior.normalize()
\end{code}

Here's the result:

\input{tables/table02-02}

The posterior probability for the 6-sided die is $4/9$.

Now suppose I roll the same die again and get a $7$.
We can do a second update like this:

\begin{code}
likelihood2 = 0, 1/8, 1/12
posterior *= likelihood2
posterior.normalize()
\end{code}

The likelihood for the 6-sided die is $0$ because it is not possible to get a 7 on a 6-sided die.
The other two likelihoods are the same as in the previous update.
And here's the result:

\input{tables/table02-03}

After rolling a 1 and a 7, the posterior probability of the 8-sided die is about 69\%.
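
As a check on that figure, here is the arithmetic written out. After the first update the posteriors for the three dice are $4/9$, $3/9$, and $2/9$; the last two follow from the same normalization that gives $4/9$. Multiplying by the second set of likelihoods and normalizing, the posterior for the 8-sided die is
%
\[ \frac{(3/9)(1/8)}{(4/9)(0) + (3/9)(1/8) + (2/9)(1/12)} = \frac{1/24}{1/24 + 1/54} = \frac{9}{13} \approx 0.69 \]
%
which leaves $4/13$, or about 31\%, for the 12-sided die.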

\section{Updating Dice}
\label{dice2}

The following function is a more general version of the update in the previous section:

\begin{code}
def update_dice(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
\end{code}

The first parameter is a \py{Pmf} that represents the possible dice and their probabilities.
The second parameter is the outcome of rolling a die.

The first line selects \py{qs} from the \py{Pmf}, which is the index of the \py{Series}; in this example, it represents the hypotheses.

Since the hypotheses are integers, we can use them to compute the likelihoods.
In general, if there are \py{n} sides on the die, the probability of any possible outcome is \py{1/n}.

However, we have to check for impossible outcomes!
If the outcome exceeds the hypothetical number of sides on the die, the probability of that outcome is $0$.

\py{impossible} is a Boolean Series that is \py{True} for each impossible die.
I use it as an index into \py{likelihood} to set the corresponding probabilities to $0$.

Finally, I multiply \py{pmf} by the likelihoods and normalize.

Here's how we can use this function to compute the updates in the previous section:

\begin{code}
pmf = prior.copy()
update_dice(pmf, 1)
update_dice(pmf, 7)
\end{code}

I start with a fresh copy of the prior distribution and use \py{update_dice} to do the updates.
The result is the same.

\section{Summary}

This chapter introduces the \py{empiricaldist} module, which provides \py{Pmf}, which we use to represent a set of hypotheses and their probabilities.

We use a \py{Pmf} to solve the cookie problem and the dice problem, which we saw in the previous chapter.
With a \py{Pmf} it is easy to perform sequential updates as we see multiple pieces of data.

We also solved a more general version of the cookie problem, with 101 bowls rather than two.
Then we computed the MAP, which is the quantity with the highest posterior probability.

In the next chapter ...

But first you might want to work on the exercises.

\section{Exercises}
\label{elvis}

The code for this chapter is in \py{chap02.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap02.ipynb}.

The notebook provides space where you can work on the following problems.


\begin{exercise}
%TODO: medical test (or maybe chapter 1)
\end{exercise}


\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7.
What is the probability that I chose the 8-sided die?
\end{exercise}


\begin{exercise}
In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die.
But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided.
I choose a die, roll it, and get a 7. What is the probability that I chose an 8-sided die?
\end{exercise}


\begin{exercise}
Suppose I have two sock drawers.
One contains equal numbers of black and white socks.
The other contains equal numbers of red, green, and blue socks.
Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair.
What is the probability that the socks are white?

For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.
\end{exercise}


\begin{exercise}
Here's a problem from {\it Bayesian Data Analysis}, which is available from \url{http://www.stat.columbia.edu/~gelman/book}:

\begin{quote}
Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?
\end{quote}

Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
\end{exercise}

\chapter{Estimation}
\label{more}

%TODO: Intro


\section{The Euro Problem}
\label{euro}

\index{Euro problem}
\index{MacKay, David}
In {\it Information Theory, Inference, and Learning Algorithms}, David MacKay poses this problem:

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

To answer that question, we'll proceed in two steps.
First we'll use the binomial distribution to see where that 7\% came from; then we'll use Bayes's Theorem to estimate the probability that this coin comes up heads.


\section{The Binomial Distribution}
\label{binomial}

Suppose we have a coin that we know is fair; if we spin it once, the possible outcomes are heads and tails with equal probability.
I'll denote these outcomes \py{H} and \py{T}.

If we spin it twice, there are four outcomes with equal probability: \py{HH}, \py{HT}, \py{TH}, and \py{TT}.

If we add up the total number of heads, there are three possible outcomes: 0, 1, or 2. The probabilities of 0 and 2 are each 25\%, and the probability of 1 is 50\%.

More generally, suppose the probability of heads is \py{p} and we spin the coin \py{n} times. What is the probability that we get a total of \py{k} heads?

The answer is given by the binomial distribution:
%
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
%
where $\binom{n}{k}$ is the {\bf binomial coefficient}, usually pronounced ``n choose k'' (see \url{https://en.wikipedia.org/wiki/Binomial_coefficient}).
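
As a minimal sketch of what the formula says, it can be evaluated directly with \py{math.comb} from the Python standard library (available in Python 3.8 and later). The function name \py{binom\_pmf} is an assumption made for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n spins with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two spins of a fair coin: probabilities of 0, 1, and 2 heads.
n, p = 2, 0.5
pmf_two_spins = [binom_pmf(k, n, p) for k in range(n + 1)]
print(pmf_two_spins)
```

For two spins of a fair coin, this reproduces the probabilities worked out by hand: 25\%, 50\%, and 25\%.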

We can compute this expression ourselves, but we can also use the SciPy function \py{binom.pmf}:

\begin{code}
import numpy as np
from scipy.stats import binom

n = 2
p = 0.5
ks = np.arange(n+1)
a = binom.pmf(ks, n, p)
\end{code}

The return value is a NumPy array.
If we put it in a \py{Pmf}, the result is the distribution of \py{k} for the given values of \py{n} and \py{p}.

\begin{code}
pmf_k = Pmf(a, ks)
\end{code}

Here's what it looks like:

\input{tables/table02-01}

We can do the same calculation with \py{n=250}; Figure~\ref{fig03-01} shows the result.

\begin{figure}
% chap03soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig03-01.pdf}}
\caption{Binomial distribution with \py{n=250} and \py{p=0.5}}
\label{fig03-01}
\end{figure}

The most likely outcome is 125, which is \py{n*p}.
But the probability of getting exactly 125 heads is only about 5\%.
The probability of getting 140 heads, as in the Euro problem, is lower, around 0.8\%, but it is still possible even if the coin is fair.

In the article MacKay quotes, the statistician says, ``If the coin were unbiased, the chance of getting a result as extreme as that would be less than 7\%''.

We can use the binomial distribution to check his math. The following function takes a PMF and computes the total probability of values greater than or equal to \py{threshold}.

\begin{code}
def ge_dist(pmf, threshold):
    ge = (pmf.index >= threshold)
    total = pmf[ge].sum()
    return total
\end{code}

We can call it like this, assuming \py{pmf_k} has been recomputed with \py{n=250}:

\begin{code}
ge_dist(pmf_k, 140)
\end{code}

Or \py{Pmf} provides a method that computes the same thing:

\begin{code}
pmf_k.ge_dist(140)
\end{code}

Either way, the probability is about 3.3\% that we get 140 heads or more.
But that's well below the 7\% the statistician quoted.

The reason is that the statistician includes all values ``as extreme as'' 140, which includes values less than or equal to 110, because 140 exceeds the expected value by 15 and 110 falls short by 15.

The probability of values less than or equal to 110 is also 3.3\%,
so the total probability of values ``as extreme'' as 140 is 6.6\%.
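
Both tail probabilities can be checked without SciPy. The sketch below is not the book's method: it counts sequences of heads and tails with exact integer arithmetic from the standard library, and the variable names \py{upper}, \py{lower}, and \py{two\_sided} are chosen for this illustration.

```python
from math import comb

n = 250
total_outcomes = 2 ** n  # number of equally likely head/tail sequences

# Count the sequences with at least 140 heads, and those with at most 110.
upper = sum(comb(n, k) for k in range(140, n + 1)) / total_outcomes
lower = sum(comb(n, k) for k in range(0, 111)) / total_outcomes
two_sided = upper + lower

print(round(upper, 3), round(lower, 3), round(two_sided, 3))
```

Because the binomial distribution with \py{p=0.5} is symmetric, the two tails are exactly equal, and their sum is the ``less than 7\%'' in the article.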
1738
1739
The point of this calculation is that these extreme values are unlikely if the coin is fair.
1740
And that's why the statistician concludes that the results are ``very suspicious''.
1741
1742
That's interesting, but it doesn't answer MacKay's question. So let's move on to the next step, estimating the proportion of heads.
1743
1744
1745
\section{Estimating Proportions}
1746
\label{estprop}
1747
1748
Any given coin has some probability of landing heads up when spun
1749
on edge; I'll call this probability \py{x}.
1750
1751
It seems reasonable to believe that \py{x} depends
1752
on physical characteristics of the coin, like the distribution
1753
of weight.
1754
1755
If a coin is perfectly balanced, we expect \py{x} to be close to 50\%, but
1756
for a lopsided coin, \py{x} might be substantially different. We can use
1757
Bayes's theorem and the observed data to estimate \py{x}.
1758
1759
For simplicity, I'll start with a uniform prior, which assumes that all values of \py{x} are equally likely.
1760
That might not be a reasonable assumption, so we'll come back and consider other priors later.
1761
1762
Here's the uniform prior:
1763
1764
\begin{code}
1765
hypos = np.arange(0, 101)
prior = Pmf(1, hypos)
1767
\end{code}
1768
1769
And here are the likelihoods:
1770
1771
\begin{code}
1772
likelihood = {
    'H': hypos/100,
    'T': 1 - hypos/100
}
1776
\end{code}
1777
1778
I put the likelihoods for heads and tails in a dictionary to make it easier to do the update.
1779
1780
To represent the data, I'll use a string where each character is \py{H} or \py{T}:
1781
1782
\begin{code}
1783
dataset = 'H' * 140 + 'T' * 110
1784
\end{code}
1785
1786
The following function does the update.
1787
1788
\begin{code}
1789
def update_euro(pmf, dataset):
    for data in dataset:
        pmf *= likelihood[data]

    pmf.normalize()
1794
\end{code}
1795
1796
The first argument is a \py{Pmf} that represents the prior.
1797
The second argument is a string of \py{H} and \py{T} characters, one for each spin.
1798
Each time through the loop, we multiply \py{pmf} by the likelihood of one outcome, heads or tails.
1799
1800
Notice that \py{normalize} is outside the loop, so the posterior distribution only gets normalized once, at the end.
1801
That's more efficient than normalizing it after each spin (although we'll see later that it can also cause problems with floating-point arithmetic).
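
To make the floating-point caveat concrete, here is a sketch under an assumed larger dataset: 1400 heads and 1100 tails, ten times the Euro data. These counts are hypothetical, chosen only to trigger the problem. With that many spins, the unnormalized products become too small to represent and underflow to zero. Adding log-likelihoods instead of multiplying likelihoods is one common workaround; it is shown here as an illustration, not as the method used in this chapter.

```python
import numpy as np

hypos = np.arange(0, 101)
x = hypos / 100

# Hypothetical dataset, ten times larger than the Euro data
heads, tails = 1400, 1100

# Multiply the likelihoods without normalizing along the way
unnorm = np.ones(len(hypos))
for _ in range(heads):
    unnorm *= x
for _ in range(tails):
    unnorm *= 1 - x

# Every product is too small to represent, so the total is zero
print(unnorm.sum())

# One common workaround: add log-likelihoods instead of multiplying
with np.errstate(divide='ignore'):
    log_like = heads * np.log(x) + tails * np.log(1 - x)

# Shift so the largest value is 0, then exponentiate and normalize
posterior = np.exp(log_like - log_like.max())
posterior /= posterior.sum()

print(hypos[np.argmax(posterior)])
```

Normalizing after every spin would also avoid the underflow, at the cost of more arithmetic.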
1802
%TODO: add forward reference
1803
1804
Here's how we do the update:
1805
1806
\begin{code}
1807
posterior = prior.copy()
update_euro(posterior, dataset)
1809
\end{code}
1810
1811
Figure~\ref{fig03-02} shows the posterior distribution of \py{x}.
1812
1813
\begin{figure}
1814
% chap03soln.ipynb
1815
\centerline{\includegraphics[width=4in]{figs/fig03-02.pdf}}
1816
\caption{Posterior distribution of \py{x} after 140 heads in 250 spins.}
1817
\label{fig03-02}
1818
\end{figure}
1819
1820
Now, it's easy to get this distribution mixed up with the previous one, but remember:
1821
1822
\begin{itemize}
1823
1824
\item Figure~\ref{fig03-01} shows the distribution of \py{k}, which is the number of heads we get with \py{n=250} and \py{p=0.5}.
1825
1826
\item Figure~\ref{fig03-02} shows the posterior distribution of \py{x}, which is the proportion of heads for the coin we observed.
1827
1828
\end{itemize}
1829
1830
The posterior distribution represents our beliefs about \py{x} after seeing the data.
1831
It indicates that values less than 40 and greater than 80 are unlikely; values between 50 and 60 are the most likely.
1832
1833
In fact, the most likely value for \py{x} is 56\%, which is the proportion of heads in the dataset, \py{140/250}.
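
The whole update can also be sketched with plain NumPy arrays. This is a stand-alone approximation of what \py{update_euro} does, assuming an array of probabilities in place of a \py{Pmf}; the name \py{map_value} is illustrative.

```python
import numpy as np

hypos = np.arange(0, 101)          # hypothetical values of x, in percent
posterior = np.ones(len(hypos))    # uniform prior, unnormalized

likelihood = {
    'H': hypos / 100,
    'T': 1 - hypos / 100,
}
dataset = 'H' * 140 + 'T' * 110

# Multiply by the likelihood of each spin, then normalize once
for data in dataset:
    posterior *= likelihood[data]
posterior /= posterior.sum()

# The quantity with the highest posterior probability
map_value = hypos[np.argmax(posterior)]
print(map_value)
```

The value it prints is the most likely value of \py{x}, expressed as a percentage.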
1834
1835
1836
\section{Triangle Prior}
1837
\label{triangle}
1838
1839
So far we've been using a uniform prior, but that might not be a reasonable choice based on what we know about coins.
1840
I can believe that if a coin is lopsided, \py{x} might deviate substantially from 50\%, but it seems unlikely that the Belgian Euro coin is so imbalanced that \py{x} is 10\% or 90\%.
1841
1842
It might be more reasonable to choose a prior that gives
1843
higher probability to values of \py{x} near 50\% and lower probability
1844
to extreme values.
1845
1846
\index{triangle distribution}
1847
1848
As an example, let's try a triangle-shaped prior.
1849
Here's the code that constructs it:
1850
1851
\begin{code}
1852
ramp_up = np.arange(50)
ramp_down = np.arange(50, -1, -1)
a = np.append(ramp_up, ramp_down)

triangle = Pmf(a, hypos, name='triangle')
triangle.normalize()
1858
\end{code}
1859
1860
\py{arange} returns a NumPy array, so we can use \py{np.append} to append \py{ramp_down} to the end of \py{ramp_up}.
1861
Then we use \py{a} and \py{hypos} to make a \py{Pmf}.
1862
1863
Figure~\ref{fig03-03} shows the result, along with the uniform distribution.
1864
1865
\begin{figure}
1866
% chap03soln.ipynb
1867
\centerline{\includegraphics[width=4in]{figs/fig03-03.pdf}}
1868
\caption{Uniform and triangle-shaped prior distributions.}
1869
\label{fig03-03}
1870
\end{figure}
1871
1872
Now we can update both priors with the same data:
1873
1874
\begin{code}
1875
uniform = Pmf(1, hypos, name='uniform')
update_euro(uniform, dataset)
update_euro(triangle, dataset)
1877
\end{code}
1878
1879
Figure~\ref{fig03-04} shows the posterior distributions.
1880
1881
\begin{figure}
1882
% chap03soln.ipynb
1883
\centerline{\includegraphics[width=4in]{figs/fig03-04.pdf}}
1884
\caption{Posterior distributions based on uniform and triangle priors.}
1885
\label{fig03-04}
1886
\end{figure}
1887
1888
The differences between the posterior distributions are barely visible, and so small they would hardly matter in practice.
1889
1890
And that's good news.
1891
To see why, imagine two people who disagree angrily about which prior is better, uniform or triangle.
1892
Each of them has reasons for their preference, but neither of them can persuade the other to change their mind.
1893
1894
But suppose they agree to use the data to update their beliefs.
1895
When they compare their posterior distributions, they find that there is almost nothing left to argue about.
1896
1897
This is an example of {\bf swamping the priors}: with enough
1898
data, people who start with different priors will tend to
1899
converge on the same posterior distribution.
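
To put a number on ``almost nothing left to argue about'', here is a sketch that compares the two posteriors using plain NumPy arrays. It computes the likelihood of the whole dataset as \py{x**140 * (1-x)**110}, which gives the same result as multiplying one spin at a time; the helper \py{normalized_posterior} is a hypothetical name.

```python
import numpy as np

hypos = np.arange(0, 101)
x = hypos / 100

# Two priors over the same hypotheses
uniform = np.ones(len(hypos))
triangle = np.append(np.arange(50), np.arange(50, -1, -1)).astype(float)

# Likelihood of 140 heads and 110 tails under each hypothesis
likelihood = x**140 * (1 - x)**110

def normalized_posterior(prior):
    """Multiply a prior by the likelihood and normalize."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

post_uniform = normalized_posterior(uniform)
post_triangle = normalized_posterior(triangle)

# Compare the posterior means and the largest pointwise difference
mean_uniform = np.sum(hypos * post_uniform)
mean_triangle = np.sum(hypos * post_triangle)
max_diff = np.max(np.abs(post_uniform - post_triangle))

print(mean_uniform, mean_triangle, max_diff)
```

The posterior means differ by a fraction of a percentage point, even though the priors look quite different.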
1900
1901
\index{swamping the priors}
1902
\index{convergence}
1903
1904
1905
\section{Binomial Likelihood}
1906
\label{binomlike}
1907
1908
So far we've been computing the updates one spin at a time, so for the Euro problem we have to do 250 updates.
1909
1910
A more efficient alternative is to compute the likelihood of the entire dataset at once.
1911
For each hypothetical value of \py{x}, we have to compute the probability of getting 140 heads out of 250 spins.
1912
1913
Well, we know how to do that; this is the question the binomial distribution answers.
1914
If the probability of heads is $p$, the probability of $k$ heads in $n$ spins is:
1915
%
1916
\[ P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k} \]
1917
%
1918
And we can use SciPy to compute it.
1919
The following function takes a \py{Pmf} that represents a prior distribution and a tuple of integers, \py{k} and \py{n}:
1920
1921
\begin{code}
1922
from scipy.stats import binom

def update_binomial(pmf, data):
    k, n = data
    # the quantities are percentages, so convert them to probabilities
    xs = pmf.qs / 100
    likelihood = binom.pmf(k, n, xs)
    pmf *= likelihood
    pmf.normalize()
1930
\end{code}
1931
1932
It extracts the hypothetical values of \py{x} from the \py{Pmf} and passes them to \py{binom.pmf}, which computes the binomial PMF for the given values of \py{k} and \py{n}, and all values of \py{x}.
1933
1934
Here's how we use it:
1935
1936
\begin{code}
1937
uniform2 = Pmf(1, hypos)
data = 140, 250
update_binomial(uniform2, data)
1940
\end{code}
1941
1942
The result is the same as in Section~\ref{estprop} except for a small floating-point round-off.
1943
But it's much more efficient.
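
The two approaches agree because the binomial coefficient is the same for every hypothesis, so it drops out when we normalize. Here is a sketch that checks this with plain NumPy arrays instead of \py{Pmf} objects; the names \py{one_at_a_time} and \py{all_at_once} are illustrative.

```python
import numpy as np
from scipy.stats import binom

hypos = np.arange(0, 101)
x = hypos / 100

# One spin at a time: 140 heads, then 110 tails
one_at_a_time = np.ones(len(hypos))
for _ in range(140):
    one_at_a_time *= x
for _ in range(110):
    one_at_a_time *= 1 - x
one_at_a_time /= one_at_a_time.sum()

# All at once, using the binomial PMF
all_at_once = binom.pmf(140, 250, x)
all_at_once /= all_at_once.sum()

print(np.allclose(one_at_a_time, all_at_once))
```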
1944
1945
1946
\section{Bayesian Statistics}
1947
1948
You might have noticed similarities between the Euro problem and the 101 bowls problem in Section~\ref{morebowls}.
1949
The prior distributions are the same, the likelihoods are the same, and with the same data the results would be the same.
1950
1951
But there are two differences.
1952
1953
The first is the choice of the prior.
1954
In the 101 bowls problem, the uniform prior is implied by the statement of the problem, which says that we choose one of the bowls at random with equal probability.
1955
1956
In the Euro problem, the choice of the prior is subjective; that is, reasonable people could disagree, maybe because they have different information about coins or because they interpret the same information differently.
1957
1958
Because the priors are subjective, the posteriors are subjective, too.
1959
And some people find that problematic.
1960
1961
The other difference is the nature of what we are estimating.
1962
In the 101 bowls problem, we choose the bowl randomly, so it is uncontroversial to compute the probability of choosing each bowl.
1963
In the Euro problem, the proportion of heads is a physical property of a given coin.
1964
Under some interpretations of probability, that's a problem because physical properties are not considered random.
1965
1966
As an example, consider the age of the universe.
1967
Currently, our best estimate is 13.80 billion years, but it might be off by 0.02 billion years in either direction (see \url{https://en.wikipedia.org/wiki/Age_of_the_universe}).
1968
1969
Now suppose we would like to know the probability that the age of the universe is actually greater than 13.81 billion years.
1970
Under some interpretations of probability, we would not be able to answer that question.
1971
We would be required to say something like, ``The age of the universe is not a random quantity, so it has no probability of exceeding a particular value.''
1972
1973
Under the Bayesian interpretation of probability, it is meaningful and useful to treat physical quantities as if they were random and compute probabilities about them.
1974
1975
In the Euro problem, the prior distribution represents what we believe about coins in general and the posterior distribution represents what we believe about a particular coin after seeing the data.
1976
So we can use the posterior distribution to compute probabilities about the coin and its proportion of heads.
1977
1978
The subjectivity of the prior and the interpretation of the posterior are key differences between Bayes's Theorem and Bayesian statistics.
1979
1980
Bayes's Theorem is a mathematical law of probability; no reasonable person objects to it.
1981
But Bayesian statistics is surprisingly controversial.
1982
Historically, many people have been bothered by its subjectivity and its use of probability for things that are not random.
1983
1984
If you are interested in this history, I recommend Sharon Bertsch McGrayne's book, {\it The Theory That Would Not Die} (\url{https://yalebooks.yale.edu/book/9780300188226/theory-would-not-die}).
1985
1986
\index{McGrayne, Sharon Bertsch}
1987
\index{The Theory That Would Not Die}
1988
1989
%TODO: Italicize the index entry
1990
1991
\section{Summary}
1992
1993
In this chapter I posed David MacKay's Euro problem and we started to solve it.
1994
Given the data, we computed the posterior distribution for \py{x}, the probability a Euro coin comes up heads.
1995
1996
We tried two different priors, updated them with the same data, and found that the posteriors were nearly the same.
1997
This is good news, because it suggests that if two people start with different beliefs and see the same data, their beliefs tend to converge.
1998
1999
We also introduced the binomial distribution, which we used to compute the posterior distribution more efficiently.
2000
And I discussed the difference between applying Bayes's Theorem, as in the 101 bowls problem, and computing Bayesian statistics, as in the Euro problem.
2001
2002
\index{convergence}
2003
2004
However, we still haven't answered MacKay's question: ``Do these data give evidence that the coin is biased rather than fair?''
2005
I'm going to leave this question hanging a little longer; we'll come back to it in Chapter~\ref{hypotest}.
2006
2007
In the next chapter, I want to get back to the dice problem.
2008
2009
\section{Exercises}
2010
2011
The code for this chapter is in \py{chap03.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
2012
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap03.ipynb}.
2013
2014
The notebook provides space where you can work on the following problems.
2015
2016
2017
\begin{exercise}
2018
In Major League Baseball, most players have a batting average between 200 and 330, which means that the probability of getting a hit is between 0.2 and 0.33.
2019
2020
Suppose a new player appearing in his first game gets 3 hits out of 3 attempts. What is the posterior distribution for his probability of getting a hit?
2021
\end{exercise}
2022
2023
2024
\begin{exercise}
2025
Whenever you survey people about sensitive issues, you have to deal with ``social desirability bias'', which is the tendency of people to shade their answers to show themselves in the most positive light (see \url{https://en.wikipedia.org/wiki/Social_desirability_bias}).
2026
2027
One of the ways to improve the accuracy of the results is ``randomized response'' (see \url{https://en.wikipedia.org/wiki/Randomized_response}).
2028
2029
As an example, suppose you ask 100 people to flip a coin and:
2030
2031
\begin{itemize}
2032
2033
\item If they get heads, they report YES.
2034
2035
\item If they get tails, they honestly answer the question ``Do you cheat on your taxes?''
2036
2037
\end{itemize}
2038
2039
And suppose you get 80 YESes and 20 NOs. Based on this data, what is the posterior distribution for the fraction of people who cheat on their taxes? What is the most likely value in the posterior distribution?
2040
\end{exercise}
2041
2042
2043
\begin{exercise}
2044
Suppose that instead of observing coin spins directly, you measure the outcome using an instrument that is not always correct. Specifically, suppose the probability is \py{y=0.2} that an actual heads is reported
2045
as tails, or actual tails reported as heads.
2046
2047
If we spin a coin 250 times and the instrument reports 140 heads, what is the posterior distribution of \py{x}?
2048
2049
What happens as you vary the value of \py{y}?
2050
\end{exercise}
2051
2052
2053
\begin{exercise}
2054
In preparation for an alien invasion, the Earth Defense League (EDL) has been working on new missiles to shoot down space invaders. Of course, some missile designs are better than others; let's assume that each design has some probability of hitting an alien ship, \py{x}.
2055
2056
Based on previous tests, the distribution of \py{x} in the population of designs is approximately uniform between 0.1 and 0.4.
2057
2058
Now suppose the new ultra-secret Alien Blaster 9000 is being tested. In a press conference, an EDL general reports that the new design has been tested twice, taking two shots during each test. The results of the test are confidential, so the general won't say how many targets were hit, but they report: ``The same number of targets were hit in the two tests, so we have reason to think this new design is consistent.''
2059
2060
Is this data good or bad; that is, does it increase or decrease your estimate of \py{x} for the Alien Blaster 9000?
2061
2062
Hint: If the probability of hitting each target is $x$, the probability of hitting one target in both tests is $[2x(1-x)]^2$.
2063
\end{exercise}
2064
2065
2066
\chapter{More Estimation}
2067
\label{estimation}
2068
2069
\section{The train problem}
2070
2071
\index{train problem}
2072
\index{Mosteller, Frederick}
2073
\index{German tank problem}
2074
2075
I found the train problem
in Frederick Mosteller's {\it Fifty Challenging Problems in
Probability with Solutions} (\url{https://store.doverpublications.com/0486653552.html}):
2078
2079
\begin{quote}
2080
``A railroad numbers its locomotives in order $1..N$. One day you see a
2081
locomotive with the number 60. Estimate how many locomotives the
2082
railroad has.''
2083
\end{quote}
2084
2085
Based on this observation, we know the railroad has 60 or more
2086
locomotives. But how many more? To apply Bayesian reasoning, we
2087
can break this problem into two steps:
2088
2089
\begin{enumerate}
2090
2091
\item What did we know about $N$ before we saw the data?
2092
2093
\item For any given value of $N$, what is the likelihood of
2094
seeing the data (a locomotive with number 60)?
2095
2096
\end{enumerate}
2097
2098
The answer to the first question is the prior. The answer to the
2099
second is the likelihood.
2100
2101
\begin{figure}
2102
% train.py
2103
\centerline{\includegraphics[height=2.5in]{figs/train1.pdf}}
2104
\caption{Posterior distribution for the locomotive problem, based
2105
on a uniform prior.}
2106
\label{fig.train1}
2107
\end{figure}
2108
2109
We don't have much basis to choose a prior, so we'll start with
2110
something simple and then consider alternatives.
2111
Let's assume that $N$ is equally likely to be any value from 1 to 1000.
2112
2113
\begin{code}
2114
hypos = np.arange(1, 1001)
prior = Pmf(1, hypos)
2116
\end{code}
2117
2118
Now let's figure out the likelihood of the data.
2119
In a hypothetical fleet of $N$ locomotives, what is the probability that we would see number 60?
2120
If we assume that we are equally likely to see any locomotive, the chance of seeing any particular one is $1/N$.
2121
2122
Here's the function that does the update:
2123
2124
\begin{code}
2125
def update_train(pmf, data):
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()
2132
\end{code}
2133
2134
The first parameter is a \py{Pmf} that represents the possible values of $N$ and their probabilities.
2135
The second parameter is the number of the train we observed.
2136
2137
This function might look familiar; it is the same as the update function for the dice problem in Section~\ref{dice2}.
2138
2139
\index{dice problem}
2140
2141
Here's the update:
2142
2143
\begin{code}
2144
data = 60
posterior = prior.copy()
update_train(posterior, data)
2147
\end{code}
2148
2149
Figure~\ref{fig04-01} shows the results.
2150
2151
\begin{figure}
2152
% chap04soln.ipynb
2153
\centerline{\includegraphics[width=4in]{figs/fig04-01.pdf}}
2154
\caption{Posterior distribution of the number of trains, $N$, after seeing train number 60.}
2155
\label{fig04-01}
2156
\end{figure}
2157
2158
Not surprisingly, all values of $N$ below 60 have been eliminated.
2159
2160
The most likely value, if you had to guess, is 60.
2161
That might not seem like a very good guess; after all, what are the chances that you just happened to see the train with the highest number?
2162
Nevertheless, if you want to maximize the chance of getting
2163
the answer exactly right, you should guess 60.
2164
2165
But maybe that's not the right goal.
2166
An alternative is to compute the mean of the posterior distribution.
2167
Given a set of possible quantities, $q_i$, and their probabilities, $p_i$, the mean of the distribution is:
2168
%
2169
\[ \mathrm{mean} = \sum_i p_i q_i \]
2170
%
2171
Which we can compute like this:
2172
2173
\begin{code}
2174
np.sum(posterior.ps * posterior.qs)
2175
\end{code}
2176
2177
Or we can use the method provided by \py{Pmf}:
2178
2179
\begin{code}
2180
posterior.mean()
2181
\end{code}
2182
2183
The mean of the posterior is 333, so that might be a good guess if you want to minimize error.
2184
If you played this guessing game over and over, using the mean of the posterior as your estimate would minimize the mean squared error over the long run (see \url{http://en.wikipedia.org/wiki/Minimum_mean_square_error}).
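
The claim about mean squared error can be checked numerically. The following sketch rebuilds the posterior with plain NumPy, under the same uniform prior from 1 to 1000, and computes the expected squared error of every possible integer guess; the names \py{guesses}, \py{mse}, and \py{best_guess} are illustrative.

```python
import numpy as np

hypos = np.arange(1, 1001)

# Posterior after seeing train 60, starting from a uniform prior
posterior = np.where(hypos >= 60, 1 / hypos, 0.0)
posterior /= posterior.sum()

mean = np.sum(posterior * hypos)

# Expected squared error for every possible integer guess:
# rows are guesses, columns are hypothetical values of N
guesses = hypos
errors = guesses[:, None] - hypos[None, :]
mse = np.sum(posterior * errors**2, axis=1)

best_guess = guesses[np.argmin(mse)]
print(mean, best_guess)
```

The guess with the smallest expected squared error is the integer closest to the posterior mean.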
2185
2186
\index{mean squared error}
2187
2188
2189
\section{What about that prior?}
2190
2191
The prior I chose in the previous section is uniform from 1 to 1000, but I offered no justification for choosing a uniform distribution or that particular upper bound.
2192
2193
\index{prior distribution}
2194
2195
We might wonder whether the posterior distribution is sensitive to the prior.
2196
With so little data---only one observation---it is:
2197
2198
\begin{itemize}
2199
2200
\item With a uniform prior from 1 to 500, the posterior mean is 207.
2201
2202
\item With an upper bound of 1000, it's 333.
2203
2204
\item With an upper bound of 2000, it's 552.
2205
2206
\end{itemize}
2207
2208
So that's bad.
2209
When the posterior is sensitive to the prior, there are two ways to proceed:
2210
2211
\begin{itemize}
2212
2213
\item Get more data.
2214
2215
\item Get more background information and choose a better prior.
2216
2217
\end{itemize}
2218
2219
With more data, posterior distributions based on different
2220
priors tend to converge.
2221
For example, suppose that in addition
2222
to train 60 we also see trains 30 and 90.
2223
We can update the distribution like this:
2224
2225
\begin{code}
2226
for data in [30, 60, 90]:
    update_train(pmf, data)
2228
\end{code}
2229
2230
With these data, the means of the posteriors are:

\begin{tabular}{r r}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
500 & 152 \\
1000 & 164 \\
2000 & 171 \\
\bottomrule
\end{tabular}
2242
2243
The differences are smaller, but apparently three trains is not enough for the posteriors to converge.
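
The sensitivity analysis in this section can be sketched as a single function that takes the upper bound and the dataset. This version uses plain NumPy arrays rather than \py{Pmf}, and the function name \py{posterior_mean} is hypothetical.

```python
import numpy as np

def posterior_mean(upper_bound, dataset):
    """Posterior mean of N, starting from a uniform prior on 1..upper_bound."""
    hypos = np.arange(1, upper_bound + 1)
    posterior = np.ones(len(hypos))
    for data in dataset:
        # Likelihood is 1/N when N >= data, and 0 otherwise
        likelihood = np.where(hypos >= data, 1 / hypos, 0.0)
        posterior *= likelihood
    posterior /= posterior.sum()
    return np.sum(posterior * hypos)

for high in [500, 1000, 2000]:
    one_train = posterior_mean(high, [60])
    three_trains = posterior_mean(high, [30, 60, 90])
    print(high, round(one_train), round(three_trains))
```

Each row of output shows an upper bound, the posterior mean after one train, and the posterior mean after three trains.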
2244
2245
2246
\section{Another prior}
2247
2248
\begin{figure}
2249
% train.py
2250
\centerline{\includegraphics[height=2.5in]{figs/train4.pdf}}
2251
\caption{Posterior distribution based on a power law prior,
2252
compared to a uniform prior.}
2253
\label{fig.train4}
2254
\end{figure}
2255
2256
If more data are not available, another option is to improve the
2257
priors by gathering more background information.
2258
It is probably not reasonable to assume that a train-operating company with 1000 locomotives is just as likely as a company with only 1.
2259
2260
With some effort, we could probably find a list of companies that
2261
operate locomotives in the area of observation.
2262
Or we could interview an expert in rail shipping to gather information about the typical size of companies.
2263
2264
But even without getting into the specifics of railroad economics, we
2265
can make some educated guesses.
2266
In most fields, there are many small
2267
companies, fewer medium-sized companies, and only one or two very
2268
large companies.
2269
In fact, the distribution of company sizes tends to
2270
follow a power law, as Robert Axtell reports in {\it Science} (see
2271
\url{https://sci-hub.tw/10.1126/science.1062081}).
2272
2273
% \url{http://www.sciencemag.org/content/293/5536/1818.full.pdf}
2274
2275
\index{power law}
2276
\index{Axtell, Robert}
2277
2278
This law suggests that if there are 1000 companies with fewer than
2279
10 locomotives, there might be 100 companies with 100 locomotives,
2280
10 companies with 1000, and possibly one company with 10,000 locomotives.
2281
2282
Mathematically, a power law means that the number of companies
with a given size is inversely proportional to a power of that size, or
2284
%
2285
\[ \PMF(N) \sim \left( \frac{1}{N} \right)^{\alpha} \]
2286
%
2287
where $\PMF(N)$ is the probability mass function of $N$ and $\alpha$ is
2288
a parameter that is often near 1.
2289
2290
We can construct a power law prior like this:
2291
2292
\begin{code}
2293
alpha = 1.0
hypos = np.arange(1, 1001)
ps = hypos**(-alpha)
power = Pmf(ps, hypos, name='power law')
power.normalize()
2298
\end{code}
2299
2300
Again, the upper bound is arbitrary, but with a power law prior, the posterior is less sensitive to this choice.
2301
2302
\begin{figure}
2303
% chap04soln.ipynb
2304
\centerline{\includegraphics[width=4in]{figs/fig04-02.pdf}}
2305
\caption{Posterior distributions for the uniform and power law priors
2306
after seeing train 60.}
2307
\label{fig04-02}
2308
\end{figure}
2309
2310
Figure~\ref{fig04-02} shows the new posterior based on the power law prior, compared to the posterior based on the uniform prior, both after seeing train number 60.
2311
2312
With the power law prior, the posterior is less sensitive to the choice of the upper bound.
2313
If we observe trains 30, 60, and 90, the means of the posteriors are:

\begin{tabular}{rr}
\toprule
Upper & Posterior \\
Bound & Mean \\
\midrule
500 & 131 \\
1000 & 133 \\
2000 & 134 \\
\bottomrule
\end{tabular}
2325
2326
Now the differences are much smaller. In fact,
2327
with an arbitrarily large upper bound, the mean converges on 134.
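
Here is a sketch of the same experiment with the power law prior, again using plain NumPy arrays. The function name \py{power_law_posterior_mean} is hypothetical, and the upper bound of 100,000 is an arbitrary stand-in for ``arbitrarily large''.

```python
import numpy as np

def power_law_posterior_mean(upper_bound, dataset, alpha=1.0):
    """Posterior mean of N, starting from a power law prior on 1..upper_bound."""
    hypos = np.arange(1, upper_bound + 1)
    posterior = hypos ** (-alpha)
    for data in dataset:
        # Likelihood is 1/N when N >= data, and 0 otherwise
        likelihood = np.where(hypos >= data, 1 / hypos, 0.0)
        posterior = posterior * likelihood
    posterior /= posterior.sum()
    return np.sum(posterior * hypos)

dataset = [30, 60, 90]
means = {high: power_law_posterior_mean(high, dataset)
         for high in [500, 1000, 2000, 100_000]}
for high, mean in means.items():
    print(high, round(mean))
```

The means barely change as the upper bound grows, which is the behavior we want from a prior whose bound is arbitrary.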
2328
2329
So the power law prior is more realistic, because it is based on
2330
general information about the size of companies, and it behaves better in practice.
2331
2332
2333
\section{Credible intervals}
2334
\label{credible}
2335
2336
So far we have seen two ways to summarize a posterior distribution: the value with the highest posterior probability (the MAP) and the posterior mean.
2337
These are both {\bf point estimates}, that is, single values that estimate the quantity we are interested in.
2338
2339
Another way to summarize a posterior distribution is with percentiles.
2340
If you have taken a standardized test, you might be familiar with percentiles.
2341
For example, if your score is the 90th percentile, that means you did as well as or better than 90\% of the people who took the test.
2342
2343
If we are given a value, \py{x}, we can compute its {\bf percentile rank} by finding all values less than or equal to \py{x} and adding up their probabilities.
2344
\py{Pmf} provides a method that does this computation.
2345
So, for example, we can compute the probability that the company has 100 trains or fewer:

\begin{code}
power.le_dist(100)
\end{code}

With a power law prior and a dataset of three trains, the result is about 29\%.
So 100 trains is the 29th percentile.
2353
2354
Going the other way, suppose we want to compute a particular percentile; for example, the median of a distribution is the 50th percentile.
We can compute it by adding up probabilities until the total reaches or exceeds 0.5.
2356
Here's a function that does it:
2357
2358
\begin{code}
2359
def quantile(pmf, prob):
    total = 0
    for q, p in pmf.items():
        total += p
        if total >= prob:
            return q
    return np.nan
2366
\end{code}
2367
2368
\py{pmf} represents a normalized distribution.
2369
\py{prob} is the probability of the percentile we want to compute.
2370
2371
The loop uses \py{items}, which iterates over the quantities and probabilities in the distribution.
2372
Inside the loop we add up the probabilities of the quantities in order.
2373
When the total equals or exceeds \py{prob}, we return the corresponding quantity.
2374
2375
This function is called \py{quantile} because it computes a quantile rather than a percentile.
2376
The difference is the way we specify \py{prob}.
2377
If \py{prob} is a percentage between 0 and 100, we call the corresponding quantity a percentile.
2378
If \py{prob} is a probability between 0 and 1, we call the corresponding quantity a {\bf quantile}.
2379
2380
Here's how we can use this function to compute the median of the posterior distribution:
2381
2382
\begin{code}
2383
quantile(power, 0.5)
2384
\end{code}
2385
2386
The result, 113 trains, is the median of the posterior distribution.
2387
2388
\py{Pmf} provides a method called \py{quantile} that does the same thing.
2389
We can call it like this to compute the 5th and 95th percentiles:
2390
2391
\begin{code}
2392
power.quantile([0.05, 0.95])
2393
\end{code}
2394
2395
The result is the interval from 91 to 242 trains, which implies:
2396
2397
\begin{itemize}
2398
2399
\item The probability is 5\% that the number of trains is less than or equal to 91.
2400
2401
\item The probability is 5\% that the number of trains is greater than 242.
2402
2403
\end{itemize}
2404
2405
Therefore the probability is 90\% that the number of trains falls between 91 and 242 (excluding 91 and including 242).
2406
For this reason, this interval is called a 90\% {\bf credible interval}.
2407
2408
\py{Pmf} also provides \py{credible_interval}, which computes an interval that contains the given probability.
2409
2410
\begin{code}
2411
power.credible_interval(0.9)
2412
\end{code}
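
For readers who want to see the arithmetic behind these methods, here is a sketch that computes the same quantiles from a cumulative sum. It rebuilds the posterior with plain NumPy, assuming the power law prior with \py{alpha=1}, an upper bound of 1000, and the three observed trains; it is not the implementation \py{Pmf} uses.

```python
import numpy as np

hypos = np.arange(1, 1001)

# Power law prior (alpha = 1) updated with trains 30, 60, and 90
posterior = 1 / hypos
for data in [30, 60, 90]:
    posterior = posterior * np.where(hypos >= data, 1 / hypos, 0.0)
posterior /= posterior.sum()

# Cumulative probabilities; each quantile is the first quantity
# whose cumulative probability reaches the target
cdf = np.cumsum(posterior)
probs = [0.05, 0.5, 0.95]
indices = np.searchsorted(cdf, probs, side='left')
low, median, high = hypos[indices]

print(low, median, high)
```

The first and last values are the bounds of the 90\% credible interval, and the middle value is the median.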
2413
2414
2415
2416
2417
\section{The German tank problem}
2418
2419
During World War II, the Economic Warfare Division of the American
2420
Embassy in London used statistical analysis to estimate German
2421
production of tanks and other equipment.\footnote{Ruggles and Brodie,
2422
``An Empirical Approach to Economic Intelligence in World War II,''
2423
{\em Journal of the American Statistical Association}, Vol. 42,
2424
No. 237 (March 1947).}
2425
2426
The Western Allies had captured log books, inventories, and repair
2427
records that included chassis and engine serial numbers for individual
2428
tanks.
2429
2430
Analysis of these records indicated that serial numbers were allocated
2431
by manufacturer and tank type in blocks of 100 numbers, that numbers
2432
in each block were used sequentially, and that not all numbers in each
2433
block were used. So the problem of estimating German tank production
2434
could be reduced, within each block of 100 numbers, to a form of the
2435
locomotive problem.
2436
2437
Based on this insight, American and British analysts produced
2438
estimates substantially lower than estimates from other forms
2439
of intelligence. And after the war, records indicated that they were
2440
substantially more accurate.
2441
2442
They performed similar analyses for tires, trucks, rockets, and other
2443
equipment, yielding accurate and actionable economic intelligence.
2444
2445
The German tank problem is historically interesting; it is also a nice
2446
example of real-world application of statistical estimation. So far
2447
many of the examples in this book have been toy problems, but it will
2448
not be long before we start solving real problems. I think it is an
2449
advantage of Bayesian analysis, especially with the computational
2450
approach we are taking, that it provides such a short path from a
2451
basic introduction to the research frontier.
2452
2453
2454
\section{Informative priors}
2455
2456
Among Bayesians, there are two approaches to choosing prior
2457
distributions. Some recommend choosing the prior that best represents
2458
background information about the problem; in that case the prior
2459
is said to be {\bf informative}. The problem with using an informative
2460
prior is that people might use different background information (or
2461
interpret it differently). So informative priors often seem subjective.
2462
\index{informative prior}
2463
2464
The alternative is a so-called {\bf uninformative prior}, which is
2465
intended to be as unrestricted as possible, in order to let the data
2466
speak for themselves. In some cases you can identify a unique prior
2467
that has some desirable property, like representing minimal prior
2468
information about the estimated quantity.
2469
\index{uninformative prior}
2470
2471
Uninformative priors are appealing because they seem more
2472
objective. But I am generally in favor of using informative priors.
2473
Why? First, Bayesian analysis is always based on
2474
modeling decisions. Choosing the prior is one of those decisions, but
2475
it is not the only one, and it might not even be the most subjective.
2476
So even if an uninformative prior is more objective, the entire analysis
2477
is still subjective.
2478
2479
\index{modeling}
2480
\index{subjectivity}
2481
\index{objectivity}
2482
2483
Also, for most practical problems, you are likely to be in one of two
2484
regimes: either you have a lot of data or not very much. If you have
2485
a lot of data, the choice of the prior doesn't matter very much;
2486
informative and uninformative priors yield almost the same results.
2487
We'll see an example like this in the next chapter.
2488
2489
But if, as in the locomotive problem, you don't have much data,
2490
using relevant background information (like the power law distribution)
2491
makes a big difference.
2492
\index{locomotive problem}
2493
2494
And if, as in the German tank problem, you have to make life-and-death
2495
decisions based on your results, you should probably use all of the
2496
information at your disposal, rather than maintaining the illusion of
2497
objectivity by pretending to know less than you do.
2498
\index{German tank problem}
2499
2500
2501
\section{Exercises}
2502
2503
The code for this chapter is in \py{chap04.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
2504
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap04.ipynb}.
2505
2506
The notebook provides space where you can work on the following problems.
2507
2508
2509
\begin{exercise}
2510
Suppose you are giving a talk in a large lecture hall and you want to estimate the number of people in the audience. There are too many to count, so you ask how many people were born on May 11 and two people raise their hands. You ask how many were born on May 23 and one person raises their hand. Finally, you ask how many were born on August 1, and no one raises their hand.
2511
2512
How many people are in the audience? What is the 90\% credible interval for your estimate? Hint: Remember the binomial distribution.
2513
\end{exercise}
2514
2515
2516
\begin{exercise}
2517
I often see rabbits in the garden behind my house, but it's not easy to tell them apart, so I don't really know how many there are.
2518
2519
Suppose I deploy a motion-sensing camera trap that takes a picture of the first rabbit it sees each day. After three days, I compare the pictures and conclude that two of them are the same rabbit and the other is different.
2520
2521
How many rabbits visit my garden?
2522
2523
To answer this question, we have to think about the prior distribution and the likelihood of the data:
2524
2525
\begin{itemize}
2526
2527
\item I have sometimes seen four rabbits at the same time, so I know there are at least that many. I would be surprised if there were more than 10. So, at least as a starting place, I think a uniform prior from 4 to 10 is reasonable.
2528
2529
\item To keep things simple, let's assume that all rabbits who visit my garden are equally likely to be caught by the camera trap in a given day. Let's also assume it is guaranteed that the camera trap gets a picture every day.
2530
2531
\end{itemize}
2532
2533
\end{exercise}
2534
2535
\begin{exercise}
2536
Suppose that in the criminal justice system, all prison sentences are either 1, 2, or 3 years, with an equal number of each. One day, you visit a prison and choose a prisoner at random. What is the probability that they are serving a 3-year sentence? What is the average remaining sentence of the prisoners you observe?
2537
\end{exercise}
2538
2539
2540
\begin{exercise}
2541
If I choose a random adult in the U.S., what is the probability that they have a sibling? To be precise, what is the probability that their mother has had at least one other child?
2542
2543
This article from the Pew Research Center provides some relevant data: \url{https://www.pewsocialtrends.org/2015/05/07/family-size-among-mothers}. You will have to make some simplifying assumptions.
2544
\end{exercise}
2545
2546
2547
\begin{exercise}
2548
The Doomsday argument is ``a probabilistic argument that claims to predict the number of future members of the human species given an estimate of the total number of humans born so far.'' See \url{https://en.wikipedia.org/wiki/Doomsday_argument}.
2549
2550
Suppose there are only two kinds of civilizations that can happen in the universe. The ``short-lived'' kind go extinct after only 200 billion individuals are born. The ``long-lived'' kind survive until 2,000 billion individuals are born. And suppose that the two kinds of civilization are equally likely. Which kind of civilization do you think we live in?
2551
2552
The Doomsday argument says we can use the total number of humans born so far as evidence.
2553
According to the Population Reference Bureau, the total number of people who have ever lived is about 108 billion.
2554
2555
Since you were born quite recently, let's assume that you are, in fact, human being number 108 billion.
2556
If $N$ is the total number who will ever live and we consider you to be a randomly-chosen person, it is equally likely that you could have been person 1, or $N$, or any number in between.
2557
So what is the probability that you would be number 108 billion?
2558
2559
Given this data and dubious prior, what is the probability that our civilization will be short-lived?
2560
2561
\end{exercise}
2562
2563
2564
2565
%\begin{exercise}
2566
%To write a likelihood function for the locomotive problem, we had
2567
%to answer this question: ``If the railroad has $N$ locomotives, what
2568
%is the probability that we see number 60?''
2569
%
2570
%The answer depends on what sampling process we use when we observe the
2571
%locomotive. In this chapter, I resolved the ambiguity by specifying
2572
%that there is only one train-operating company (or only one that we
2573
%care about).
2574
%
2575
%But suppose instead that there are many companies with different
2576
%numbers of trains. And suppose that you are equally likely to see any
2577
%train operated by any company.
2578
%In that case, the likelihood function is different because you
2579
%are more likely to see a train operated by a large company.
2580
%
2581
%As an exercise, implement the likelihood function for this variation
2582
%of the locomotive problem, and compare the results.
2583
%
2584
%# Solution
2585
%
2586
%# Suppose Company A has N trains and all other companies have M.
2587
%# The chance that we would observe one of Company A's trains is
2588
%# $N/(N+M)$.
2589
%
2590
%# Given that we observe one of Company A's trains, the chance that we
2591
%# observe number 60 is $1/N$ for $N \ge 60$.
2592
%
2593
%# The product of these probabilities is $1/(N+M)$, which is the
2594
%# probability of observing any given train.
2595
%
2596
%# If N<<M, this converges to a constant, which means that all values
2597
%# of $N$ have the same likelihood, so we learn nothing about how many
2598
%# trains Company A has.
2599
%
2600
%# If N>>M, this converges to $1/N$, which is what we saw in the
2601
%# previous solution.
2602
%
2603
%# More generally, if M is unknown, we would need a prior distribution
2604
%# for M, then we can do a two-dimensional update, and then extract the posterior
2605
%# distribution for N.
2606
%
2607
%# We'll see how to do that soon.
2608
%\end{exercise}
2609
2610
2611
2612
2613
\chapter{Odds and Addends}
2614
2615
2616
This chapter presents a new way to represent a degree of certainty, called ``odds'', and a new form of Bayes's Theorem, called Bayes's Rule.
2617
Bayes's Rule is convenient if you want to do a Bayesian update on paper or in your head.
2618
It also sheds light on the important idea of ``evidence'' and how we can quantify the strength of evidence.
2619
2620
The second part of the chapter is about ``addends'', that is, quantities being added, and how we can compute their distributions.
2621
We'll define functions that compute the distribution of a sum, difference, or result of another operation.
2622
And then we'll use those distributions as part of a Bayesian update.
2623
2624
As an exercise, you'll have a chance to solve the Congress problem:
2625
2626
\begin{quote}
2627
There are 538 members of the United States Congress.
2628
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
2629
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance. How many members of Congress are honest?
2630
\end{quote}
2631
2632
2633
\section{Odds}
2634
2635
One way to represent a degree of certainty is a probability in the form of a number between 0 and 1, but that's not the only way.
2636
If you have ever bet on a football game or a horse race, you might have encountered another representation of certainty, called {\bf odds}.
2637
2638
\index{odds}
2639
2640
You might have heard expressions like ``the odds are
2641
three to one,'' but you might not know what that means.
2642
The {\bf odds in favor} of an event are the ratio of the probability
2643
it will occur to the probability that it will not.
2644
2645
So if I think my team has a 75\% chance of winning, I would
2646
say that the odds in their favor are three to one, because
2647
the chance of winning is three times the chance of losing.
2648
2649
You can write odds in decimal form, but it is also common to
2650
write them as a ratio of integers. So ``three to one'' is
2651
written $3:1$.
2652
2653
When probabilities are low, it is more common to report the
2654
{\bf odds against} rather than the odds in favor. For
2655
example, if I think my horse has a 10\% chance of winning,
2656
I would say that the odds against are $9:1$.
2657
2658
Probabilities and odds are different representations of the
2659
same information. Given a probability, you can compute the
2660
odds like this:
2661
2662
\begin{code}
def odds(p):
    return p / (1-p)
\end{code}
2666
2667
Given the odds in favor, in decimal form, you can convert to probability like this:
2668
2669
\begin{code}
def prob(o):
    return o / (o+1)
\end{code}
2673
2674
If you represent odds with a numerator and denominator, you
2675
can convert to probability like this:
2676
2677
\begin{code}
def prob2(yes, no):
    return yes / (yes + no)
\end{code}
2681
2682
When I work with odds in my head, I find it helpful to picture
2683
people at the track. If 20\% of them think my horse will win,
2684
then 80\% of them don't, so the odds in favor are $20:80$ or
2685
$1:4$.
2686
2687
If the odds are $5:1$ against my horse, then five out of six
2688
people think she will lose, so the probability of winning
2689
is $1/6$.
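
As a quick check of the conversions above, here is a minimal sketch that reproduces the examples in this section. It repeats the three functions so that it runs on its own; the use of \py{Fraction} is my choice for illustration (it keeps the arithmetic exact), not something the functions require.

\begin{code}
from fractions import Fraction

def odds(p):
    """Convert a probability to odds in favor."""
    return p / (1 - p)

def prob(o):
    """Convert odds in favor (decimal form) to a probability."""
    return o / (o + 1)

def prob2(yes, no):
    """Convert odds written as yes:no to a probability."""
    return yes / (yes + no)

# A probability of 3/4 corresponds to odds of 3:1 in favor.
team_odds = odds(Fraction(3, 4))

# Converting back recovers the original probability.
team_prob = prob(team_odds)

# Odds of 5:1 against are odds of 1:5 in favor.
horse_prob = prob2(Fraction(1), Fraction(5))

print(team_odds, team_prob, horse_prob)
\end{code}

The round trip from probability to odds and back returns the value we started with, which is one way to see that the two representations carry the same information.
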
2690
2691
\index{horse racing}
2692
2693
2694
\section{Bayes's Rule}
2695
2696
\index{Bayes's Rule}
2697
2698
In Chapter~\ref{intro} I wrote Bayes's theorem in the {\bf probability
2699
form}:
2700
%
2701
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
2702
%
2703
If we have two hypotheses, $A$ and $B$,
2704
we can write the ratio of posterior probabilities like this:
2705
%
2706
\[ \frac{\p{A|D}}{\p{B|D}} = \frac{\p{A}~\p{D|A}}{\p{B}~\p{D|B}} \]
2708
%
2709
Notice that the total probability of the data, \p{D}, drops out of
2710
this equation.
2711
2712
Writing \odds{A} for odds in favor of $A$, we use the definition of odds to write:
2713
%
2714
\[ \odds{A} = \frac{\p{A}}{1-\p{A}} \]
2715
%
2716
If $A$ and $B$ are mutually exclusive and collectively exhaustive,
2717
that means $\p{B} = 1 - \p{A}$, so we can write
2718
%
2719
\[ \odds{A} = \frac{\p{A}}{\p{B}} \]
2720
%
2721
By the same process, we can write the posterior odds like this:
2722
%
2723
\[ \odds{A|D} = \frac{\p{A|D}}{\p{B|D}} \]
2724
%
2725
Putting it all together, we have:
2726
%
2727
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
2728
%
2729
This is Bayes's Rule, which says that the posterior odds are the prior odds times the likelihood ratio.
2730
2731
This form of Bayes's Theorem is convenient for computing a Bayesian update on paper or in your head.
2732
For example, let's go back to the cookie problem:
2733
\index{cookie problem}
2734
2735
\begin{quote}
2736
Suppose there are two bowls of cookies.
2737
Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
2738
Bowl 2 contains 20 of each.
2739
2740
Now suppose you choose one of the bowls at random and, without
2741
looking, select a cookie at random.
2742
The cookie is vanilla.
2743
What is the probability that it came from Bowl 1?
2744
\end{quote}
2745
2746
The prior probability is 50\%, so the prior odds are $1$.
2747
The likelihood ratio is $\frac{3}{4} / \frac{1}{2}$, or $3/2$.
2748
So the posterior odds are $3/2$, which corresponds to probability
2749
$3/5$.
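
The same update can be written out in a few lines of Python. This is only a sketch of the arithmetic above; the variable names are mine, and \py{Fraction} is used so the results match the fractions in the text exactly.

\begin{code}
from fractions import Fraction

# Prior: the bowls are equally likely, so the odds in favor of Bowl 1 are 1.
prior_odds = Fraction(1, 1)

# Likelihood of drawing a vanilla cookie under each hypothesis.
like_bowl1 = Fraction(30, 40)
like_bowl2 = Fraction(20, 40)

# Bayes's Rule: posterior odds = prior odds * likelihood ratio.
likelihood_ratio = like_bowl1 / like_bowl2
posterior_odds = prior_odds * likelihood_ratio

# Convert the posterior odds back to a probability.
posterior_prob = posterior_odds / (posterior_odds + 1)

print(likelihood_ratio, posterior_odds, posterior_prob)
\end{code}
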
2750
2751
2752
\section{Oliver's blood}
2753
\label{oliver}
2754
2755
\index{Oliver's blood problem}
2756
\index{MacKay, David}
2757
2758
I'll use Bayes's Rule to solve another problem from MacKay's {\it Information Theory, Inference, and Learning Algorithms}:
2759
2760
\begin{quote}
2761
Two people have left traces of their own blood at the scene of
2762
a crime. A suspect, Oliver, is tested and found to have type
2763
`O' blood. The blood groups of the two traces are found to
2764
be of type `O' (a common type in the local population, having frequency
2765
60\%) and of type `AB' (a rare type, with frequency 1\%).
2766
Do these data [the traces found at the scene] give evidence
2767
in favor of the proposition that Oliver was one of the people
2768
[who left blood at the scene]?
2769
\end{quote}
2770
2771
To answer this question, we need to think about what it means
2772
for data to give evidence in favor of (or against) a hypothesis.
2773
Intuitively, we might say that data favor a hypothesis if the
2774
hypothesis is more likely in light of the data than it was before.
2775
2776
\index{evidence}
2777
2778
In the cookie problem, the prior odds are $1$, or probability 50\%.
2779
The posterior odds are $3/2$, or probability 60\%.
2780
So the vanilla cookie is evidence in favor of Bowl 1.
2781
2782
Bayes's Rule provides a way to make this intuition more precise. Again
2783
%
2784
\[ \odds{A|D} = \odds{A}~\frac{\p{D|A}}{\p{D|B}} \]
2785
%
2786
Dividing through by \odds{A}, we get:
2787
%
2788
\[ \frac{\odds{A|D}}{\odds{A}} = \frac{\p{D|A}}{\p{D|B}} \]
2789
%
2790
The term on the left is the ratio of the posterior and prior odds.
2791
The term on the right is the likelihood ratio, also called the {\bf Bayes
2792
factor}.
2793
2794
\index{likelihood ratio}
2795
\index{Bayes factor}
2796
2797
If the Bayes factor is greater than 1, that means that the
2798
data were more likely under $A$ than under $B$.
2799
And that means that the odds are greater, in light of the data, than they were before.
2800
2801
If the Bayes factor is less than 1, that means the data were
2802
less likely under $A$ than under $B$, so the odds in
2803
favor of $A$ go down.
2804
2805
Finally, if the Bayes factor is exactly 1, the data are equally
2806
likely under either hypothesis, so the odds do not change.
2807
2808
Let's apply that to the problem at hand. If Oliver is
2809
one of the people who left blood at the crime scene, he
2810
accounts for the `O' sample; in that case, the probability of the data
2811
is the probability that a random member of the population
2812
has type `AB' blood, which is 1\%.
2813
2814
If Oliver did not leave blood at the scene, we have two
2815
samples to account for. If we choose two random people from
2816
the population, what is the chance of finding one with type `O'
2817
and one with type `AB'? Well, there are two ways it might happen:
2818
the first person might have type `O' and the second
2819
`AB', or the other way around. So the total probability is
2820
$2 (0.6) (0.01) = 1.2\%$.
2821
2822
The likelihood of the data is slightly higher if Oliver is
2823
{\it not} one of the people who left blood at the scene, so
2824
the blood data is actually evidence against Oliver's guilt.
2825
2826
\index{evidence}
2827
2828
This example is a little contrived, but it demonstrates
the counterintuitive result that data {\it consistent} with
a hypothesis are not necessarily {\it in favor of}
the hypothesis.
2832
2833
If this result still bothers you, this way of thinking might help: the data consist of a common event, type `O' blood, and a rare event, type `AB' blood.
2834
If Oliver accounts for the common event, that leaves the rare
2835
event unexplained. If Oliver doesn't account for the
2836
`O' blood, we have two chances to find someone in the
2837
population with `AB' blood. And that factor of two makes
2838
the difference.
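
To make the size of the effect concrete, here is a small sketch that computes the Bayes factor for this problem from the two likelihoods derived above. The variable names are mine, chosen for illustration.

\begin{code}
from fractions import Fraction

p_O = Fraction(60, 100)   # frequency of type O in the population
p_AB = Fraction(1, 100)   # frequency of type AB in the population

# If Oliver left the O sample, only the AB sample needs explaining.
like_oliver = p_AB

# Otherwise two unknown people left the samples, in either order.
like_not_oliver = 2 * p_O * p_AB

# Bayes factor: likelihood ratio in favor of Oliver being at the scene.
bayes_factor = like_oliver / like_not_oliver

print(bayes_factor)
\end{code}

The Bayes factor is $5/6$, which is less than 1, so the data shift the odds slightly against the hypothesis that Oliver was at the scene. For example, prior odds of $1$ would become posterior odds of $5/6$, or probability $5/11$.
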
2839
2840
2841
\section{Addends}
2842
\label{addends}
2843
2844
Suppose you roll two dice and add them up. What is the distribution of the sum?
2845
I'll use the following function to create a \py{Pmf} that represents the outcome of a die:
2846
2847
\begin{code}
def make_die(sides):
    outcomes = np.arange(1, sides+1)
    die = Pmf(1/sides, outcomes)
    return die
\end{code}
2853
2854
On a six-sided die, there are six possible outcomes, 1 through 6, all equally likely.
2855
2856
\begin{code}
2857
die = make_die(6)
2858
\end{code}
2859
2860
If we roll two dice and add them up, there are 11 possible outcomes, 2 through 12, but they are not equally likely.
2861
To compute the distribution of the sum, we can enumerate the possible outcomes.
2862
The following loop enumerates the quantities and probabilities from a \py{Pmf}:
2863
2864
\begin{code}
for q, p in die.items():
    print(q, p)
\end{code}
2868
2869
\py{items} iterates through the quantities and probabilities in the \py{Pmf}.
2870
So this loop enumerates all pairs of quantities and their probabilities:
2871
2872
\begin{code}
for q1, p1 in pmf1.items():
    for q2, p2 in pmf2.items():
        q = q1 + q2
        p = p1 * p2
\end{code}
2878
2879
Each time through the loop \py{q} gets the sum of the pair of quantities, and \py{p} gets the probability of the pair.
2880
Because the same sum might appear more than once, we have to add up the total probability for each sum.
2881
And that's how this function works:
2882
2883
\begin{code}
def add_dist(pmf1, pmf2):
    res = Pmf()
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            q = q1 + q2
            p = p1 * p2
            res[q] = res(q) + p
    return res
\end{code}
2893
2894
The parameters are \py{Pmf} objects representing distributions.
2895
The first line creates an empty \py{Pmf}.
2896
Each time through the loop, we compute \py{q} and \py{p} and then increment the probability associated with \py{q}.
2897
2898
Notice a subtle element of this line:
2899
2900
\begin{code}
2901
res[q] = res(q) + p
2902
\end{code}
2903
2904
I use parentheses on the right side of the assignment, which returns 0 if \py{q} does not appear yet in \py{res}.
2905
I use brackets on the left side of the assignment to create or update an element in \py{res}; using parentheses on the left side would not work.
2906
2907
\py{Pmf} provides a method that does the same thing.
2908
You can call it as a method, like this:
2909
2910
\begin{code}
2911
twice = die.add_dist(die)
2912
\end{code}
2913
2914
Or as a function, like this:
2915
2916
\begin{code}
2917
twice = Pmf.add_dist(die, die)
2918
\end{code}
2919
2920
If we have a sequence of \py{Pmf} objects that represent dice, we can compute the distribution of the sum like this:
2921
2922
\begin{code}
def add_dist_seq(seq):
    total = seq[0]
    for other in seq[1:]:
        total = total.add_dist(other)
    return total
\end{code}
2929
2930
So we can compute the sum of three dice like this:
2931
2932
\begin{code}
2933
dice = [die] * 3
2934
thrice = add_dist_seq(dice)
2935
\end{code}
2936
2937
Figure~\ref{fig05-01} shows what these three distributions look like:
2938
2939
\begin{itemize}
2940
2941
\item The distribution of a single die is uniform from 1 to 6.
2942
2943
\item The sum of two dice has a triangle distribution between 2 and 12.
2944
2945
\item The sum of three dice has a bell-shaped distribution between 3 and 18.
2946
2947
\end{itemize}
2948
2949
\begin{figure}
2950
% chap05soln.ipynb
2951
\centerline{\includegraphics[width=4in]{figs/fig05-01.pdf}}
2952
\caption{Distribution of outcomes for one six-sided die, two dice, and three dice.}
2953
\label{fig05-01}
2954
\end{figure}
2955
2956
As an aside, this example demonstrates the Central Limit Theorem, which says that the distribution of a sum converges on a bell-shaped normal distribution, at least under some conditions.
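
If you want to check these distributions without \py{Pmf}, the same enumeration works with plain dictionaries. The following is a sketch under that assumption; \py{make_die_dict} and \py{add_dist_dict} are hypothetical names introduced here, not part of any library.

\begin{code}
from collections import defaultdict
from fractions import Fraction

def make_die_dict(sides):
    """Map each outcome of a fair die to its probability."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def add_dist_dict(dist1, dist2):
    """Distribution of the sum of two independent quantities."""
    res = defaultdict(Fraction)  # a missing sum starts at probability 0
    for q1, p1 in dist1.items():
        for q2, p2 in dist2.items():
            res[q1 + q2] += p1 * p2
    return dict(res)

die = make_die_dict(6)
twice = add_dist_dict(die, die)
thrice = add_dist_dict(twice, die)

print(twice[7], thrice[10], thrice[11])
\end{code}

With two dice the most likely sum is 7, with probability $1/6$. With three dice the peak is shared by 10 and 11, each with probability $1/8$, which is the flattened top of the bell shape in the figure.
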
2957
2958
\section{Gluten}
2959
2960
In 2015 I read a paper that tested whether people diagnosed with gluten sensitivity (but not celiac disease) were able to distinguish gluten flour from non-gluten flour in a blind challenge (\url{https://onlinelibrary.wiley.com/doi/full/10.1111/apt.13372}).
2961
2962
Out of 35 subjects, 12 correctly identified the gluten flour based on resumption of symptoms while they were eating it. Another 17 wrongly identified the gluten-free flour based on their symptoms, and 6 were unable to distinguish.
2963
2964
The authors conclude, ``Double-blind gluten challenge induces symptom recurrence in just one-third of patients.''
2965
2966
This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.
2967
So here's the question: based on this data, how many of the subjects are sensitive to gluten?
2968
2969
We can use Bayes's Theorem to answer this question, but first we have to make some modeling decisions.
2970
I'll assume:
2971
2972
\begin{itemize}
2973
2974
\item People who are sensitive to gluten have a 95\% chance of correctly identifying gluten flour under the challenge conditions, and
2975
2976
\item People who are not sensitive have a 40\% chance of identifying the gluten flour by chance (and a 60\% chance of either choosing the other flour or failing to distinguish).
2977
2978
\end{itemize}
2979
2980
These particular values are arbitrary, but the results are not sensitive to these choices.
2981
2982
I will solve this problem in two steps. First, assuming that we know how many subjects are sensitive, I will compute the distribution of the data. Then, using the likelihood of the data, I will compute the posterior distribution of the number of sensitive patients.
2983
2984
The first is the {\bf forward problem}; the second is the {\bf inverse problem}.
2985
2986
2987
\section{Forward problem}
2988
2989
Suppose we know that 10 of the 35 subjects are sensitive to gluten. That means that 25 are not:
2990
2991
\begin{code}
2992
n = 35
2993
n_sensitive = 10
2994
n_insensitive = n - n_sensitive
2995
\end{code}
2996
2997
Each sensitive subject has a 95\% chance of identifying the gluten flour, so the number of correct identifications follows a binomial distribution with \py{p=0.95}:
2998
2999
\begin{code}
3000
dist_sensitive = make_binomial(n_sensitive, 0.95)
3001
\end{code}
3002
3003
And similarly for the insensitive subjects:
3004
3005
\begin{code}
3006
dist_insensitive = make_binomial(n_insensitive, 0.4)
3007
\end{code}
3008
3009
\py{make_binomial} returns a \py{Pmf} that represents the distribution of correct identifications.
3010
So we can use \py{add_dist} to compute the total number of correct identifications in both groups:
3011
3012
\begin{code}
3013
dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
3014
\end{code}
3015
3016
Figure~\ref{fig05-02} shows the distribution of correct identifications among sensitive and insensitive subjects, and the total.
3017
3018
\begin{figure}
3019
% chap02soln.ipynb
3020
\centerline{\includegraphics[width=4in]{figs/fig05-02.pdf}}
3021
\caption{Distribution of correct identifications among sensitive and insensitive subjects, and the total.}
3022
\label{fig05-02}
3023
\end{figure}
3024
3025
Of the 10 sensitive subjects, we expect most of them to identify the gluten flour correctly.
3026
Of the 25 insensitive subjects, we expect about 10 to identify the gluten flour by chance.
3027
So we expect about 20 correct identifications in total.
3028
3029
This is the answer to the forward problem: given the number of sensitive subjects, we can compute the distribution of the data.
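
The estimate of about 20 correct identifications can be checked with a short, self-contained sketch. It does not use \py{make_binomial} or \py{Pmf}; instead it represents each distribution as a list indexed by the number of successes. The helper names \py{binomial_probs} and \py{add_probs} are hypothetical, introduced only for this illustration.

\begin{code}
from math import comb

def binomial_probs(n, p):
    """Element k is the probability of k successes in n trials."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def add_probs(probs1, probs2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(probs1) + len(probs2) - 1)
    for k1, p1 in enumerate(probs1):
        for k2, p2 in enumerate(probs2):
            res[k1 + k2] += p1 * p2
    return res

n = 35
n_sensitive = 10
n_insensitive = n - n_sensitive

dist_sensitive = binomial_probs(n_sensitive, 0.95)
dist_insensitive = binomial_probs(n_insensitive, 0.4)
dist_total = add_probs(dist_sensitive, dist_insensitive)

# Mean of the total: sum of k times the probability of k.
mean_total = sum(k * p for k, p in enumerate(dist_total))
print(mean_total)
\end{code}

The mean of the total is $10 \times 0.95 + 25 \times 0.4 = 19.5$, up to floating-point error, which agrees with the rough estimate above.
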
3030
3031
\section{Inverse problem}
3032
3033
Now let's solve the inverse problem: given the data, we'll compute the posterior distribution of the number of sensitive subjects.
3034
3035
Here's how. I'll loop through the possible values of \py{n_sensitive} and compute the distribution of the data for each:
3036
3037
\begin{code}
table = pd.DataFrame()
for n_sensitive in range(1, n):
    n_insensitive = n - n_sensitive

    dist_sensitive = make_binomial(n_sensitive, 0.95)
    dist_insensitive = make_binomial(n_insensitive, 0.4)
    dist_total = Pmf.add_dist(dist_sensitive, dist_insensitive)
    table[n_sensitive] = dist_total
\end{code}
3047
3048
I store each distribution as a column in a Pandas DataFrame.
3049
When \py{n_sensitive} is 0 or \py{n}, the distribution of the data is a simple binomial, not the sum of two binomials:
3050
3051
\begin{code}
3052
table[0] = make_binomial(n, 0.4)
3053
table[n] = make_binomial(n, 0.95)
3054
\end{code}
3055
3056
Figure~\ref{fig05-03} shows several columns from this table, corresponding to several hypothetical values of \py{n_sensitive}:
3057
3058
\begin{figure}
3059
% chap05soln.ipynb
3060
\centerline{\includegraphics[width=4in]{figs/fig05-03.pdf}}
3061
\caption{Distribution of the number of correct identifications for different values of \py{n_sensitive}.}
3062
\label{fig05-03}
3063
\end{figure}
3064
3065
Now we can use this table to compute the likelihood of the data:
3066
3067
\begin{code}
3068
likelihood = table.loc[12]
3069
\end{code}
3070
3071
\py{loc} selects a row from the table.
3072
The row with index 12 contains the probability of 12 correct identifications for each hypothetical value of \py{n_sensitive}.
3073
And that's exactly the likelihood we need to do a Bayesian update.
3074
3075
I'll use a uniform prior, which implies that I would be equally surprised by any value of \py{n_sensitive}:
3076
3077
\begin{code}
3078
hypos = np.arange(n+1)
3079
prior = Pmf(1, hypos)
3080
\end{code}
3081
3082
And here's the update:
3083
3084
\begin{code}
3085
posterior = prior * likelihood
3086
posterior.normalize()
3087
\end{code}
3088
3089
Figure~\ref{fig05-04} shows posterior distributions of \py{n_sensitive} based on the actual data, 12 correct identifications, and another hypothetical outcome, 20 correct identifications.
3090
3091
\begin{figure}
3092
% chap05soln.ipynb
3093
\centerline{\includegraphics[width=4in]{figs/fig05-04.pdf}}
3094
\caption{Posterior distributions of \py{n_sensitive}.}
3095
\label{fig05-04}
3096
\end{figure}
3097
3098
With 12 correct identifications, the most likely conclusion is that none of the subjects are sensitive to gluten.
3099
If there had been 20 correct identifications, the most likely conclusion would be that 11-12 of the subjects were sensitive.
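
For readers who want to reproduce these conclusions without Pandas or \py{Pmf}, here is a sketch of the whole update using only the standard library. It assumes the same modeling choices as above (95\% and 40\% chances of a correct identification, and a uniform prior). The function names are hypothetical. Because a binomial with zero trials puts all its probability on zero successes, the cases where \py{n_sensitive} is 0 or \py{n} need no special handling in this version.

\begin{code}
from math import comb

def binomial_probs(n, p):
    """Element k is the probability of k successes in n trials."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def add_probs(probs1, probs2):
    """Distribution of the sum of two independent counts."""
    res = [0.0] * (len(probs1) + len(probs2) - 1)
    for k1, p1 in enumerate(probs1):
        for k2, p2 in enumerate(probs2):
            res[k1 + k2] += p1 * p2
    return res

def posterior_n_sensitive(n, data, p_sens=0.95, p_insens=0.4):
    """Posterior over the number of sensitive subjects, uniform prior."""
    likelihood = []
    for n_sensitive in range(n + 1):
        dist_total = add_probs(binomial_probs(n_sensitive, p_sens),
                               binomial_probs(n - n_sensitive, p_insens))
        likelihood.append(dist_total[data])
    # With a uniform prior, normalizing the likelihood gives the posterior.
    total = sum(likelihood)
    return [like / total for like in likelihood]

posterior_12 = posterior_n_sensitive(35, 12)
posterior_20 = posterior_n_sensitive(35, 20)

# Most probable number of sensitive subjects under each outcome.
map_12 = max(range(36), key=lambda k: posterior_12[k])
map_20 = max(range(36), key=lambda k: posterior_20[k])
print(map_12, map_20)
\end{code}
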
3100
3101
3102
\section{Summary}
3103
3104
This chapter presents two topics that are almost unrelated except that they make the title of the chapter catchy.
3105
3106
The first part of the chapter is about Bayes's Rule, evidence, and how we can quantify the strength of evidence using a likelihood ratio or Bayes factor.
3107
3108
The second part is about functions that compute the distribution of a sum, product, or the result of another binary operation.
3109
We can use these functions to solve forward and inverse problems; that is, given the parameters of a system, we can compute the distribution of the data or, given the data, we can compute the distribution of the parameters.
3110
3111
In the following exercises, you'll have a chance to practice what you learned.
3112
3113
3114
\section{Exercises}
3115
3116
The code for this chapter is in \py{chap05.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
3117
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap05.ipynb}.
3118
3119
The notebook provides space where you can work on the following problems.
3120
3121
3122
\begin{exercise}
3123
Let's use Bayes's Rule to solve the Elvis problem from Section~\ref{elvis}:
3124
3125
\begin{quote}
3126
Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?
3127
\end{quote}
3128
3129
In 1935, about 2/3 of twins were fraternal and 1/3 were identical.
3130
The question contains two pieces of information we can use to update this prior.
3131
First, Elvis's twin was also male, which is more likely if they were identical twins, with a likelihood ratio of 2.
3132
Also, Elvis's twin died at birth, which is more likely if they were identical twins, with a likelihood ratio of 1.25.
3133
3134
If you are curious about where those numbers come from, I wrote a blog post about it at \url{https://www.allendowney.com/blog/2020/01/28/the-elvis-problem-revisited}.
3135
\end{exercise}
3136
3137
3138
\begin{exercise}
3139
The following is an interview question that appeared on glassdoor.com, attributed to Facebook (\url{https://www.glassdoor.com/Interview/You-re-about-to-get-on-a-plane-to-Seattle-You-want-to-know-if-you-should-bring-an-umbrella-You-call-3-random-friends-of-y-QTN_519262.htm}):
3140
3141
\begin{quote}
3142
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that ``Yes'' it is raining. What is the probability that it's actually raining in Seattle?
3143
\end{quote}
3144
3145
Use Bayes's Rule to solve this problem. As a prior you can assume that it rains in Seattle about 10\% of the time.
3146
\end{exercise}
3147
3148
3149
\begin{exercise}
3150
According to the CDC, people who smoke are about 25 times more likely to develop lung cancer than nonsmokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/effects_cig_smoking/}).
3151
3152
Also according to the CDC, about 14\% of adults in the U.S. are smokers (see \url{https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm}).
3153
3154
If you learn that someone has lung cancer, what is the probability they are a smoker?
3155
\end{exercise}
3156
3157
3158
\begin{exercise}
3159
In {\it Dungeons~\&~Dragons}, the amount of damage a goblin can withstand is the sum of two six-sided dice. The amount of damage you inflict with a short sword is determined by rolling one six-sided die.
3160
A goblin is defeated if the total damage you inflict is greater than or equal to the amount it can withstand.
3161
3162
Suppose you are fighting a goblin and you have already inflicted 3 points of damage. What is your probability of defeating the goblin with your next successful attack?
3163
3164
Hint: You can use \py{Pmf.add_dist} to add a constant amount, like 3, to a \py{Pmf}.
3165
\end{exercise}
3166
3167
3168
\begin{exercise}
3169
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
3170
I choose one of the dice at random, roll it twice, multiply the outcomes, and report that the product is 12.
3171
What is the probability that I chose the 8-sided die?
3172
3173
Hint: \py{Pmf} provides a function called \py{mul_dist} that takes two \py{Pmf} objects and returns a \py{Pmf} that represents the distribution of the product.
3174
\end{exercise}
3175
3176
3177
\begin{exercise}
{\it Betrayal at House on the Hill} is a strategy game in which characters with different attributes explore a haunted house. Depending on their attributes, the characters roll different numbers of dice. For example, if attempting a task that depends on knowledge, Professor Longfellow rolls 5 dice, Madame Zostra rolls 4, and Ox Bellows rolls 3. Each die yields 0, 1, or 2 with equal probability.

If a randomly chosen character attempts a task three times and rolls a total of 3 on the first attempt, 4 on the second, and 5 on the third, which character do you think it was?
\end{exercise}


\begin{exercise}
There are 538 members of the United States Congress.
Suppose we audit their investment portfolios and find that 312 of them outperform the market.
Let's assume that an honest member of Congress has only a 50\% chance of outperforming the market, but a dishonest member who trades on inside information has a 90\% chance. How many members of Congress are honest?
\end{exercise}
\chapter{Minima, Maxima, and Mixtures}

In the previous chapter we computed distributions of sums, differences, products, and quotients.

In this chapter, we'll compute distributions of minima and maxima and use them to solve inference problems.
Then we'll look at distributions that are mixtures of other distributions, which will turn out to be particularly useful for making predictions.

But we'll start with a powerful tool for working with distributions, the cumulative distribution function.

\section{Cumulative distribution functions}

So far we have been using probability mass functions to represent distributions.
A useful alternative is the {\bf cumulative distribution function}, or CDF.

As an example, I'll use the posterior distribution from the Euro problem, which we computed in Section~\ref{binomlike}.
\begin{code}
hypos = np.linspace(0, 1, 101)
pmf = Pmf(1, hypos)
data = 140, 250
update_binomial(pmf, data)
\end{code}

The CDF is the cumulative sum of the PMF, so we can compute it like this:

\begin{code}
cumulative = pmf.cumsum()
\end{code}

The result is a Pandas Series, so we can use the bracket operator to select an element:

\begin{code}
cumulative[0.61]
\end{code}

The result is about 0.96, which means that the total probability of all quantities less than or equal to 0.61 is 96\%.

To go the other way, that is, to look up a probability and get the corresponding quantile, we can use interpolation:

\begin{code}
from scipy.interpolate import interp1d

ps = cumulative.values
qs = cumulative.index

interp = interp1d(ps, qs)
interp(0.96)
\end{code}

The result is about 0.61, so that confirms that the 96th percentile of this distribution is 0.61.

\py{empiricaldist} provides a class called \py{Cdf} that represents a cumulative distribution function.
Given a \py{Pmf}, you can compute a \py{Cdf} like this:

\begin{code}
cdf = pmf.make_cdf()
\end{code}

\py{make_cdf} uses \py{np.cumsum} to compute the cumulative sum of the probabilities.
Figure~\ref{fig06-01} shows the PMF and CDF of this distribution.
The range of the CDF is always from 0 to 1, in contrast with the PMF, where the maximum can be any probability.

\begin{figure}
% chap06soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig06-01.pdf}}
\caption{Posterior distribution from the Euro problem represented as a PMF and CDF.}
\label{fig06-01}
\end{figure}
You can use brackets to select an element from a \py{Cdf}:

\begin{code}
cdf[0.61]
\end{code}

But if you look up a value that's not in the distribution, you get a \py{KeyError}.
You can also call a \py{Cdf} as a function, using parentheses.
If the argument does not appear in the \py{Cdf}, it interpolates between quantities.

\begin{code}
cdf(0.615)
\end{code}

Going the other way, you can use \py{quantile} to look up a cumulative probability and get the corresponding quantity:

\begin{code}
cdf.quantile(0.96)
\end{code}

\py{Cdf} also provides \py{credible_interval}, which computes a credible interval that contains the given probability:

\begin{code}
cdf.credible_interval(0.9)
\end{code}

CDFs and PMFs are equivalent in the sense that they contain the
same information about the distribution, and you can always convert
from one to the other.
Given a \py{Cdf}, you can get the equivalent \py{Pmf} like this:

\begin{code}
pmf = cdf.make_pmf()
\end{code}

\py{make_pmf} uses \py{np.diff} to compute differences between consecutive cumulative probabilities.
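
As a minimal sketch of these two conversions, the following code builds the CDF of a six-sided die by cumulative sum, recovers the PMF by differencing, and looks up a quantile. It uses only the standard library rather than \py{empiricaldist}; the helper names \py{pmf_to_cdf}, \py{cdf_to_pmf}, and \py{quantile} are made up for this illustration, and \py{quantile} assumes the convention of returning the smallest quantity whose cumulative probability is at least \py{p}.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

def pmf_to_cdf(ps):
    """Cumulative sum of the probabilities (the role np.cumsum plays in the text)."""
    return list(accumulate(ps))

def cdf_to_pmf(cs):
    """Differences between consecutive cumulative probabilities."""
    return [cs[0]] + [b - a for a, b in zip(cs, cs[1:])]

def quantile(qs, cs, p):
    """Smallest quantity whose cumulative probability is at least p (assumed convention)."""
    return qs[bisect_left(cs, p)]

qs = [1, 2, 3, 4, 5, 6]
ps = [Fraction(1, 6)] * 6      # exact arithmetic avoids rounding noise

cs = pmf_to_cdf(ps)            # 1/6, 1/3, 1/2, 2/3, 5/6, 1
roundtrip = cdf_to_pmf(cs)     # should reproduce ps
median = quantile(qs, cs, Fraction(1, 2))
```

Because the round trip reproduces the original probabilities exactly, the sketch illustrates the sense in which the two representations carry the same information.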

One reason \py{Cdf} objects are useful is that they compute quantiles efficiently.
Another is that they make it easy to compute the distribution of a maximum or minimum, as we'll see in the next section.
\section{Best Three of Four}

In {\it Dungeons~\&~Dragons}, each character has six attributes: strength, intelligence, wisdom, dexterity, constitution, and charisma.

To generate a new character, players roll four 6-sided dice for each attribute and add up the best three.
For example, if I roll for strength and get 1, 2, 3, 4 on the dice, my character's strength would be 9.

As an exercise, let's figure out the distribution of these attributes.
Then, for each character, we'll figure out the distribution of their best attribute.

In Section~\ref{addends}, we computed the distribution of the sum of three dice like this:

\begin{code}
die = make_die(6)
dice = [die] * 3
pmf_3d6 = add_dist_seq(dice)
\end{code}

The definitions of \py{make_die} and \py{add_dist_seq} are in that section.

But if we roll four dice and add up the best three, computing the distribution of the sum is a bit more complicated.
I'll estimate the distribution by simulating 10,000 rolls.

First I'll create an array of random values from 1 to 6, with 10,000 rows and 4 columns:
\begin{code}
n = 10000
a = np.random.randint(1, 7, size=(n, 4))
\end{code}

To find the best three outcomes in each row, I'll sort along \py{axis=1}, which means across the columns.

\begin{code}
a.sort(axis=1)
\end{code}

Finally, I'll select the last three columns and add them up.

\begin{code}
t = a[:, 1:].sum(axis=1)
\end{code}

Now \py{t} is a one-dimensional array with 10,000 elements, one for each simulated roll.
We can compute the PMF of the values in \py{t} like this:

\begin{code}
pmf_4d6 = Pmf.from_seq(t)
\end{code}

Figure~\ref{fig06-02} shows the distribution of the sum of three dice, \py{pmf_3d6}, and the distribution of the best three out of four, \py{pmf_4d6}.

\begin{figure}
% chap06soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig06-02.pdf}}
\caption{Distributions of the sum of three dice and the best three of four.}
\label{fig06-02}
\end{figure}

As you might expect, choosing the best three out of four tends to yield higher values.
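
The simulation gives an estimate, but with only $6^4 = 1296$ equally likely outcomes the distribution can also be computed exactly. The following sketch is an alternative to the simulation, not the method used above: it enumerates every outcome with the standard library and uses exact fractions. The names \py{exact_4d6} and \py{mean_4d6} are introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate every equally likely outcome of four six-sided dice.
counts = Counter()
for roll in product(range(1, 7), repeat=4):
    counts[sum(roll) - min(roll)] += 1    # drop the lowest die, add the rest

total = 6 ** 4                             # 1296 outcomes
exact_4d6 = {q: Fraction(c, total) for q, c in sorted(counts.items())}

mean_4d6 = sum(q * p for q, p in exact_4d6.items())
print(float(mean_4d6))                     # mean of the best three of four
print(float(exact_4d6[18]))                # probability of rolling an 18
```

The exact mean is $15869/1296$, or about 12.24, compared with 10.5 for the sum of three dice, so a simulated \py{pmf_4d6} should have a mean close to that value.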

Next we'll find the distribution for the maximum of six attributes, each the sum of the best three of four dice.
\section{Maximum of Six}

To compute the distribution of a maximum or minimum, we can make good use of the cumulative distribution function.
First, I'll compute the \py{Cdf} of the best three of four distribution:

\begin{code}
cdf_4d6 = pmf_4d6.make_cdf()
\end{code}

Recall that \py{Cdf(x)} is the sum of probabilities for quantities less than or equal to \py{x}.
Equivalently, it is the probability that a random value chosen from the distribution is less than or equal to \py{x}.

Now suppose I draw 6 values from this distribution.
Because the draws are independent, the probability that all 6 of them are less than or equal to \py{x} is \py{Cdf(x)} raised to the 6th power, which we can compute like this:

\begin{code}
cdf_4d6**6
\end{code}

If all 6 values are less than or equal to \py{x}, that means that their maximum is less than or equal to \py{x}.
So the result is the CDF of their maximum.
We can convert it to a \py{Cdf} object, like this:

\begin{code}
cdf_max6 = Cdf(cdf_4d6**6)
\end{code}

And compute the equivalent \py{Pmf} like this:

\begin{code}
pmf_max6 = cdf_max6.make_pmf()
\end{code}

Figure~\ref{fig06-03} shows the result.
Most characters have at least one attribute greater than 12; almost 10\% of them have an 18.

\begin{figure}
% chap06soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig06-03.pdf}}
\caption{Distribution for the minimum and maximum of six attributes.}
\label{fig06-03}
\end{figure}

\py{Pmf} and \py{Cdf} provide \py{max_dist}, which does the same computation.
We can compute the \py{Pmf} of the maximum like this:

\begin{code}
pmf_4d6.max_dist(6)
\end{code}

And the \py{Cdf} of the maximum like this:

\begin{code}
cdf_4d6.max_dist(6)
\end{code}
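
To make the steps concrete, here is a standard-library sketch of the same computation. It assumes the exact best-three-of-four distribution, obtained by enumerating all 1296 rolls, in place of the simulated \py{pmf_4d6}, so the numbers can differ slightly from a simulation. The variable names are chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# Exact PMF of the best three of four dice, by enumeration.
counts = Counter(sum(r) - min(r) for r in product(range(1, 7), repeat=4))
qs = sorted(counts)
ps = [Fraction(counts[q], 6 ** 4) for q in qs]

cdf = list(accumulate(ps))                 # P(X <= q) for each q
cdf_max = [c ** 6 for c in cdf]            # P(all six values <= q)

# Difference the CDF of the maximum to recover its PMF.
pmf_max = [cdf_max[0]] + [b - a for a, b in zip(cdf_max, cdf_max[1:])]
p_18 = pmf_max[-1]                         # chance the best attribute is 18
print(float(p_18))
```

Under these assumptions the probability that the best of six attributes is 18 comes out a little over 9\%, in line with the figure of almost 10\% quoted above.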

In the next section we'll find the distribution of the minimum.
The process is similar, but a little more complicated.
See if you can figure it out before you go on.


%In mathematical notation, we use $X$ to represent a random value from a %distribution, so we can write:
%
%\[ \CDF(x) = \p{X \le x} \]
%
\section{Minimum of Six}

In the previous section we computed the distribution of a character's best attribute.
Now let's compute the distribution of the worst.

To compute the distribution of the minimum, we'll use the {\bf complementary CDF}, which we can compute like this:

\begin{code}
prob_gt = 1 - cdf_4d6
\end{code}

As the variable name suggests, the complementary CDF is the probability that a value from the distribution is greater than \py{x}.
If we draw 6 values from the distribution, the probability that all 6 exceed \py{x} is:

\begin{code}
prob_gt6 = prob_gt**6
\end{code}

If all 6 exceed \py{x}, that means their minimum exceeds \py{x}, so \py{prob_gt6} is the complementary CDF of the minimum.
And that means we can compute the CDF of the minimum like this:

\begin{code}
prob_le6 = 1 - prob_gt6
\end{code}

The result is a Pandas Series that represents the CDF of the minimum of six attributes.
We can put those values in a \py{Cdf} object like this:

\begin{code}
cdf_min6 = Cdf(prob_le6)
\end{code}

Figure~\ref{fig06-03} shows the result.

\py{Pmf} and \py{Cdf} provide \py{min_dist}, which does the same computation.
We can compute the \py{Pmf} of the minimum like this:

\begin{code}
pmf_4d6.min_dist(6)
\end{code}

And the \py{Cdf} of the minimum like this:

\begin{code}
cdf_4d6.min_dist(6)
\end{code}
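
The complementary-CDF argument can be checked on a case small enough to enumerate by hand. The following sketch, written with the standard library and names chosen for this illustration, computes the distribution of the minimum of two six-sided dice both ways: once with the method in this section and once by brute force over all 36 pairs.

```python
from fractions import Fraction
from itertools import accumulate, product

qs = range(1, 7)
ps = [Fraction(1, 6)] * 6                   # one six-sided die
n = 2                                       # number of dice in this small check

cdf = list(accumulate(ps))                  # P(X <= q)
prob_gt = [1 - c for c in cdf]              # complementary CDF, P(X > q)
cdf_min = [1 - g ** n for g in prob_gt]     # P(min <= q)
pmf_min = [cdf_min[0]] + [b - a for a, b in zip(cdf_min, cdf_min[1:])]

# Brute force: tabulate the minimum over all 36 equally likely pairs.
brute = [Fraction(sum(min(pair) == q for pair in product(qs, repeat=n)), 6 ** n)
         for q in qs]
```

The two results agree; for example, the probability that the minimum is 1 is $1 - (5/6)^2 = 11/36$.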

In the exercises at the end of the chapter, you'll use distributions of the minimum and maximum to do Bayesian inference.
But first we'll see what happens when we mix distributions.
\section{Mixtures}
\label{mixture}

Let's do one more example inspired by {\it Dungeons~\&~Dragons}.
Suppose I have a 4-sided die and a 6-sided die.
I choose one of them at random and roll it.
What is the distribution of the outcome?

If you know which die it is, the answer is easy.
A die with \py{n} sides yields a uniform distribution from 1 to \py{n}, inclusive.
We can compute \py{Pmf} objects to represent the dice, like this:

\begin{code}
d4 = make_die(4)
d6 = make_die(6)
\end{code}

To compute the distribution of the mixture, we can compute the average of the two distributions by adding them and dividing the result by 2:

\begin{code}
total = Pmf.add(d4, d6, fill_value=0) / 2
\end{code}

We have to use \py{Pmf.add} with \py{fill_value=0} because the two distributions don't have the same set of quantities.
If they did, we could use the \py{+} operator.

Now suppose I have a 4-sided die and {\it two} 6-sided dice.
Again, I choose one of them at random and roll it.
What is the distribution of the outcome?

We can solve this problem by computing a weighted average of the distributions, like this:

\begin{code}
total = Pmf.add(d4, 2*d6, fill_value=0) / 3
\end{code}
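
As a check on the weighted average, consider two outcomes. An outcome of 1 is possible with any of the three dice, while an outcome of 6 is possible only with the two 6-sided dice:
%
\[ P(1) = \frac{1}{3} \left( \frac{1}{4} + 2 \cdot \frac{1}{6} \right) = \frac{7}{36}
   \qquad
   P(6) = \frac{1}{3} \left( 0 + 2 \cdot \frac{1}{6} \right) = \frac{1}{9} \]
%
Outcomes 1 through 4 each have probability $7/36$ and outcomes 5 and 6 each have probability $1/9$, so the total is $4 \cdot 7/36 + 2 \cdot 1/9 = 1$, as it should be.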

Finally, suppose we have a box with the following mix:

\begin{verbatim}
1 4-sided die
2 6-sided dice
3 8-sided dice
\end{verbatim}

If I draw a die from this mix at random, we can use a \py{Pmf} to represent the hypothetical number of sides on the die, normalizing it so the counts become probabilities:

\begin{code}
hypos = [4,6,8]
counts = [1,2,3]
pmf_dice = Pmf(counts, hypos)
pmf_dice.normalize()
\end{code}

And I'll make a sequence of \py{Pmf} objects to represent the dice:

\begin{code}
dice = [make_die(sides) for sides in hypos]
\end{code}

Now we have to multiply each distribution in \py{dice} by the corresponding probabilities in \py{pmf_dice}.
To express this computation concisely, it is convenient to put the distributions into a Pandas DataFrame:

\begin{code}
pd.DataFrame(dice)
\end{code}

The result is a DataFrame with one row for each distribution and one column for each possible outcome.
Not all rows are the same length, so Pandas fills the extra spaces with the special value \py{NaN}, which stands for ``not a number''.
We can use \py{fillna} to replace the \py{NaN} values with 0.

\begin{code}
pd.DataFrame(dice).fillna(0)
\end{code}

Before we multiply by the probabilities in \py{pmf_dice}, we have to transpose the matrix so the distributions run down the columns rather than across the rows:

\begin{code}
df = pd.DataFrame(dice).fillna(0).transpose()
\end{code}

Now we can multiply by the probabilities:

\begin{code}
df *= pmf_dice.ps
\end{code}

And add up the weighted distributions:

\begin{code}
total = df.sum(axis=1)
\end{code}

The argument \py{axis=1} means we want to sum across the columns, so we get one total for each row, that is, for each possible outcome.
The result is a Pandas Series.
Putting it all together, here's a function that makes a weighted mixture of distributions.

\begin{code}
def make_mixture(pmf, pmf_seq):
    df = pd.DataFrame(pmf_seq).fillna(0).transpose()
    df *= pmf.ps
    total = df.sum(axis=1)
    return Pmf(total)
\end{code}

%TODO: Add make_mixture to empiricaldist

The first parameter is a \py{Pmf} that maps from each hypothesis to a probability.
The second parameter is a sequence of \py{Pmf} objects, one for each hypothesis.
We can call it like this:

\begin{code}
mix = make_mixture(pmf_dice, dice)
\end{code}

Figure~\ref{fig06-04} shows the result, which is a mixture of uniform distributions.

\begin{figure}
% chap06soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig06-04.pdf}}
\caption{Mixture of uniform distributions from three kinds of dice.}
\label{fig06-04}
\end{figure}
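
To see what the weighted sum does without Pandas, here is a minimal sketch that represents each distribution as a dictionary and uses exact fractions. It is not the implementation of \py{make_mixture}; the helpers \py{make_die_dict} and \py{mix_dicts} are hypothetical names introduced for this illustration, and \py{mix_dicts} normalizes the weights itself.

```python
from fractions import Fraction

def make_die_dict(sides):
    """Uniform distribution on 1..sides as a dict (hypothetical helper)."""
    return {q: Fraction(1, sides) for q in range(1, sides + 1)}

def mix_dicts(weights, dists):
    """Weighted sum of distributions; the weights are normalized first."""
    total_weight = sum(weights)
    mix = {}
    for w, dist in zip(weights, dists):
        for q, p in dist.items():
            # Outcomes missing from a distribution contribute 0.
            mix[q] = mix.get(q, 0) + Fraction(w, total_weight) * p
    return mix

counts = [1, 2, 3]                       # one d4, two d6, three d8
dice = [make_die_dict(sides) for sides in [4, 6, 8]]
mix = mix_dicts(counts, dice)
```

For this box, an outcome of 1 has probability $1/24 + 1/18 + 1/16 = 23/144$, while an outcome of 8 can come only from an 8-sided die and has probability $1/16$.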
\section{Summary}

We have seen two representations of distributions: Pmfs and Cdfs.
These representations are equivalent in the sense that they contain
the same information, so you can convert from one to the other. The
primary difference between them is performance: some operations are
faster and easier with a Pmf; others are faster with a Cdf.
\index{Pmf} \index{Cdf}

In this chapter we used \py{Cdf} objects to compute distributions of maxima and minima; these distributions are useful for inference if we are given a maximum or minimum as data.

We also computed mixtures of distributions, which we will use in the next chapter to make predictions.
\section{Exercises}

The code for this chapter is in \py{chap06.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap06.ipynb}.

The notebook provides space where you can work on the following problems.
\begin{exercise}
When you generate a {\it Dungeons~\&~Dragons} character, instead of rolling dice, you can use the ``standard array'' of attributes, which is 15, 14, 13, 12, 10, and 8.

Do you think you are better off using the standard array or (literally) rolling the dice?

Compare the distribution of the values in the standard array to the distribution we computed for the best three out of four:

\begin{itemize}

\item Which distribution has higher mean? Use the \py{mean} method.

\item Which distribution has higher standard deviation? Use the \py{std} method.

\item The lowest value in the standard array is 8. For each attribute, what is the probability of getting a value less than 8? If you roll the dice six times, what's the probability that at least one of your attributes is less than 8?

\item The highest value in the standard array is 15. For each attribute, what is the probability of getting a value greater than 15? If you roll the dice six times, what's the probability that at least one of your attributes is greater than 15?

\end{itemize}

\end{exercise}
\begin{exercise}
Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
I choose one of the dice at random, roll it, and report that the outcome is a 1.
If I roll the same die again, what is the probability that I get another 1?

Hint: Compute the posterior distribution as we have done before and pass it as one of the arguments to \py{make_mixture}.
\end{exercise}
\begin{exercise}
Suppose I have two boxes of dice:

\begin{itemize}
\item One contains a 4-sided die and a 6-sided die.

\item The other contains a 6-sided die and an 8-sided die.
\end{itemize}

I choose a box at random, choose a die, and roll it 3 times. If I get 2, 4, and 6, which box do you think I chose?
\end{exercise}
\newcommand{\Poincare}{Poincar\'{e}}

\begin{exercise}
Henri \Poincare~was a French mathematician who taught at the Sorbonne around 1900. The following anecdote about him is probably fabricated, but it makes an interesting probability problem.

Supposedly \Poincare~suspected that his local bakery was selling loaves of bread that were lighter than the advertised weight of 1 kg, so every day for a year he bought a loaf of bread, brought it home and weighed it. At the end of the year, he plotted the distribution of his measurements and showed that it fit a normal distribution with mean 950 g and standard deviation 50 g. He brought this evidence to the bread police, who gave the baker a warning.

For the next year, \Poincare~continued the practice of weighing his bread every day. At the end of the year, he found that the average weight was 1000 g, just as it should be, but again he complained to the bread police, and this time they fined the baker.

Why? Because the shape of the distribution was asymmetric. Unlike the normal distribution, it was skewed to the right, which is consistent with the hypothesis that the baker was still making 950 g loaves, but deliberately giving \Poincare~the heavier ones.

To see whether this anecdote is plausible, let's suppose that when the baker sees \Poincare~coming, he hefts \py{n} loaves of bread and gives \Poincare~the heaviest one. How many loaves would the baker have to heft to make the average of the maximum 1000 g?
\end{exercise}
\begin{exercise}
Two doctors fresh out of medical school are arguing about whose hospital delivers more babies. The first doctor says, ``I've been at Hospital A for two weeks, and already we've had a day when we delivered 20 babies.''

The second doctor says, ``I've only been at Hospital B for one week, but already there's been a 19-baby day.''

Which hospital do you think delivers more babies on average? You can assume that the number of babies born in a day is well modeled by a Poisson distribution with parameter $\lambda$ (see \url{https://en.wikipedia.org/wiki/Poisson_distribution}).

\end{exercise}
\begin{exercise}
This question is related to a method I developed for estimating the minimum time for a packet of data to travel through a path in the internet.

Suppose I drive the same route three times and the fastest of the three attempts takes 8 minutes.

There are two traffic lights on the route. As I approach each light, there is a 40\% chance that it is green; in that case, it causes no delay. And there is a 60\% chance it is red; in that case it causes a delay that is uniformly distributed from 0 to 60 seconds.

What is the posterior distribution of the time it would take to drive the route with no delays?
\end{exercise}
\chapter{Poisson Processes}
\label{prediction}

\newcommand{\lam}{\mathtt{\lambda}}

\section{The World Cup Problem}

In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2. Based on this outcome:

\begin{enumerate}

\item How confident should we be that France is the better team?

\item If the same teams played again, what is the chance France would win again?

\end{enumerate}

To answer these questions, we have to make some modeling decisions.

First, I'll assume that for any team against any other team there is some unknown goal-scoring rate, measured in goals per game, which I'll denote
$\lam$.

Second, I'll assume that a goal is equally likely during any minute of a game. So, in a 90-minute game, the probability of scoring during any minute is $\lam / 90$.

Third, I'll assume that a team never scores twice during the same minute.

Of course, none of these assumptions is absolutely true in the real world, but I think they are reasonable simplifications, and as we will see, they allow us to derive some useful results.
As George Box said, ``All models are wrong; some are useful''
(see \url{https://en.wikipedia.org/wiki/All_models_are_wrong}).

My strategy for answering these questions is:

\begin{enumerate}

\item Use statistics from previous games to choose a prior
distribution for $\lam$.

\item Use the score from the game to estimate $\lam$ for each team.

\item Use the posterior distributions of $\lam$ to compute
the distribution of goals for each team and the probability that each team wins
the next game.

\end{enumerate}

\section{Poisson processes}

In mathematical statistics, a {\bf process} is a stochastic model of a
physical system (``stochastic'' means that the model has some kind of
randomness in it).

For example, a {\bf Bernoulli process} is a model of a
sequence of events, called trials, in which each trial has two
possible outcomes, usually called success and failure.
So a Bernoulli process
is a natural model for a series of coin flips, or a series of shots on
goal.
\index{process}
\index{Bernoulli process}

A {\bf Poisson process} is the continuous version of a Bernoulli process,
where an event can occur at any point in time with equal probability.
Poisson processes can be used to model customers arriving in a store,
buses arriving at a bus stop, or goals scored in a soccer game.
\index{Poisson process}

In many real systems the probability of an event changes over time.
Customers are more likely to go to a store at certain times of day,
buses are supposed to arrive at fixed intervals, and goals are more
or less likely at different times during a game.

But all models are based on simplifications, and in this case modeling
a soccer game with a Poisson process is a reasonable choice. Heuer,
M\"{u}ller and Rubner (2010) analyze scoring in a German soccer league
and come to the same conclusion (see
\url{http://www.cimat.mx/Eventos/vpec10/img/poisson.pdf}).

The benefit of using this model is that we can compute the distribution
of goals per game efficiently, as well as the distribution of time
between goals. Specifically, if the average number of goals
in a game is $\lam$, the distribution of goals per game is
given by the Poisson PMF:
\index{Poisson distribution}
%
\[ f(k; \lam) = \lam^k \exp(-\lam) ~/~ k! \]
%
And the distribution of time between goals is given by the
exponential PDF:
\index{exponential distribution}
%
\[ f(t; \lam) = \lam \exp(-\lam t) \]
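
The link between the minute-by-minute assumptions and the Poisson PMF can be checked numerically. With one trial per minute and success probability $\lam / 90$, the number of goals in a game follows a binomial distribution, which the Poisson distribution approximates. The following sketch compares the two; the rate of 1.4 goals per game is an illustrative value, and the function names are chosen for this example.

```python
from math import comb, exp, factorial

lam = 1.4          # illustrative rate, in goals per game
n = 90             # one Bernoulli trial per minute
p = lam / n        # probability of a goal in any one minute

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson PMF, lam^k exp(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))

# Largest disagreement between the two PMFs over all possible goal counts.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(n + 1))
```

For these values the two distributions differ by less than one percentage point at every \py{k}, which is the sense in which the Poisson process is the continuous version of the Bernoulli process.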
%
Let's start with the Poisson distribution.
\section{The Poisson Distribution}

Suppose we know that the goal-scoring rate for one team against another is $\lam = 1.4$ goals per game.
The following function computes the Poisson distribution of \py{k}, the number of goals the team scores in one game.

\begin{code}
from scipy.stats import poisson

def make_poisson_pmf($\lam$, high):
    qs = np.arange(high)
    ps = poisson.pmf(qs, $\lam$)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}

The first parameter is the goal-scoring rate.
The second is the upper bound of the distribution.
In theory the Poisson distribution goes to infinity, but we can cut it off when we get to quantities with negligible probability.

As usual, the \py{qs} are the quantities in the distribution and the \py{ps} are their probabilities.
SciPy provides \py{poisson}, which has a function called \py{pmf} that evaluates the PMF of the Poisson distribution.

The return value is a normalized \py{Pmf}.
We can call \py{make_poisson_pmf} like this:

\begin{code}
pmf_goals = make_poisson_pmf($\lam$=1.4, high=10)
\end{code}

\begin{figure}
% chap07soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig07-01.pdf}}
\caption{Poisson distribution with $\lam=1.4$.}
\label{fig07-01}
\end{figure}

Figure~\ref{fig07-01} shows the result, a Poisson distribution with $\lam=1.4$.
The most likely outcomes are 0, 1, and 2; higher values are possible but increasingly unlikely.
Values above 7 are negligible.
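
To put a number on ``negligible'', the following sketch evaluates the Poisson PMF directly from its formula, using the standard library rather than SciPy, and measures how much probability lies beyond the cutoff. The variable names are chosen for this illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson PMF, lam^k exp(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

lam, high = 1.4, 10
ps = [poisson_pmf(k, lam) for k in range(high)]   # quantities 0..9, as in np.arange(high)

mass_kept = sum(ps)                 # probability captured before normalizing
tail_above_7 = 1 - sum(ps[:8])      # probability of more than 7 goals
print(mass_kept, tail_above_7)
```

With $\lam = 1.4$, the quantities 0 through 9 capture all but a few millionths of the probability, and the chance of more than 7 goals is on the order of one in ten thousand, so normalizing the truncated distribution changes it very little.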

If we know the goal-scoring rate, we can predict the number of goals.
Now let's turn it around: given a number of goals, what can we say about the goal-scoring rate?

To answer that, we need to think about the prior distribution of $\lam$.
And for that, I am going to use a gamma distribution.
\section{The Gamma Distribution}

If you have ever seen a soccer game, you have some information about $\lam$.
In most games, teams score a few goals each.
In rare cases, a team might score more than 5 goals, but they almost never score more than 10.

Using data from previous World Cups
I estimate that each team scores about 1.4 goals per game, on average (see \url{https://www.statista.com/statistics/269031/goals-scored-per-game-at-the-fifa-world-cup-since-1930/}). So I'll set the mean of $\lam$ to be 1.4.

For a good team against a bad one, we expect $\lam$ to be higher; for a bad team against a good one, we expect it to be lower.

To model the distribution of goal-scoring rates, I will use a gamma distribution, which I chose because:

\begin{enumerate}

\item The goal-scoring rate is a continuous quantity that cannot be less than 0; the gamma distribution is appropriate for this kind of quantity.

\item With the scale fixed at 1, as it is in the code below, the gamma distribution has only one parameter, $\alpha$, which is also the mean. So it's easy to construct a gamma distribution with the mean we want.

\item As we'll see, the shape of the gamma distribution is a reasonable choice, given what we know about soccer.

\end{enumerate}

For more about the gamma distribution, see \url{https://en.wikipedia.org/wiki/Gamma_distribution}.

The gamma distribution is continuous, but we'll approximate it with a discrete \py{Pmf}.
SciPy provides \py{gamma}, which provides \py{pdf}, which evaluates the {\bf probability density function} (PDF) of the gamma distribution.

\newcommand{\alf}{\mathtt{\alpha}}

\begin{code}
from scipy.stats import gamma

$\alf$ = 1.4
qs = np.linspace(0, 10, 101)
ps = gamma.pdf(qs, $\alf$)
\end{code}

The \py{qs} are possible values of $\lam$ from 0 to 10.
The \py{ps} are probability densities, which we can think of as unnormalized probabilities.
We can put the densities in a \py{Pmf} and normalize them, like this:

\begin{code}
prior = Pmf(ps, qs)
prior.normalize()
\end{code}

The result is a discrete approximation of a continuous distribution.
Figure~\ref{fig07-02} shows what it looks like.

\begin{figure}
% chap07soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig07-02.pdf}}
\caption{A gamma prior distribution of goal-scoring rate.}
\label{fig07-02}
\end{figure}

This distribution represents our prior knowledge about goal scoring: $\lam$ is usually less than 2, occasionally as high as 6, and seldom higher than that. And the mean is about 1.4.
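
These properties can be checked with a short standard-library sketch that evaluates the gamma density from its formula, with shape $\alpha$ and scale 1, instead of calling SciPy. The grid matches the one above; the names \py{gamma_pdf}, \py{prior_mean}, and \py{prob_below_2} are introduced for this illustration.

```python
from math import exp, gamma as gamma_function

alpha = 1.4

def gamma_pdf(x, alpha):
    """Density of the gamma distribution with shape alpha and scale 1."""
    return x ** (alpha - 1) * exp(-x) / gamma_function(alpha)

qs = [i / 10 for i in range(101)]            # 0.0, 0.1, ..., 10.0
densities = [gamma_pdf(q, alpha) for q in qs]

norm = sum(densities)
prior = [d / norm for d in densities]        # discrete approximation

prior_mean = sum(q * p for q, p in zip(qs, prior))
prob_below_2 = sum(p for q, p in zip(qs, prior) if q < 2)
print(prior_mean, prob_below_2)
```

The mean of the discrete approximation is close to, but not exactly, 1.4, because the grid is coarse and the distribution is cut off at 10.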

As usual, reasonable people could disagree about the details of the prior, but this is good enough to get started.
Let's do an update.
\section{Update}

Now that we have a prior, the next step is to compute the likelihood of the data.
For France, the data is the number of goals scored, 4.
We can use the Poisson distribution to compute the likelihoods:

\begin{code}
$\lam$s = prior.qs
k = 4
likelihood = poisson.pmf(k, $\lam$s)
\end{code}

The result is a NumPy array with the likelihood of the data for each hypothetical value of $\lam$.
So we can do the update like this:

\begin{code}
def update_poisson(pmf, data):
    k = data
    $\lam$s = pmf.qs
    likelihood = poisson.pmf(k, $\lam$s)
    pmf *= likelihood
    pmf.normalize()
\end{code}

The first parameter is the prior; the second is the number of goals.
We can use this function to compute posterior distributions for France and Croatia:

\begin{code}
france = prior.copy()
update_poisson(france, 4)

croatia = prior.copy()
update_poisson(croatia, 2)
\end{code}

Figure~\ref{fig07-03} shows the results.

\begin{figure}
% chap07soln.ipynb
\centerline{\includegraphics[width=4in]{figs/fig07-03.pdf}}
\caption{Posterior distributions of goal-scoring rate for France and Croatia.}
\label{fig07-03}
\end{figure}

Recall that the mean of the prior distribution is 1.4.
After Croatia scores 2 goals, their posterior mean is 1.7, which is near the midpoint of the prior and the data.
Likewise after France scores 4 goals, their posterior mean is 2.7.

These results are typical of a Bayesian update: the location of the posterior distribution is a compromise between the prior and the data.
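
As a rough check on these numbers, the following sketch repeats the grid update with the standard library only, assuming the same grid and a gamma prior with shape 1.4 and scale 1. It is an illustration of the computation, not the code behind the figure, and the helper names \py{update} and \py{mean} are made up here.

```python
from math import exp, factorial, gamma as gamma_function

alpha = 1.4
qs = [i / 10 for i in range(101)]            # hypothetical values of lam

# Discretized gamma prior (shape alpha, scale 1), normalized to sum to 1.
dens = [q ** (alpha - 1) * exp(-q) / gamma_function(alpha) for q in qs]
norm = sum(dens)
prior = [d / norm for d in dens]

def update(prior, k):
    """Multiply the prior by the Poisson likelihood of k goals and renormalize."""
    unnorm = [p * q**k * exp(-q) / factorial(k) for q, p in zip(qs, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def mean(pmf):
    """Mean of a distribution defined on the grid qs."""
    return sum(q * p for q, p in zip(qs, pmf))

france, croatia = update(prior, 4), update(prior, 2)
print(mean(prior), mean(croatia), mean(france))
```

The midpoints of the prior mean and the observed scores are $(1.4 + 2) / 2 = 1.7$ and $(1.4 + 4) / 2 = 2.7$, and the posterior means from the grid fall very close to those values.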
3972
3973
3974
\section{Probability of Superiority}
3975
3976
Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?
3977
3978
In the model, ``better'' means having a higher goal-scoring rate against the opponent.
3979
We can use the posterior distributions to compute the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.
3980
3981
One way to do that is to enumerate all pairs of values from the two distributions, adding up the total probability that one value exceeds the other, as in this function:
3982
3983
\begin{code}
3984
def prob_gt(pmf1, pmf2):
    total = 0
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            if q1 > q2:
                total += p1 * p2
    return total
3991
\end{code}
3992
3993
This is similar to the method we use in Section~\ref{addends} to compute the distribution of a sum.
3994
Here's how we use it:
3995
3996
\begin{code}
3997
prob_gt(france, croatia)
3998
\end{code}
3999
4000
\py{Pmf} provides a function that does the same thing, which we can call like this:
4001
4002
\begin{code}
4003
Pmf.prob_gt(france, croatia)
4004
\end{code}
4005
4006
The result is close to 75\%. So, on the basis of this game, we are reasonably confident that France is the better team.
4007
4008
Of course, we should remember that this result is based on the assumption that the goal-scoring rate is constant.
4009
In reality, if a team is down by one goal, they might play more aggressively toward the end of the game, making them more likely to score, but also more likely to give up an additional goal.
4010
4011
As always, the results are only as good as the model.
4012
4013
4014
\section{The distribution of goals}
4015
4016
Now we can take on the second question: If the same teams played again, what is the chance France would win the rematch?
4017
4018
To answer this question, we'll generate a {\bf posterior predictive distribution} for each team, which is the distribution of the number of goals we expect them to score.
4019
4020
If we knew the goal scoring rate, $\lam$, the distribution of goals would be a Poisson distribution with parameter $\lam$.
4021
4022
Since we don't know $\lam$, the distribution of goals is a mixture of Poisson distributions with different values of $\lam$.
4023
4024
First I'll generate a sequence of Poisson distributions, one for each hypothetical value of $\lam$:
4025
4026
\begin{code}
4027
pmf_seq = [make_poisson_pmf(lam, 12) for lam in prior.qs]
4028
\end{code}
4029
4030
Now we can use \py{make_mixture} from Section~\ref{mixture} to compute posterior predictive distributions for France and Croatia:
4031
4032
\begin{code}
4033
pred_france = make_mixture(france, pmf_seq)
4034
pred_croatia = make_mixture(croatia, pmf_seq)
4035
\end{code}
4036
4037
Figure~\ref{fig07-04} shows posterior predictive distributions for the number of goals in a rematch.
4038
4039
\begin{figure}
4040
% chap07soln.ipynb
4041
\centerline{\includegraphics[width=5.5in]{figs/fig07-04.pdf}}
4042
\caption{Posterior predictive distributions for the number of goals in a rematch.}
4043
\label{fig07-04}
4044
\end{figure}
4045
4046
These distributions represent two sources of uncertainty: we don't know the actual value of $\lam$, and even if we did, we would not know the number of goals in the next game.
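
The mixture step can be sketched without the \py{Pmf} class.
In the following example, the posterior is a plain dictionary that maps each hypothetical rate to its probability; the function name \py{predictive} and the two-value posterior are made up for illustration and stand in for \py{make_mixture} and the posteriors computed above.

\begin{code}
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of k goals when the goal-scoring rate is lam."""
    return lam**k * exp(-lam) / factorial(k)

def predictive(posterior, max_goals=12):
    """Mix Poisson distributions, weighted by the posterior.

    posterior: dict that maps each hypothetical lam to its probability
    returns: list where element k is the probability of k goals
    """
    pred = [0.0] * max_goals
    for lam, prob in posterior.items():
        for k in range(max_goals):
            pred[k] += prob * poisson_pmf(k, lam)
    # renormalize because the Poisson tails are cut off at max_goals
    total = sum(pred)
    return [p / total for p in pred]

# hypothetical posterior with only two possible rates
toy_posterior = {1.0: 0.5, 3.0: 0.5}
pred = predictive(toy_posterior)
mean_goals = sum(k * p for k, p in enumerate(pred))
\end{code}

The outer loop accounts for uncertainty about $\lam$; the inner loop accounts for the randomness of the score given $\lam$.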
4047
4048
We can use these distributions to compute the probability that France wins, loses, or ties the rematch:
4049
4050
\begin{code}
4051
win = Pmf.prob_gt(pred_france, pred_croatia)
4052
lose = Pmf.prob_lt(pred_france, pred_croatia)
4053
tie = Pmf.prob_eq(pred_france, pred_croatia)
4054
\end{code}
4055
4056
Assuming that France wins half of the ties, their chance of winning the rematch is about 65\%.
4057
This is a bit lower than their probability of superiority, which is 75\%. And that makes sense: even if they are the better team, they might lose the game.
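
The arithmetic behind that 65\% can be sketched as follows.
The function below compares two predictive distributions stored as lists, where element $k$ is the probability of scoring $k$ goals; the names and the two example distributions are hypothetical stand-ins, not the distributions in Figure~\ref{fig07-04}.

\begin{code}
def outcome_probs(pred1, pred2):
    """Return the probabilities that team 1 wins, loses, or ties."""
    win = lose = tie = 0.0
    for k1, p1 in enumerate(pred1):
        for k2, p2 in enumerate(pred2):
            if k1 > k2:
                win += p1 * p2
            elif k1 < k2:
                lose += p1 * p2
            else:
                tie += p1 * p2
    return win, lose, tie

# hypothetical predictive distributions over 0, 1, 2, or 3 goals
pred_a = [0.1, 0.2, 0.3, 0.4]
pred_b = [0.3, 0.3, 0.2, 0.2]

win, lose, tie = outcome_probs(pred_a, pred_b)
# if ties are decided by a coin flip, team A gets half of them
chance_a = win + tie / 2
\end{code}

For these made-up distributions, \py{win} is 0.56 and \py{tie} is 0.23, so \py{chance_a} is 0.675.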
4058
4059
4060
\section{The Exponential Distribution}
4061
\label{exponential}
4062
4063
As an exercise at the end of this chapter, you'll have a chance to work on this variation on the World Cup Problem:
4064
4065
\begin{quote}
4066
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
4067
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
4068
What was the probability that they would score 5 more goals (as, in fact, they did)?
4069
\end{quote}
4070
4071
In this version, notice that the data is not the number of goals in a fixed period of time but the time between goals.
4072
4073
To compute the likelihood of data like this, we can use the theory of Poisson processes again.
4074
In our model of a soccer game, we assume that each team has a goal-scoring rate, $\lam$, in goals per game.
4075
And we assume that $\lam$ is constant, so the chance of scoring a goal is the same at any moment of the game.
4076
4077
Under these assumptions, the time between goals follows an exponential distribution (see \url{https://en.wikipedia.org/wiki/Exponential_distribution}).
4078
If the goal-scoring rate is $\lam$, the probability of seeing an interval between goals of $t$ is proportional to the PDF of the exponential distribution:
4079
4080
$f(t; \lam) = \lam \exp(-\lam t)$
4081
4082
Because $t$ is a continuous quantity, the value of this expression is not a probability; it is a probability density.
4083
However, it is proportional to the probability of the data, so we can use it as a likelihood in a Bayesian update.
4084
4085
The following function computes this PDF:
4086
4087
\begin{code}
4088
def expo_pdf(t, lam):
    return lam * np.exp(-lam * t)
4090
\end{code}
4091
4092
To see what exponential distributions look like, let's assume again that $\lam$ is 1.4; we can compute the distribution of $t$ like this:
4093
4094
\begin{code}
4095
lam = 1.4
qs = np.linspace(0, 4, 101)
ps = expo_pdf(qs, lam)
pmf_time = Pmf(ps, qs)
pmf_time.normalize()
4100
\end{code}
4101
4102
\begin{figure}
4103
% chap01soln.ipynb
4104
\centerline{\includegraphics[width=4in]{figs/fig07-05.pdf}}
4105
\caption{An exponential distribution with $\lam = 1.4$.}
4106
\label{fig07-05}
4107
\end{figure}
4108
4109
Figure~\ref{fig07-05} shows the result.
4110
It is counterintuitive, but true, that the most likely time to score a goal is immediately. After that, the probability of each possible interval is a little lower.
4111
4112
With a goal-scoring rate of 1.4, it is possible that a team will take more than one game to score a goal, but it is unlikely that they will take more than two games.
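
We can check that claim with the survival function of the exponential distribution, $\exp(-\lam t)$, which is the probability that the wait for the next goal is longer than $t$.
The function name below is chosen for illustration.

\begin{code}
from math import exp

def prob_wait_longer(t, lam):
    """Probability that the time until the next goal exceeds t games."""
    return exp(-lam * t)

lam = 1.4
p_one = prob_wait_longer(1, lam)   # about 0.25
p_two = prob_wait_longer(2, lam)   # about 0.06
\end{code}

So a team with this rate goes a full game without scoring about a quarter of the time, and two full games about 6\% of the time.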
4113
4114
4115
\section{Summary}
4116
4117
4118
4119
\section{Exercises}
4120
4121
The code for this chapter is in \py{chap07.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
4122
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap07.ipynb}.
4123
4124
The notebook provides space where you can work on the following problems.
4125
4126
4127
\begin{exercise}
4128
Finish off the exercise from Section~\ref{exponential}:
4129
4130
\begin{quote}
4131
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark.
4132
At that point in the match, how many goals would you expect Germany to score after 90 minutes?
4133
What was the probability that they would score 5 more goals (as, in fact, they did)?
4134
\end{quote}
4135
4136
\end{exercise}
4137
4138
\begin{exercise}
4139
\end{exercise}
4140
4141
4142
4143
\begin{exercise}
4144
In the 2010-11 National Hockey League (NHL) Finals, my beloved Boston
4145
Bruins played a best-of-seven championship series against the despised
4146
Vancouver Canucks. Boston lost the first two games 0-1 and 2-3, then
4147
won the next two games 8-1 and 4-0. At this point in the series, what
4148
is the probability that Boston will win the next game, and what is
4149
their probability of winning the championship?
4150
4151
To choose a prior distribution, I got some statistics from
4152
\url{http://www.nhl.com}, specifically the average goals per game
4153
for each team in the 2010-11 season. The distribution is well modeled by a gamma distribution with mean 2.8.
4154
\index{National Hockey League}
4155
\index{NHL}
4156
\index{hockey}
4157
\index{Boston Bruins}
4158
\index{Vancouver Canucks}
4159
\end{exercise}
4160
4161
4162
4163
4164
4165
\begin{exercise}
4166
4167
If buses arrive at a bus stop every 20 minutes, and you
4168
arrive at the bus stop at a random time, your wait time until
4169
the bus arrives is uniformly distributed from 0 to 20 minutes.
4170
\index{bus stop problem}
4171
4172
But in reality, there is variability in the time between
4173
buses. Suppose you are waiting for a bus, and you know the historical
4174
distribution of time between buses. Compute your distribution
4175
of wait times.
4176
4177
Hint: Suppose that the time between buses is either
4178
5 or 10 minutes with equal probability. What is the probability
4179
that you arrive during one of the 10 minute intervals?
4180
4181
I solve a version of this problem in the next chapter.
4182
4183
\end{exercise}
4184
4185
4186
\begin{exercise}
4187
4188
Suppose that passengers arriving at the bus stop are well-modeled
4189
by a Poisson process with parameter $\lam$. If you arrive at the
4190
stop and find 3 people waiting, what is your posterior distribution
4191
for the time since the last bus arrived?
4192
\index{Poisson process}
4193
\index{bus stop problem}
4194
4195
I solve a version of this problem in the next chapter.
4196
4197
\end{exercise}
4198
4199
4200
\begin{exercise}
4201
4202
Suppose that you are an ecologist sampling the insect population in
4203
a new environment. You deploy 100 traps in a test area and come back
4204
the next day to check on them. You find that 37 traps have been
4205
triggered, trapping an insect inside. Once a trap triggers, it
4206
cannot trap another insect until it has been reset.
4207
\index{insect sampling problem}
4208
4209
If you reset the traps and come back in two days, how many traps
4210
do you expect to find triggered? Compute a posterior predictive
4211
distribution for the number of traps.
4212
\index{predictive distribution}
4213
4214
\end{exercise}
4215
4216
4217
\begin{exercise}
4218
4219
Suppose you are the manager of an apartment building with
4220
100 light bulbs in common areas. It is your responsibility
4221
to replace light bulbs when they break.
4222
\index{light bulb problem}
4223
4224
On January 1, all 100 bulbs are working. When you inspect
4225
them on February 1, you find 3 light bulbs out. If you
4226
come back on April 1, how many light bulbs do you expect to
4227
find broken?
4228
4229
In the previous exercise, you could reasonably assume that an event is
4230
equally likely at any time. For light bulbs, the likelihood of
4231
failure depends on the age of the bulb. Specifically, old bulbs
4232
have an increasing failure rate due to evaporation of the filament.
4233
4234
This problem is more open-ended than some; you will have to make
4235
modeling decisions. You might want to read about the Weibull
4236
distribution
4237
(\url{http://en.wikipedia.org/wiki/Weibull_distribution}).
4238
Or you might want to look around for information about
4239
light bulb survival curves.
4240
\index{Weibull distribution}
4241
4242
\end{exercise}
4243
4244
4245
4246
\chapter{Decision Analysis}
4247
\label{decisionanalysis}
4248
4249
In this chapter we estimate the price of prizes on a game show.
4252
Once we compute a posterior distribution, we'll use it to optimize a decision-making process.
4253
4254
This example demonstrates the real power of Bayesian methods, not just computing posterior distributions, but using them to make better decisions.
4255
4256
4257
\section{The {\it Price is Right} problem}
4258
4259
On November 1, 2007, contestants named Letia and Nathaniel appeared
4260
on {\it The Price is Right}, an American game show. They competed in
4261
a game called {\it The Showcase}, where the objective is to guess the price
4262
of a showcase of prizes. The contestant who comes closest to the
4263
actual price of the showcase, without going over, wins the prizes.
4264
4265
\index{Price is Right}
4266
\index{Showcase}
4267
4268
Nathaniel went first. His showcase included a dishwasher, a wine
4269
cabinet, a laptop computer, and a car. He bid \$26,000.
4270
4271
Letia's showcase included a pinball machine, a video arcade game, a
4272
pool table, and a cruise of the Bahamas. She bid \$21,500.
4273
4274
The actual price of Nathaniel's showcase was \$25,347. His bid
4275
was too high, so he lost.
4276
4277
The actual price of Letia's showcase was \$21,578. She was only
4278
off by \$78, so she won her showcase and, because
4279
her bid was off by less than \$250, she also won Nathaniel's
4280
showcase.
4281
4282
For a Bayesian thinker, this scenario suggests several questions:
4283
4284
\begin{enumerate}
4285
4286
\item Before seeing the prizes, what prior beliefs should the
4287
contestant have about the price of the showcase?
4288
4289
\item After seeing the prizes, how should the contestant update
4290
those beliefs?
4291
4292
\item Based on the posterior distribution, what should the
4293
contestant bid?
4294
4295
\end{enumerate}
4296
4297
The third question demonstrates a common use of Bayesian analysis:
4298
decision analysis. Given a posterior distribution, we can choose
4299
the bid that maximizes the contestant's expected return.
4300
4301
\index{decision analysis}
4302
4303
This problem is inspired by an example in Cameron Davidson-Pilon's
4304
book, {\it Probabilistic Programming and Bayesian Methods for Hackers}
4305
(see \url{http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers}).
4306
4307
\index{Davidson-Pilon, Cameron}
4308
4309
4310
\section{The prior}
4311
4312
To choose a prior distribution of prices, we can take advantage
4313
of data from previous episodes.
4314
Fortunately, fans of the show keep detailed records (see \url{https://web.archive.org/web/20121107204942/http://www.tpirsummaries.8m.com/}).
4315
4316
For this example, I downloaded files containing the price of each showcase from the 2011 and 2012 seasons and the bids offered by the contestants.
4317
4318
This dataset contains the prices for 313 previous showcases, which we can think of as a sample from the population of possible prices.
4319
4320
We can use this sample to estimate the prior distribution of showcase prices.
4321
One way to do that is {\bf kernel density estimation} (KDE), which uses the sample to estimate a smooth distribution.
4322
4323
SciPy provides \py{gaussian_kde}, which takes a sample and returns an object that represents the estimated distribution.
4324
\index{kernel density estimation}
4325
\index{KDE}
4326
4327
The following function takes a sample, makes a KDE, evaluates it at a given sequence of quantities, and returns the result as a normalized \py{Pmf}:
4328
4329
\begin{code}
4330
from scipy.stats import gaussian_kde

def make_kde(qs, sample):
    kde = gaussian_kde(sample)
    ps = kde(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
4338
\end{code}
4339
4340
We can use it to estimate the distribution of total price for Showcase 1:
4341
4342
\begin{code}
4343
qs = np.linspace(0, 80000, 81)
4344
prior1 = make_kde(qs, df['Showcase 1'])
4345
\end{code}
4346
4347
\begin{figure}
4348
% chap08soln.ipynb
4349
\centerline{\includegraphics[width=4in]{figs/fig08-01.pdf}}
4350
\caption{Distribution of total price for Showcase 1}
4351
\label{fig08-01}
4352
\end{figure}
4353
4354
Figure~\ref{fig08-01} shows the estimated distribution.
4355
The most common price is around
4356
\$28,000, but there might be a second mode near \$50,000.
4357
4358
If you were a contestant on the
4359
show, you could use this distribution to quantify your prior belief
4360
about the price of each showcase (before you see the prizes).
4361
4362
4363
\py{gaussian_kde} builds its estimate from Gaussian kernels;
here is the PDF of a Gaussian distribution with
mean 0 and standard deviation 1:
4365
%
4366
\[ f(x) = \frac{1}{\sqrt{2 \pi}} \exp(-x^2/2) \]
4367
%
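
As a rough sketch of what a Gaussian KDE computes, the following function places one Gaussian bump at each value in the sample and averages them.
The names \py{gaussian_pdf} and \py{kde_estimate}, the sample, and the fixed \py{bandwidth} are assumptions for illustration; by default, \py{gaussian_kde} chooses its bandwidth from the data.

\begin{code}
from math import exp, pi, sqrt

def gaussian_pdf(x):
    """PDF of a Gaussian distribution with mean 0 and standard deviation 1."""
    return exp(-x**2 / 2) / sqrt(2 * pi)

def kde_estimate(x, sample, bandwidth):
    """Estimated density at x: the average of Gaussian bumps,
    one centered on each value in the sample."""
    total = sum(gaussian_pdf((x - s) / bandwidth) for s in sample)
    return total / (len(sample) * bandwidth)

# hypothetical showcase prices and bandwidth, in dollars
sample = [21000, 25000, 28000, 30000, 52000]
bandwidth = 4000

qs = range(0, 80001, 1000)
ps = [kde_estimate(q, sample, bandwidth) for q in qs]
# normalize so the values add up to 1, as make_kde does
total = sum(ps)
ps = [p / total for p in ps]
\end{code}

A smaller bandwidth makes the estimate follow the sample more closely; a larger one makes it smoother.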
4368
4369
4370
4371
4372
4373
\section{Modeling the contestants}
4374
4375
When the contestants see the prizes, they get information they can use to update their beliefs.
4376
To do that, we have to answer these questions:
4377
4378
\begin{enumerate}
4379
4380
\item What data should we consider and how should we quantify it?
4381
4382
\item Can we compute a likelihood function; that is,
4383
for each hypothetical value of \py{price}, can we compute
4384
the conditional likelihood of the data?
4385
4386
\end{enumerate}
4387
4388
To answer these questions, I model the contestant
4389
as a price-guessing instrument with known error characteristics.
4390
In other words, when the contestant sees the prizes, they
4391
guess the price of each prize---ideally without taking into
4392
consideration the fact that the prize is part of a showcase---and
4393
add up the prices. Let's call this total \py{guess}.
4394
\index{error}
4395
4396
Under this model, the question we have to answer is, ``If the
4397
actual price is \py{price}, what is the likelihood that the
4398
contestant's estimate would be \py{guess}?''
4399
\index{likelihood}
4400
4401
Or if we define \py{error = guess - price}, we can ask, ``What is the likelihood that the contestant's estimate is off by \py{error}?''
4402
4403
To answer this question, I'll use the historical data again.
4404
For each showcase in the dataset, let's look at the difference between the contestant's bid and the actual price:
4405
4406
\begin{code}
4407
sample_diff1 = df['Bid 1'] - df['Showcase 1']
4408
sample_diff2 = df['Bid 2'] - df['Showcase 2']
4409
\end{code}
4410
4411
To visualize the distribution of these differences, we can use KDE again.
4412
4413
\begin{code}
4414
qs = np.linspace(-40000, 20000, 61)
4415
kde_diff1 = make_kde(qs, sample_diff1)
4416
kde_diff2 = make_kde(qs, sample_diff2)
4417
\end{code}
4418
4419
\begin{figure}
4420
% chap08soln.ipynb
4421
\centerline{\includegraphics[width=4in]{figs/fig08-02.pdf}}
4422
\caption{Distribution of differences for the two contestants.}
4423
\label{fig08-02}
4424
\end{figure}
4425
4426
Figure~\ref{fig08-02} shows the results.
4427
4428
It looks like the bids are too low more often than too high, which makes sense.
4429
Remember that under the rules of the game, you lose if you overbid, so contestants probably underbid to some degree deliberately.
4430
4431
We can use the observed distribution of differences to model the contestant's distribution of errors.
4432
This step is a little tricky because we don't actually know the contestant's guesses; we only know what they bid.
4433
So we have to make some assumptions:
4434
4435
\begin{enumerate}
4436
4437
\item I'll assume that contestants underbid because they are being strategic, and that on average their guesses are accurate. In other words, the mean of their errors is 0.
4438
4439
\item But I'll assume that the spread of the differences reflects the actual spread of their errors. So, I'll use the standard deviation of the differences as the standard deviation of their errors.
4440
4441
\end{enumerate}
4442
4443
Based on these assumptions, I'll make a normal distribution with mean 0 and standard deviation \py{std_diff1}:
4444
4445
\begin{code}
4446
from scipy.stats import norm

# standard deviation of the observed differences (bid - price)
std_diff1 = sample_diff1.std()
error_dist1 = norm(0, std_diff1)
4449
\end{code}
4450
4451
The result is an object that represents the distribution of errors for Player 1.
4452
Among other things, this object can compute the PDF of a normal distribution, which we will use in the next section.
4453
4454
\index{normal distribution}
4455
4456
This model is not perfect because contestants' bids are sometimes strategic; for example, if Player 2 thinks that Player 1
4457
has overbid, Player 2 might make a very low bid.
4458
In that case \py{diff} does not reflect \py{error}.
4459
If this happens a lot, the observed variance in \py{diff} might overestimate the variance in \py{error}.
4460
Nevertheless, I think it is a reasonable modeling decision.
4461
4462
As an alternative, someone preparing to appear on the show could
4463
estimate their own distribution of \py{error} by watching previous shows
4464
and recording their guesses and the actual prices.
4465
4466
4467
\section{Update}
4468
4469
Now we are ready to do the update.
4470
4471
Suppose you are Player 1. You see the prizes in your showcase and your estimate of the total price is \$23,000.
4472
4473
I'll subtract each hypothetical price in the prior distribution from your guess.
4474
The result is your error under each hypothesis.
4475
4476
\begin{code}
4477
guess1 = 23000
4478
qs = prior1.index
4479
error1 = guess1 - qs
4480
\end{code}
4481
4482
Now suppose you know based on past performance that your estimation error is well modeled by \py{error_dist1}.
4483
4484
Under that assumption we can compute the likelihood of your estimate under each hypothesis.
4485
4486
\begin{code}
4487
likelihood1 = error_dist1.pdf(error1)
4488
\end{code}
4489
4490
And we can use that likelihood to update the prior.
4491
4492
\begin{code}
4493
posterior1 = prior1 * likelihood1
4494
posterior1.normalize()
4495
\end{code}
4496
4497
Figure~\ref{fig08-03} shows this posterior distribution along with the prior.
4498
Because your estimate is in the lower end of the range, the posterior distribution has shifted to the left.
4499
4500
\begin{figure}
4501
% chap08soln.ipynb
4502
\centerline{\includegraphics[width=4in]{figs/fig08-03.pdf}}
4503
\caption{Prior and posterior distributions for Player 1.}
4504
\label{fig08-03}
4505
\end{figure}
4506
4507
Based on the prior mean, before you saw the prizes you expected to see a showcase with a value close to \$30,000.
4508
4509
After making an estimate of \$23,000, you updated the prior distribution.
4510
Based on the combination of the prior and your estimate, you now expect the actual price to be about \$26,000.
4511
4512
On one level, this result makes sense.
4513
The posterior mean is near the midpoint of your estimate and the prior mean.
4514
4515
On another level, you might find this result strange because it
4516
suggests that if you {\em think} the price is \$23,000, then you
4517
should {\em believe} the price is \$26,000.
4518
4519
To resolve this apparent paradox, remember that you are combining two
4520
sources of information, historical data about past showcases and
4521
guesses about the prizes you see.
4522
4523
We are treating the historical data as the prior and updating it
4524
based on your guesses, but we could equivalently use your guess
4525
as a prior and update it based on historical data.
4526
4527
If you think of it that way, maybe it is less surprising that the
4528
most likely value in the posterior is not your original guess.
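
To see how the posterior mean ends up between the prior mean and the guess, here is a sketch that uses a normal distribution as a stand-in for the prior.
The prior standard deviation of \$10,000 and the error standard deviation of \$7,000 are made-up values, not estimates from the dataset, and the function names are hypothetical.

\begin{code}
from math import exp

def normal_shape(x, mu, sigma):
    """Unnormalized normal density; the constant cancels when we normalize."""
    return exp(-((x - mu) / sigma) ** 2 / 2)

def posterior_mean(guess, prior_mu, prior_sigma, error_sigma):
    """Grid update of a normal-shaped prior with a normal error model."""
    prices = range(0, 80001, 1000)
    prior = [normal_shape(price, prior_mu, prior_sigma) for price in prices]
    # likelihood of the guess under each hypothetical price
    likelihood = [normal_shape(guess - price, 0, error_sigma)
                  for price in prices]
    posterior = [p * like for p, like in zip(prior, likelihood)]
    total = sum(posterior)
    return sum(price * p for price, p in zip(prices, posterior)) / total

mean = posterior_mean(guess=23000, prior_mu=30000,
                      prior_sigma=10000, error_sigma=7000)
\end{code}

With these made-up numbers the posterior mean falls between \$23,000 and \$30,000, closer to the guess because the error model is narrower than the prior.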
4529
4530
\section{Strategy}
4531
4532
Now that we have a posterior distribution, let's think about strategy.
4533
4534
%TODO: Outline of the sections that follow
4535
4536
4537
4538
\section{Probability of Winning}
4539
4540
First, from the point of view of Player 1, let's compute the probability that Player 2 overbids.
4541
To keep it simple, I'll use only the performance of past players, ignoring the estimated price of the showcase.
4542
4543
The following function takes a sequence of past bids and returns the fraction that overbid.
4544
4545
\begin{code}
4546
def prob_overbid(sample_diff):
    return np.mean(sample_diff > 0)
4548
\end{code}
4549
4550
In the dataset, Player 2 overbids about 30\% of the time.
4551
4552
Now suppose Player 1 underbids by \$5000.
4553
What is the probability that Player 2 underbids by more?
4554
4555
The following function uses past performance to estimate the probability that a player underbids by more than a given amount, \py{diff}:
4556
4557
\begin{code}
4558
def prob_worse_than(diff, sample_diff):
    return np.mean(sample_diff < diff)
4560
\end{code}
4561
4562
Player 2 underbids by more than \$5000 about 40\% of the time.
4563
4564
We can combine these functions to compute the probability that Player 1 wins, given the difference between their bid and the actual price:
4565
4566
\begin{code}
4567
def compute_prob_win(diff, sample_diff):
    # if you overbid you lose
    if diff > 0:
        return 0

    # if the opponent overbids, you win
    p1 = prob_overbid(sample_diff)

    # or if their bid is worse than yours, you win
    p2 = prob_worse_than(diff, sample_diff)
    return p1 + p2
4578
\end{code}
4579
4580
Let's look at this from your point of view as a contestant.
4581
\py{diff} is the difference between your bid and the actual price; if it's greater than 0, you overbid, so you lose.
4582
4583
\py{sample_diff} is a sample of differences for your opponent.
4584
If they overbid (and you didn't) you win.
4585
4586
Otherwise, we have to see whose bid is closer, yours or your opponent's. If their bid is worse than yours, you win.
4587
4588
As an example, you can call it like this:
4589
4590
\begin{code}
4591
compute_prob_win(-5000, sample_diff2)
4592
\end{code}
4593
4594
If Player 1 underbids by \$5000, their chance of winning is about 67\%.
4595
Now let's look at the probability of winning for a range of possible differences.
4596
4597
\begin{code}
4598
xs = np.linspace(-30000, 5000, 121)
4599
ys = [compute_prob_win(x, sample_diff2) for x in xs]
4600
\end{code}
4601
4602
From the point of view of Player 1, Figure~\ref{fig08-04} shows the probability of winning as a function of the difference between their bid and the actual price.
4603
4604
\begin{figure}
4605
% chap08soln.ipynb
4606
\centerline{\includegraphics[width=4in]{figs/fig08-04.pdf}}
4607
\caption{For Player 1, the probability of winning as a function of the difference between their bid and the actual price.}
4608
\label{fig08-04}
4609
\end{figure}
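
To make the arithmetic in \py{compute_prob_win} concrete, the following sketch applies the same logic to a small sample of differences.
The ten values are invented for illustration, and the function is rewritten with plain Python lists so it runs on its own.

\begin{code}
def prob_win(diff, sample_diff):
    """Probability of winning, given your diff and the opponent's past diffs."""
    # if you overbid you lose
    if diff > 0:
        return 0.0
    n = len(sample_diff)
    # fraction of the time the opponent overbids
    p_overbid = sum(d > 0 for d in sample_diff) / n
    # fraction of the time the opponent underbids by more than you do
    p_worse = sum(d < diff for d in sample_diff) / n
    return p_overbid + p_worse

# invented differences (bid minus price) for the opponent
toy_diffs = [-12000, -8000, -7000, -6000, -3000,
             -1000, -400, 500, 1500, 2000]

p = prob_win(-5000, toy_diffs)
\end{code}

In this sample the opponent overbids 3 times out of 10 and underbids by more than \$5,000 4 times out of 10, so the result is 0.7.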
4610
4611
4612
\section{Decision Analysis}
4613
4614
In the previous section we computed the probability of winning given that we have underbid by a particular amount.
4615
4616
In reality the contestants don't know how much they have underbid by because they don't know the actual price.
4617
4618
But they do have a posterior distribution that represents their beliefs about the actual price, and they can use that to estimate their probability of winning with a given bid.
4619
4620
The following function takes a possible bid, a posterior distribution of actual prices, and a sample of differences for the opponent, and returns the total probability of winning:
4621
4622
\begin{code}
4623
def total_prob_win(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        diff = bid - price
        total += prob * compute_prob_win(diff, sample_diff)
    return total
4629
\end{code}
4630
4631
It loops through the hypothetical prices in the posterior distribution and for each price:
4632
4633
\begin{enumerate}
4634
4635
\item Computes the difference between the bid and the hypothetical price.
4636
4637
\item Computes the probability that the player wins, given that difference.
4638
4639
\item Adds up the weighted sum of the probabilities, where the weights are the probabilities in the posterior distribution.
4640
4641
\end{enumerate}
4642
4643
This loop implements the law of total probability:
4644
4645
\[ \p{win} = \sum_{price} \p{price} ~ \p{win ~|~ price} \]
4646
4647
Now we can loop through a range of possible bids and compute the probability of winning:
4648
4649
\begin{code}
4650
bids = posterior1.index
4651
probs = [total_prob_win(bid, posterior1, sample_diff2)
         for bid in bids]
4653
\end{code}
4654
4655
For Player 1, Figure~\ref{fig08-05} shows the probability of winning as a function of their bid.
4656
4657
\begin{figure}
4658
% chap08soln.ipynb
4659
\centerline{\includegraphics[width=4in]{figs/fig08-05.pdf}}
4660
\caption{For Player 1, the probability of winning as a function of their bid.}
4661
\label{fig08-05}
4662
\end{figure}
4663
4664
Recall that your estimate was \$23,000.
4665
4666
After using your estimate to compute the posterior distribution, the posterior mean is about \$26,000.
4667
4668
But the bid that maximizes your chance of winning is \$21,000; with that bid, the probability of winning is 52\%.
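
One way to find the bid with the highest probability of winning is to take the maximum over the candidate bids.
The following sketch does that with a made-up three-point posterior and an invented sample of differences; the helper names are hypothetical, and the logic mirrors \py{total_prob_win}.

\begin{code}
def prob_win(diff, sample_diff):
    """Probability of winning, given your diff and the opponent's past diffs."""
    if diff > 0:
        return 0.0
    n = len(sample_diff)
    p_overbid = sum(d > 0 for d in sample_diff) / n
    p_worse = sum(d < diff for d in sample_diff) / n
    return p_overbid + p_worse

def total_prob_win(bid, posterior, sample_diff):
    """Average the probability of winning over the posterior prices."""
    return sum(prob * prob_win(bid - price, sample_diff)
               for price, prob in posterior.items())

# made-up posterior (price -> probability) and opponent differences
toy_posterior = {20000: 0.2, 25000: 0.5, 30000: 0.3}
toy_diffs = [-12000, -8000, -7000, -6000, -3000,
             -1000, -400, 500, 1500, 2000]

bids = range(15000, 31000, 1000)
probs = {bid: total_prob_win(bid, toy_posterior, toy_diffs) for bid in bids}
best_bid = max(probs, key=probs.get)
\end{code}

With a posterior this coarse, the best bid lands exactly on one of the three hypothetical prices, which is an artifact of the toy example.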
4669
4670
4671
\section{Expected Gain}
4672
4673
In the previous section we computed the bid that maximizes your chance of winning.
4674
And if that's your goal, the bid we computed is optimal.
4675
4676
But winning isn't everything.
4677
Remember that if your bid is off by \$250 or less, you win both showcases.
4678
So it might be a good idea to increase your bid a little: it increases the chance you overbid and lose, but it also increases the chance of winning both showcases.
4679
4680
Let's see how that works out.
4681
The following function computes how much you will win, on average, given your bid, the actual price, and a sample of errors for your opponent.
4682
4683
\begin{code}
4684
def compute_gain(bid, price, sample_diff):
    diff = bid - price
    prob = compute_prob_win(diff, sample_diff)

    # if you are within 250 dollars, you win both showcases
    if -250 <= diff <= 0:
        return 2 * price * prob
    else:
        return price * prob
4693
\end{code}
4694
4695
For simplicity, I assume that both showcases have the same value.
4696
Since the probability of winning both showcases is small, the effect of this simplification should be small.
4697
4698
As an example, if the actual price is \$35,000
and you bid \$30,000,
you will win about \$23,600 worth of prizes on average.
4701
4702
In reality we don't know the actual price, but we have a posterior distribution that represents what we know about it.
4703
By averaging over the prices and probabilities in the posterior distribution, we can compute the {\bf expected gain} for a particular bid.
4704
4705
\begin{code}
4706
def expected_gain(bid, posterior, sample_diff):
    total = 0
    for price, prob in posterior.items():
        total += prob * compute_gain(bid, price, sample_diff)
    return total
4711
\end{code}
4712
4713
The first argument is your bid; the second is the posterior distribution that represents your belief about the price of the showcase; and \py{sample_diff} is a sample of differences for your opponent.
4714
4715
For the posterior we computed earlier, based on an estimate of \$23,000,
4716
the expected gain for a bid of \$21,000
4717
is about \$16,900.
4718
4719
But can we do better?
4720
To find out, we can loop through a range of bids and find the one that maximizes expected gain.
4721
4722
\begin{code}
4723
bids = posterior1.index
4724
4725
gains = [expected_gain(bid, posterior1, sample_diff2) for bid in bids]
4726
4727
expected_gain_series = pd.Series(gains, index=bids)
4728
\end{code}
4729
4730
Figure~\ref{fig08-06} shows expected gain for a range of possible bids.
4731
4732
\begin{figure}
4733
% chap08soln.ipynb
4734
\centerline{\includegraphics[width=4in]{figs/fig08-06.pdf}}
4735
\caption{Expected gain for a range of possible bids.}
4736
\label{fig08-06}
4737
\end{figure}
4738
4739
Recall that the estimated value of the prizes is \$23,000 and the bid that maximizes the chance of winning is \$21,000.
4740
The bid that maximizes your expected gain is \$22,000; with that bid, your expected gain is about \$17,400.
4741
4742
4743
\section{Discussion}
4744
4745
One of the features of Bayesian estimation is that the
4746
result comes in the form of a posterior distribution. Classical
4747
estimation usually generates a single point estimate or a confidence
4748
interval, which is sufficient if estimation is the last step in the
4749
process, but if you want to use an estimate as an input to a
4750
subsequent analysis, point estimates and intervals are often not much
4751
help.
4752
\index{distribution}
4753
4754
In this example, we use the posterior distribution
4755
to compute an optimal bid. The return on a given bid is asymmetric
4756
and discontinuous (if you overbid, you lose), so it would be hard to
4757
solve this problem analytically. But it is relatively simple to do
4758
computationally.
4759
\index{decision analysis}
4760
4761
Newcomers to Bayesian thinking are often tempted to summarize the
4762
posterior distribution by computing the mean or the maximum
4763
likelihood estimate. These summaries can be useful, but if that's
4764
all you need, then you probably don't need Bayesian methods in the
4765
first place.
4766
\index{maximum likelihood}
4767
\index{summary statistic}
4768
4769
Bayesian methods are most useful when you can carry the posterior
4770
distribution into the next step of the analysis to perform some
4771
kind of decision analysis, as we did in this chapter, or some kind of
4772
prediction, as we see in the next chapter.
4773
4774
\section{Exercises}
4775
4776
The code for this chapter is in \py{chap08.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
4777
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap08.ipynb}.
4778
4779
The notebook provides space where you can work on the following problems.
4780
4781
\begin{exercise}
4782
Following the instructions in the notebook, replicate the analysis in this chapter from the point of view of Player 2.
4783
\end{exercise}
4784
4785
\begin{exercise}
4786
4787
This exercise is inspired by a true story. In 2001 I created Green Tea Press to publish my books, starting with {\tt Think Python}.
I ordered 100 copies from a short-run printer and made the book available for sale through a distributor. After the first week, the distributor reported that 12 copies were sold. Based on that report, I thought I would run out of copies in about 8 weeks, so I got ready to order more. My printer offered me a discount if I ordered more than 1000 copies, so I went a little crazy and ordered 2000 copies. A few days later, my mother called to tell me that her copies of the book had arrived. Surprised, I asked how many ``copies''. She said ten.
4789
4790
It turned out I had sold only two copies to non-relatives. And it took a lot longer than I expected to sell 2000 copies.
4791
4792
The details of this story are unique, but the general problem is something almost every retailer has to figure out. Based on past sales, how do you predict future sales? And based on those predictions, how do you decide how much to order and when?
4793
4794
Often the cost of a bad decision is complicated. If you place a lot of small orders rather than one big one, your costs are likely to be higher. If you run out of inventory, you might lose customers. And if you order too much, you have to pay the various costs of holding inventory.
4795
4796
So, let's solve a version of the problem I faced. Suppose you start selling books online. During the first week you sell 12 copies (and let's assume that none of the customers are your mother). During the second week you sell 8 copies.
4797
4798
Assuming that the arrival of orders is a Poisson process, we can think of the weekly orders as samples from a Poisson distribution with an unknown rate.
4799
Choose a prior you think is appropriate and use the data to compute the posterior distribution of the order rate.
4800
Then generate a posterior predictive distribution for the number of copies you expect during the next 8 weeks.
4801
4802
\begin{itemize}
4803
4804
\item Suppose the cost of printing the book is \$5 per copy,
4805
4806
\item But if you order 100 or more, it's \$4.50 per copy.
4807
4808
\item For every book you sell, you get \$10.
4809
4810
\item But if you run out of books before the end of 8 weeks, you lose \$50 in future sales for every week you are out of stock.
4811
4812
\item If you have books left over at the end of 8 weeks, you lose \$2 in inventory costs per extra book.
4813
\end{itemize}
4814
4815
For example, suppose you get orders for 10 books per week, every week.
4816
4817
If you order 60 books,
4818
\begin{itemize}
4819
4820
\item The total cost is \$300.
4821
4822
\item You sell all 60 books, so you make \$600.
4823
4824
\item But the book is out of stock for two weeks, so you lose \$100 in future sales.
4825
\end{itemize}
4826
4827
In total, your profit is \$200.
4828
4829
If you order 100 books,
4830
\begin{itemize}
4831
4832
\item The total cost is \$450.
4833
4834
\item You sell 80 books, so you make \$800.
4835
4836
\item But you have 20 books left over at the end, so you lose \$40.
4837
\end{itemize}
4838
4839
In total, your profit is \$310.
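
The arithmetic in these two examples can be checked with a short function. This is only a sketch under a simplifying assumption of mine: a week counts as out of stock only if it starts with no inventory. The names \py{profit} and \py{weekly_demand} are hypothetical and are not taken from the notebook.

```python
def profit(order, weekly_demand):
    """Profit from ordering `order` books, given the demand in each week.

    Assumption: a week is "out of stock" only if it starts with no books.
    """
    unit_cost = 4.50 if order >= 100 else 5.00   # discount at 100 or more
    stock, sold, weeks_out = order, 0, 0
    for demand in weekly_demand:
        if stock == 0:
            weeks_out += 1
            continue
        n = min(stock, demand)    # cannot sell more than we have
        sold += n
        stock -= n
    revenue = 10 * sold
    lost_sales = 50 * weeks_out
    holding = 2 * stock           # cost of books left over at the end
    return revenue - unit_cost * order - lost_sales - holding

print(profit(60, [10] * 8))    # 200.0
print(profit(100, [10] * 8))   # 310.0
```

To answer the exercise, a function like this would be averaged over the predictive distribution of demand rather than evaluated at a fixed 10 books per week.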
4840
4841
Combining these costs with your predictive distribution, how many books should you order to maximize your expected profit?
4842
4843
In the notebook for this chapter, I provide some code to get you started.
4844
4845
\end{exercise}
\chapter{Comparisons}
4851
\label{comparison}
4852
4853
The Elo rating system is a way to quantify the skill of players for games like chess (see \url{https://en.wikipedia.org/wiki/Elo_rating_system}).
4854
4855
It is based on a model of the relationship between the ratings of players and the outcome of a game.
4856
Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function (see \url{https://en.wikipedia.org/wiki/Logistic_function}):
4857
4858
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
4859
4860
The parameters $10$ and $400$ are arbitrary choices that determine the range of the ratings. In chess, the range is from 100 to 2800.
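
To get a feel for the scale set by these parameters, it helps to evaluate the formula for a few rating gaps. In this sketch the function name \py{prob_a_beats_b} is mine; the body is a direct transcription of the formula above. Equal ratings give probability 0.5, and a 400-point advantage gives odds of 10 to 1, that is, probability $10/11$.

```python
def prob_a_beats_b(r_a, r_b):
    """Probability that A beats B under the Elo model above."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

print(prob_a_beats_b(1500, 1500))   # equal ratings: 0.5
print(prob_a_beats_b(1900, 1500))   # 400-point advantage: about 0.909
```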
4861
4862
Suppose \py{A} has a current rating of 1600 and \py{B} has a current rating of 1800.
4863
Then \py{A} and \py{B} play and \py{A} wins. How should we update their ratings?
4864
4865
In this chapter I will solve a simpler version of this question; then you will have a chance to finish it off as an exercise.
4866
4867
This chapter introduces {\bf joint distributions}, which represent the distributions of two or more variables and the relationships among them.

We'll extend the Bayesian update process we've seen in previous chapters and apply it to a joint distribution.
4870
4871
But first I will introduce a tool we will use to construct joint distributions and compute likelihoods: outer operations.
4872
4873
4874
\section{Outer operations}
4875
\label{outer-operations}
4876
4877
Many useful operations can be expressed in the form of an {\bf outer operation} of two sequences.
4878
Suppose you have sequences like \py{t1} and \py{t2}:
4879
4880
\begin{code}
4881
t1 = [1,3,5]
4882
t2 = [2,4]
4883
\end{code}
4884
4885
The most common outer operation is the outer product, which computes the product of every pair of values, one from each sequence.
4886
4887
For example, here is the outer product of \py{t1} and \py{t2}:
4888
4889
\begin{code}
4890
a = np.multiply.outer(t1, t2)
4891
\end{code}
4892
4893
The result is a NumPy array, but it's easier to understand what it is if I put it in a DataFrame:
4894
4895
\begin{code}
4896
df = pd.DataFrame(a, index=t1, columns=t2)
4897
\end{code}
4898
4899
Here's the result:
4900
4901
\input{tables/table09-02}
4902
4903
The values from \py{t1} appear along the rows; the values from \py{t2} appear along the columns.
4904
4905
Each element in the array is the product of an element from \py{t1} and an element from \py{t2}.
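
One way to see what an outer operation does is to compare it with NumPy broadcasting. The following sketch, which is my illustration rather than code from the notebook, reshapes \py{t1} into a column so that multiplying by \py{t2} pairs every element of one with every element of the other.

```python
import numpy as np

t1 = np.array([1, 3, 5])
t2 = np.array([2, 4])

a = np.multiply.outer(t1, t2)    # shape (3, 2)
b = t1[:, np.newaxis] * t2       # column vector times row vector

print(a)
print(np.array_equal(a, b))      # True
```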
4906
4907
The outer sum is similar, except that each element is the {\em sum} of an element from \py{t1} and an element from \py{t2}.
4908
4909
\begin{code}
4910
a = np.add.outer(t1, t2)
4911
df = pd.DataFrame(a, index=t1, columns=t2)
4912
\end{code}
4913
4914
Here's the result:
4915
4916
\input{tables/table09-02}
4917
4918
These outer operations work with Python lists and tuples, and NumPy arrays, but not Pandas \py{Series}.
4919
4920
So I'll use the following function, which takes two Pandas \py{Series} and puts the result into a \py{DataFrame}.
4921
4922
\begin{code}
def outer_product(s1, s2):
    a = np.multiply.outer(s1.to_numpy(), s2.to_numpy())
    return pd.DataFrame(a, index=s1.index, columns=s2.index)
\end{code}
4927
4928
It might not be obvious yet why these operations are useful, but we'll see some examples soon.
4929
4930
With that, we are ready to take on a new Bayesian problem.
4931
4932
\section{How tall is A?}
4933
4934
Suppose I choose two people from the population of adult males in the United States, and call them A and B. If we see that A is taller than B, how tall is A?
4935
4936
To answer this question:
4937
4938
\begin{enumerate}
4939
4940
\item I'll use background information about the height of men in the U.S. to form a prior distribution of height;

\item I'll construct a joint distribution of height for A and B (and I'll explain what that is);

\item Then I'll update the prior with the information that A is taller; and

\item From the posterior joint distribution I'll extract the posterior distribution of height for A.
4947
4948
\end{enumerate}
4949
4950
In the U.S. the average height of male adults is 178 cm and the standard deviation is 7.7 cm. The distribution is not exactly normal, because nothing in the real world is, but the normal distribution is a pretty good model of the actual distribution, so we can use it as a prior distribution for A and B.
4951
4952
Here's an array of equally-spaced values from roughly 3 standard deviations below the mean to 3 standard deviations above.
4953
4954
\begin{code}
4955
mean = 178
4956
std = 7.7
4957
qs = np.arange(mean-24, mean+24, 0.5)
4958
\end{code}
4959
4960
SciPy provides \py{norm}, which represents a normal distribution with a given mean and standard deviation. It provides \py{pdf}, which evaluates the normal probability density function (PDF); we will use the results as the prior probabilities.
4961
4962
\begin{code}
4963
from scipy.stats import norm
4964
ps = norm(mean, std).pdf(qs)
4965
\end{code}
4966
4967
I'll store the \py{ps} and \py{qs} in a \py{Pmf} that represents the prior distribution.
4968
4969
\begin{code}
4970
prior = Pmf(ps, qs)
4971
prior.normalize()
4972
\end{code}
4973
4974
This distribution represents what we believe about the heights of \py{A} and \py{B} before we take into account the data that \py{A} is taller.
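
Because the prior is normalized after it is evaluated on a grid, only the shape of the density matters; the constant factor in the normal PDF cancels. The following sketch makes that explicit. It uses NumPy alone in place of \py{norm} and \py{Pmf}, which is a substitution of mine for illustration, and the name \py{unnorm} is hypothetical.

```python
import numpy as np

mean, std = 178, 7.7
qs = np.arange(mean - 24, mean + 24, 0.5)

# Unnormalized normal density: the factor 1 / (std * sqrt(2 * pi)) is omitted.
unnorm = np.exp(-0.5 * ((qs - mean) / std) ** 2)
ps = unnorm / unnorm.sum()

print(ps.sum())            # 1, up to floating-point error
print(qs[np.argmax(ps)])   # 178.0, the most probable height on the grid
```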
4975
4976
4977
\section{Joint distribution}
4978
4979
The next step is to construct a distribution that represents the probability of every pair of heights, which is called a joint distribution.
4980
The elements of the joint distribution are
4981
4982
$\p{A_y~\mathrm{and}~B_x}$
4983
4984
which is the probability that \py{A} is $y$ cm tall and \py{B} is $x$ cm tall, for all values of $y$ and $x$.
4985
4986
At this point all we know about \py{A} and \py{B} is that they are male residents of the U.S., so their heights are independent; that is, knowing the height of \py{A} provides no additional information about the height of \py{B}.
4987
In that case, we can compute the joint probabilities like this:
4988
4989
$\p{A_y~\mathrm{and}~B_x} = \p{A_y}~\p{B_x}$
4990
4991
Each joint probability is the product of one element from the distribution for \py{A} and one element from the distribution for \py{B}.
4992
So we can compute the joint distribution using \py{outer_product}:
4993
4994
\begin{code}
4995
joint = outer_product(prior, prior)
4996
joint.shape
4997
\end{code}
4998
4999
The result is a \py{DataFrame} with possible heights of \py{A} along the rows, heights of \py{B} along the columns, and the joint probabilities as elements.
5000
5001
The following function uses \py{pcolormesh} to plot the joint distribution.
5002
5003
\begin{code}
def plot_joint(joint):
    plt.pcolormesh(joint.columns, joint.index, joint)
    plt.colorbar()
    decorate(ylabel='A height in cm',
             xlabel='B height in cm')
\end{code}
5010
5011
Recall that \py{outer_product} puts the values of \py{A} along the rows and the values of \py{B} across the columns.
5012
5013
Figure~\ref{fig09-01} shows the results.
5014
5015
\begin{figure}
5016
% chap09soln.ipynb
5017
\centerline{\includegraphics[width=4in]{figs/fig09-01.pdf}}
5018
\caption{Joint prior distribution of height for A and B.}
5019
\label{fig09-01}
5020
\end{figure}
5021
5022
As you might expect, the probability is highest near the mean and drops off away from the mean.
5023
5024
5025
\section{Likelihood}
5026
5027
Now that we have a joint prior distribution, we can update it with the data, which is that \py{A} is taller than \py{B}.
5028
5029
Each element in the joint distribution represents a hypothesis about the heights of \py{A} and \py{B}; for example:
5030
5031
\begin{enumerate}
5032
5033
\item The element \py{(180, 170)} represents the hypothesis that \py{A} is 180 cm tall and \py{B} is 170 cm tall. Under this hypothesis, the probability that \py{A} is taller than \py{B} is 1.
5034
5035
\item The element \py{(170, 180)} represents the hypothesis that \py{A} is 170 cm tall and \py{B} is 180 cm tall. Under this hypothesis, the probability that \py{A} is taller than \py{B} is 0.
5036
5037
\end{enumerate}
5038
5039
To compute the likelihood of every pair of values, we can extract the quantities from the joint prior, like this:
5040
5041
\begin{code}
5042
Y = joint.index.to_numpy()
5043
X = joint.columns.to_numpy()
5044
\end{code}
5045
5046
And then apply the \py{outer} version of \py{np.subtract}, which computes the difference between every element of \py{Y} (height of \py{A}) and every element of \py{X} (height of \py{B}).
5047
5048
\begin{code}
5049
diff = np.subtract.outer(Y, X)
5050
\end{code}
5051
5052
The result is an array of differences. To compute likelihoods, we use \py{np.where}, which puts \py{1} where \py{diff} is greater than 0 and \py{0} elsewhere.
5053
5054
\begin{code}
5055
a = np.where(diff>0, 1, 0)
5056
\end{code}
5057
5058
The result is an array of likelihoods, which I will put in a \py{DataFrame} with the values of \py{Y} in the index and the values of \py{X} in the columns.
5059
5060
\begin{code}
5061
likelihood = pd.DataFrame(a, index=Y, columns=X)
5062
\end{code}
5063
5064
Figure~\ref{fig09-02} shows the likelihood that A is taller than B for each hypothetical pair of heights.
5065
5066
\begin{figure}
5067
% chap09soln.ipynb
5068
\centerline{\includegraphics[width=4in]{figs/fig09-02.pdf}}
5069
\caption{Likelihood that A is taller than B for each hypothetical pair of heights.}
5070
\label{fig09-02}
5071
\end{figure}
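
On a coarse, hypothetical grid of three heights, the likelihood matrix is small enough to inspect directly. In this sketch, rows correspond to A and columns to B, as in the text; because the comparison is strict, pairs with equal heights get likelihood 0.

```python
import numpy as np

heights = np.array([170, 180, 190])   # hypothetical coarse grid, in cm

# diff[i, j] = (height of A in row i) - (height of B in column j)
diff = np.subtract.outer(heights, heights)
likelihood = np.where(diff > 0, 1, 0)

print(likelihood)
```

The ones fall strictly below the diagonal, which is the region where A is taller than B.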
5072
5073
We have a prior, we have a likelihood, and we are ready for the update.
5074
5075
\section{The update}
5076
5077
As usual, the unnormalized posterior is the product of the prior and the likelihood.
5078
5079
\begin{code}
5080
posterior = joint * likelihood
5081
\end{code}
5082
5083
I'll use the following function to normalize the posterior:
5084
5085
\begin{code}
def normalize(joint):
    prob_data = joint.to_numpy().sum()
    joint /= prob_data
\end{code}
5090
5091
We have to convert the \py{DataFrame} to a NumPy array before calling \py{sum}. Otherwise, \py{DataFrame.sum} would compute the sums of the columns and return a \py{Series}.
5092
5093
Now we can normalize the posterior:
5094
5095
\begin{code}
5096
normalize(posterior)
5097
\end{code}
5098
5099
Figure~\ref{fig09-03} shows the result.
5100
5101
\begin{figure}
5102
% chap09soln.ipynb
5103
\centerline{\includegraphics[width=4in]{figs/fig09-03.pdf}}
5104
\caption{Joint posterior distribution of height for A and B.}
5105
\label{fig09-03}
5106
\end{figure}
5107
5108
For all hypotheses where \py{A} is not taller than \py{B}, the posterior probability is 0.
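
To see the whole update in miniature, here is a sketch on a hypothetical grid of three heights with a made-up prior. It uses plain NumPy arrays rather than \py{DataFrame} objects, which is my simplification; the steps are the same: outer product for the joint prior, elementwise product with the likelihood, then division by the total.

```python
import numpy as np

heights = np.array([170, 180, 190])    # hypothetical grid, in cm
prior = np.array([0.25, 0.5, 0.25])    # hypothetical prior for A and for B

joint = np.multiply.outer(prior, prior)                # rows: A, columns: B
likelihood = np.subtract.outer(heights, heights) > 0   # True where A is taller

posterior = joint * likelihood
posterior = posterior / posterior.sum()                # normalize over all pairs

print(posterior)
```

Only the three pairs where A is taller keep any probability; in this toy example they end up with probabilities 0.4, 0.2, and 0.4.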
5109
5110
5111
\section{The marginals}
5112
\label{marginals}
5113
5114
The joint posterior distribution represents what we believe about the heights of \py{A} and \py{B}, given the prior distributions and the information that \py{A} is taller.
5115
5116
From this joint distribution, we can compute posterior distributions for \py{A} and \py{B}. To see how, let's start with a simpler problem.
5117
5118
Suppose we want to know the probability that \py{B} is 180 cm tall. We can select the column from the joint distribution where \py{X=180}.
5119
5120
\begin{code}
5121
column = posterior[180]
5122
\end{code}
5123
5124
This column contains posterior probabilities for all cases where \py{X=180}; if we add them up, we get the total probability that \py{B} is 180 cm tall.
5125
5126
\begin{code}
5127
column.sum()
5128
\end{code}
5129
5130
Now, to get the posterior distribution of height for \py{B}, we can add up all of the columns, like this:
5131
5132
\begin{code}
5133
column_sums = posterior.sum(axis=0)
5134
\end{code}
5135
5136
The argument \py{axis=0} means we sum along the index, that is, down the rows, so we get one total for each column.
5137
5138
The result is a \py{Series} that contains every possible height for \py{B} and its probability. In other words, it is the distribution of heights for \py{B}.
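
The \py{axis} argument is easy to get backwards, so here is a small sketch on a hypothetical three-by-three joint distribution (the numbers are made up for illustration, and the names \py{toy}, \py{col_totals}, and \py{row_totals} are mine). Summing with \py{axis=0} yields a \py{Series} indexed by the column labels; \py{axis=1} yields one indexed by the row labels.

```python
import pandas as pd

# Hypothetical joint posterior: rows are heights of A, columns heights of B.
toy = pd.DataFrame([[0.0, 0.0, 0.0],
                    [0.4, 0.0, 0.0],
                    [0.2, 0.4, 0.0]],
                   index=[170, 180, 190],
                   columns=[170, 180, 190])

col_totals = toy.sum(axis=0)   # one total per column, indexed by B's height
row_totals = toy.sum(axis=1)   # one total per row, indexed by A's height

print(col_totals)
print(row_totals)
```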
5139
5140
We can put it in a \py{Pmf} like this:
5141
5142
\begin{code}
5143
marginal_B = Pmf(column_sums)
5144
\end{code}
5145
5146
When we extract the distribution of a single variable from a joint distribution, the result is called a {\bf marginal distribution}.
5147
The name comes from a common visualization that shows the joint distribution in the middle and the marginal distributions in the margins.
5148
5149
Similarly, we can get the posterior distribution of height for \py{A} by adding up the rows and putting the result in a \py{Pmf}.
5150
5151
\begin{code}
5152
row_sums = posterior.sum(axis=1)
5153
marginal_A = Pmf(row_sums)
5154
\end{code}
5155
5156
The following function takes a joint distribution and an axis number, and returns a marginal distribution.
5157
5158
\begin{code}
def marginal(joint, axis):
    return Pmf(joint.sum(axis=axis))
\end{code}
5162
5163
So we can compute the marginal distributions like this.
5164
5165
\begin{code}
5166
marginal_B = marginal(posterior, axis=0)
5167
marginal_A = marginal(posterior, axis=1)
5168
\end{code}
5169
5170
Figure~\ref{fig09-04} shows what they look like.
5171
5172
\begin{figure}
5173
% chap09soln.ipynb
5174
\centerline{\includegraphics[width=4in]{figs/fig09-04.pdf}}
5175
\caption{Prior and posterior distributions for A and B.}
5176
\label{fig09-04}
5177
\end{figure}
5178
5179
As you might expect, the posterior distribution for \py{A} is shifted to the right and the posterior distribution for \py{B} is shifted to the left.
5180
5181
Based on the observation that \py{A} is taller than \py{B}, we are inclined to believe that \py{A} is a little taller than average, and \py{B} is a little shorter.
5182
5183
Notice that the posterior distributions are a little narrower than the prior.
5184
The standard deviations of the posterior distributions are a little smaller, which means we are a little more certain about the heights of \py{A} and \py{B} after we compare them.
5185
5186
5187
\section{Conditional posteriors}
5188
5189
Now suppose we measure \py{B} and find that he is 180 cm tall. What does that tell us about \py{A}?
5190
5191
In the joint distribution, each column corresponds to a possible height for \py{B}. We can select the column that corresponds to height 180 cm like this:
5192
5193
\begin{code}
5194
column_180 = posterior[180]
5195
\end{code}
5196
5197
The result is a \py{Series} that represents possible heights for \py{A} and their relative probabilities.
These probabilities are not normalized, but we can normalize them like this:
5199
5200
\begin{code}
5201
cond_A = Pmf(column_180)
5202
cond_A.normalize()
5203
\end{code}
5204
5205
The result is the {\bf conditional distribution} of height for \py{A} given that \py{B} is 180 cm tall.
5206
Figure~\ref{fig09-05} shows what it looks like.
5207
5208
Note that when we make a \py{Pmf} it copies the data by default, so we can modify \py{cond_A} without affecting \py{column_180} or \py{posterior}.
5209
5210
\begin{figure}
5211
% chap09soln.ipynb
5212
\centerline{\includegraphics[width=4in]{figs/fig09-05.pdf}}
5213
\caption{Conditional posterior distribution of height for A, given that B is 180 cm tall.}
5214
\label{fig09-05}
5215
\end{figure}
5216
5217
The conditional distribution is cut off at 180 cm, because we have established that \py{A} is taller than \py{B} and \py{B} is 180 cm.
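
Conditioning amounts to selecting a slice of the joint distribution and renormalizing it. The following sketch shows this on a hypothetical three-by-three joint posterior (the numbers are made up), using a NumPy array in place of a \py{DataFrame} and \py{Pmf}.

```python
import numpy as np

heights = np.array([170, 180, 190])       # hypothetical grid, in cm
posterior = np.array([[0.0, 0.0, 0.0],    # rows: A, columns: B
                      [0.4, 0.0, 0.0],
                      [0.2, 0.4, 0.0]])

column = posterior[:, 0]          # slice where B is 170 cm
cond_A = column / column.sum()    # renormalize the slice

print(cond_A)
```

Given that B is 170 cm in this toy example, A cannot be 170 cm, and the remaining probability is split 2 to 1 between 180 cm and 190 cm.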
5218
5219
\section{Dependence and independence}
5220
5221
When we constructed the joint prior distribution, I said that the heights of \py{A} and \py{B} were independent, which means that knowing one of them provides no information about the other.
5222
In other words, the conditional probability $\p{A_y | B_x}$ is the same as the unconditioned probability $\p{A_y}$.
5223
5224
That's why we can compute an element of the joint prior, $\p{A_y~\mathrm{and}~B_x}$, by rewriting it in terms of conditional probability, $\p{B_x}~\p{A_y~|~B_x}$, and using the independence of $A$ and $B$ to replace the conditional probability.
5225
5226
Putting it all together, we have
5227
5228
$\p{A_y~\mathrm{and}~B_x} = \p{B_x}~\p{A_y}$
5229
5230
But remember, that's only true if $A$ and $B$ are independent.
5231
In the posterior distribution, they are not.
5232
We know that \py{A} is taller than \py{B}, so if we know how tall \py{B} is, that gives us information about \py{A}.
5233
5234
The conditional distribution we just computed demonstrates this dependence.
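
Another way to check for dependence is to compare the joint distribution with the outer product of its marginals; the two agree only when the variables are independent. This sketch runs that check on a hypothetical three-by-three joint posterior with made-up numbers.

```python
import numpy as np

posterior = np.array([[0.0, 0.0, 0.0],    # hypothetical joint posterior
                      [0.4, 0.0, 0.0],    # rows: A, columns: B
                      [0.2, 0.4, 0.0]])

marginal_A = posterior.sum(axis=1)
marginal_B = posterior.sum(axis=0)
product = np.multiply.outer(marginal_A, marginal_B)

print(np.allclose(posterior, product))   # False: A and B are dependent
```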
5235
5236
5237
\section{Summary}
5238
5239
In this chapter I started with the ``outer'' operations, like outer product, which we used to construct a joint distribution.
5240
5241
In general, you cannot construct a joint distribution from two marginal distributions, but in the special case where the distributions are independent, you can.
5242
5243
We extended the Bayesian update process we've seen in previous chapters and applied it to a joint distribution. Then from the posterior joint distribution we extracted posterior marginal distributions and posterior conditional distributions.
5244
5245
As an exercise, you'll have a chance to apply the same process to a slightly more difficult problem, updating Elo ratings based on the outcome of a chess game.
5246
5247
5248
\section{Exercises}
5249
5250
The code for this chapter is in \py{chap09.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
5251
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap09.ipynb}.
5252
5253
The notebook provides space where you can work on the following problems.
5254
5255
\begin{exercise}
5256
Based on the results of the previous example, compute the posterior conditional distribution for \py{B} given that \py{A} is 190 cm.
5257
\end{exercise}
5258
5259
5260
\begin{exercise}
5261
Suppose we have established that \py{A} is taller than \py{B}, but we don't know how tall \py{B} is.
5262
Now we choose a random woman, \py{C}, and find that she is shorter than \py{A} by at least 15 cm. Compute posterior distributions for the heights of \py{A} and \py{C}.
5263
5264
The average height for women in the U.S. is 163 cm; the standard deviation is 7.3 cm.
5265
\end{exercise}
5266
5267
5268
\begin{exercise}
5269
At the beginning of this chapter, I introduced
5270
the Elo rating system, which is used to quantify the skill level of players for games like chess.
5271
5272
It is based on a model of the relationship between the ratings of players and the outcome of a game. Specifically, if $R_A$ is the rating of player \py{A} and $R_B$ is the rating of player \py{B}, the probability that \py{A} beats \py{B} is given by the logistic function:
5273
5274
$\p{\mathrm{A~beats~B}} = 1 / (1 + 10^{(R_B-R_A)/400})$
5275
5276
Suppose \py{A} has a current rating of 1600, but we are not sure it is accurate. We could describe their true rating with a normal distribution with mean 1600 and standard deviation 100, to indicate our uncertainty.
5277
5278
And suppose \py{B} has a current rating of 1800, with the same level of uncertainty.
5279
5280
Then \py{A} and \py{B} play and \py{A} wins. How should we update their ratings?
5281
5282
To answer this question:
5283
5284
\begin{enumerate}
5285
5286
\item Construct prior distributions for \py{A} and \py{B}.
5287
5288
\item Use them to construct a joint distribution, assuming that the prior distributions are independent.
5289
5290
\item Use the logistic function above to compute the likelihood of the outcome under each joint hypothesis.
5291
5292
\item Use the joint prior and likelihood to compute the joint posterior.
5293
5294
\item Extract and plot the marginal posteriors for \py{A} and \py{B}.
5295
5296
\item Compute the posterior means for \py{A} and \py{B}. How much should their ratings change based on this outcome?
5297
5298
\end{enumerate}
5299
5300
\end{exercise}
\chapter{Classification}
5305
\label{classification}
5306
5307
5308
Classification might be the best-known application of Bayesian
5309
methods, made famous as the basis of the first generation of spam
5310
filters in the 1990s (see \url{https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering}).
5311
5312
In this chapter, I'll demonstrate Bayesian classification using data
5313
collected and made available by Dr.~Kristen Gorman at the Palmer
5314
Long-Term Ecological Research Station in Antarctica. We'll use this data
5315
to classify penguins by species.
5316
5317
This dataset was published to support this article: Gorman, Williams,
5318
and Fraser, ``Ecological
5319
Sexual Dimorphism and Environmental Variability within a Community of
5320
Antarctic Penguins (Genus \emph{Pygoscelis})'', March 2014, which you can read at \url{https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081}.
5321
5322
The dataset contains one row for each penguin and one column for each
5323
variable, including the measurements we will use for classification.
5324
We can read it into a \py{DataFrame} like this:
5325
5326
\begin{code}
5327
df = pd.read_csv('penguins_raw.csv')
5328
\end{code}
5329
5330
Three species of penguins are represented in the dataset: Adelie,
5331
Chinstrap and Gentoo.
5332
The measurements we'll use to classify them are:
5333
5334
\begin{itemize}
5335
\item
5336
Body Mass in grams (g).
5337
\item
5338
Flipper Length in millimeters (mm).
5339
\item
5340
Culmen Length in millimeters.
5341
\item
5342
Culmen Depth in millimeters.
5343
\end{itemize}
5344
5345
If you are not familiar with the word ``culmen'', it refers to the
5346
top margin of the beak (see \url{https://en.wikipedia.org/wiki/Bird_measurement\#Culmen}).
5347
5348
5349
\section{Distributions of measurements}
5350
\label{distributions-of-measurements}
5351
5352
These measurements will be most useful for classification if there are
5353
substantial differences between species and small variation within
5354
species. To see whether that is true, and to what degree, I will plot
5355
cumulative distribution functions (CDFs) of each measurement for each
5356
species.
5357
5358
The following function takes the \py{DataFrame} and
5359
a column name, and returns a dictionary that maps from each species name
5360
to a \py{Cdf} of the values in the given column.
5361
5362
\begin{code}
def make_cdf_map(df, varname, by='Species2'):
    cdf_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        cdf_map[species] = Cdf.from_seq(group, name=species)
    return cdf_map
\end{code}
5370
5371
Figure~\ref{fig10-01} shows the CDFs of these measurements for each species.
5372
5373
\begin{figure}
5374
% chap01soln.ipynb
5375
\centerline{\includegraphics[width=5.5in]{figs/fig10-01.pdf}}
5376
\caption{CDFs of each measurement for each species.}
5377
\label{fig10-01}
5378
\end{figure}
5379
5380
It looks like we can use culmen length to identify Adelie penguins, but
5381
the distributions for the other two species almost entirely overlap.
5382
5383
Using flipper length, we can distinguish Gentoo penguins from the other
5384
two species. So with just these two features, it seems like we should be
5385
able to classify penguins with some accuracy.
5386
5387
Culmen depth and body mass distinguish Gentoo penguins from the other
5388
two species, but these features might not add a lot of additional
5389
information, beyond flipper length and culmen length.
5390
5391
All of these CDFs show the sigmoid shape characteristic of the normal
5392
distribution; I will take advantage of that observation in the next
5393
section.
5394
5395
\section{Normal models}
5396
\label{normal-models}
5397
5398
Now let's use these features to classify penguins. I'll proceed in the
5399
usual Bayesian way:
5400
5401
\begin{enumerate}
5402
5403
\item
5404
I'll define a prior distribution with one hypothesis for each
species and a prior probability for each hypothesis,
5406
\item
5407
I'll compute the likelihood of the data under each hypothesis, and
5408
then
5409
\item
5410
Compute the posterior probability of each hypothetical species.
5411
\end{enumerate}
5412
5413
To compute the likelihood of the data under each hypothesis, I will use
5414
the data to estimate the parameters of a normal distribution for each
5415
feature and each species.
5416
5417
The following function takes a \py{DataFrame} and a
5418
column name; it returns a dictionary that maps from each species name to
5419
a \py{norm} object. \py{norm}
5420
is defined in SciPy; it represents a normal distribution with a given
5421
mean and standard deviation.
5422
5423
\begin{code}
from scipy.stats import norm

def make_norm_map(df, varname, by='Species2'):
    norm_map = {}
    grouped = df.groupby(by)[varname]
    for species, group in grouped:
        mean = group.mean()
        std = group.std()
        norm_map[species] = norm(mean, std)
    return norm_map
\end{code}
5435
5436
For example, here's how we estimate the distributions of flipper length
5437
for the three species.
5438
5439
\begin{code}
5440
flipper_map = make_norm_map(df, 'Flipper Length (mm)')
5441
\end{code}
5442
5443
As usual I will use a \py{Pmf} to represent the
5444
prior distribution. For simplicity, I'll assume that the three species
5445
are equally likely.
5446
5447
\begin{code}
5448
hypos = flipper_map.keys()
5449
prior = Pmf(1/3, hypos)
5450
prior
5451
\end{code}
5452
5453
Now suppose we measure a penguin and find that its flipper is 210 mm.
5454
What is the probability of that measurement under each hypothesis?
5455
5456
The \py{norm} object provides
5457
\py{pdf}, which computes the probability density
5458
function (PDF) of the normal distribution. We can use it to compute the
5459
likelihood of the observed data in a given distribution.
5460
5461
\begin{code}
5462
data = 210
5463
flipper_map['Adelie'].pdf(data)
5464
\end{code}
5465
5466
The result is a probability density, so we can't interpret it as a
5467
probability. But it is proportional to the likelihood of the data, so we
5468
can use it to update the prior.
5469
5470
Here's how we compute the likelihood of the data in each distribution.
5471
5472
\begin{code}
5473
likelihood = [flipper_map[hypo].pdf(data) for hypo in hypos]
5474
\end{code}
5475
5476
Now we can do the update in the usual way.
5477
5478
\begin{code}
5479
posterior = prior * likelihood
5480
posterior.normalize()
5481
\end{code}
5482
5483
And here are the results:
5484
5485
\input{tables/table10-01}
5486
5487
A penguin with a 210 mm flipper has an 80\% chance of being a Gentoo and
about a 19\% chance of being a Chinstrap (assuming that the three
5489
species were equally likely before the measurement).
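
The update above does not depend on SciPy or \py{Pmf}; it is a product of prior and density followed by normalization. Here is a minimal sketch using only the standard library. The means and standard deviations in \py{params} are hypothetical round numbers chosen for illustration, not the values estimated from the dataset, and the helper \py{normal_pdf} is mine.

```python
import math

def normal_pdf(x, mean, std):
    """Density at x of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical (mean, std) of flipper length in mm; not estimated from the data.
params = {'Adelie': (190, 6.5), 'Chinstrap': (196, 7.0), 'Gentoo': (217, 6.5)}
prior = {species: 1 / 3 for species in params}

data = 210
unnorm = {s: prior[s] * normal_pdf(data, *params[s]) for s in params}
total = sum(unnorm.values())
posterior = {s: p / total for s, p in unnorm.items()}

print(posterior)
```

With these made-up parameters, Gentoo comes out as the most probable species for a 210 mm flipper, which agrees qualitatively with the table above.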
5490
5491
The following function encapsulates the steps we just ran. It takes a
5492
\py{Pmf} representing the prior distribution, the
5493
observed data, and a map from each hypothesis to the distribution of the
5494
feature.
5495
5496
\begin{code}
def update_penguin(prior, data, norm_map):
    hypos = prior.qs
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
    posterior = prior * likelihood
    posterior.normalize()
    return posterior
\end{code}
5504
5505
The return value is the posterior distribution.
5506
5507
As we saw in the CDFs, flipper length does not distinguish strongly
5508
between Adelie and Chinstrap penguins. If a penguin has a 190 mm
5509
flipper, it is almost certainly not a Gentoo, but it is almost equally
5510
likely to be Adelie or Chinstrap.
5511
5512
\begin{code}
5513
posterior2 = update_penguin(prior, 190, flipper_map)
5514
\end{code}
5515
5516
But culmen length \emph{can} make this distinction. We can estimate
5517
distributions of culmen length for each species like this:
5518
5519
\begin{code}
5520
culmen_map = make_norm_map(df, 'Culmen Length (mm)')
5521
\end{code}
5522
5523
A penguin with culmen length 38 mm is almost certainly an Adelie.
5524
5525
\begin{code}
5526
posterior3 = update_penguin(prior, 38, culmen_map)
5527
\end{code}
5528
5529
With culmen length 48 mm, it is probably not an Adelie, but it's about
5530
equally likely to be a Chinstrap or Gentoo.
5531
5532
\begin{code}
5533
posterior4 = update_penguin(prior, 48, culmen_map)
5534
\end{code}
5535
5536
Using one feature at a time, sometimes we can classify penguins with
5537
high confidence; sometimes we can't. We can do better using multiple
5538
features.
5539
5540
\section{Naive Bayesian classification}
5541
\label{naive-bayesian-classification}
5542
5543
To make it easier to do multiple updates, I'll use the following
function, which takes a prior \py{Pmf}, a sequence of
measurements, and a corresponding sequence of dictionaries containing
estimated distributions.
5547
5548
\begin{code}
def update_naive(prior, data_seq, norm_maps):
    posterior = prior.copy()
    for data, norm_map in zip(data_seq, norm_maps):
        posterior = update_penguin(posterior, data, norm_map)
    return posterior
\end{code}
5555
5556
The return value is a posterior \py{Pmf}.
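Updating once per feature is equivalent to a single update with the product of the per-feature densities, which is where the independence assumption enters. The following sketch checks that equivalence with the standard library only. The parameter values and the helpers \py{normalize} and \py{update} are assumptions made for illustration.

```python
from math import prod
from statistics import NormalDist

# Assumed (mean, std) per species for two features; illustrative values only.
culmen = {'Adelie': (38.8, 2.7), 'Chinstrap': (48.8, 3.3), 'Gentoo': (47.5, 3.1)}
flipper = {'Adelie': (190, 6.5), 'Chinstrap': (196, 7.1), 'Gentoo': (217, 6.5)}

def normalize(d):
    """Scale the values of a dict so they add up to 1."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def update(prior, data, params):
    """One Bayesian update with a normal likelihood for a single feature."""
    return normalize({h: prior[h] * NormalDist(*params[h]).pdf(data)
                      for h in prior})

prior = {h: 1/3 for h in culmen}
data_seq, maps = (48, 210), (culmen, flipper)

# One update per feature, feeding each posterior into the next update.
sequential = prior
for data, params in zip(data_seq, maps):
    sequential = update(sequential, data, params)

# A single update with the product of the per-feature densities.
batch = normalize({
    h: prior[h] * prod(NormalDist(*m[h]).pdf(x) for x, m in zip(data_seq, maps))
    for h in prior
})

print(sequential)
print(batch)
```

The two results agree up to floating-point error, because normalizing between updates only rescales the intermediate values.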
5557
5558
I'll use the same features we looked at in the previous section: culmen
5559
length and flipper length.
5560
5561
\begin{code}
5562
varnames = ['Culmen Length (mm)', 'Flipper Length (mm)']
5563
norm_maps = [culmen_map, flipper_map]
5564
\end{code}
5565
5566
Now suppose we find a penguin with culmen length 48 mm and flipper
5567
length 210 mm. Here's the update:
5568
5569
\begin{code}
5570
data_seq = 48, 210
5571
posterior = update_naive(prior, data_seq, norm_maps)
5572
\end{code}
5573
5574
It's most likely to be a Gentoo.
5575
5576
I'll loop through the dataset and classify each penguin with these two
5577
features.
5578
5579
\begin{code}
df['Classification'] = np.nan
for i, row in df.iterrows():
    data_seq = row[varnames]
    posterior = update_naive(prior, data_seq, norm_maps)
    df.loc[i, 'Classification'] = posterior.max_prob()
\end{code}
5586
5587
The result is a new column in the \py{DataFrame}.
5588
So let's see how many we got right.
5589
5590
There are 344 penguins in the dataset, but two of them are missing
5591
measurements, so we have 342 valid cases.
5592
Of those, 324 are classified correctly, which is almost 95\%.
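As a rough sketch of how such an accuracy figure can be computed, the following function compares two lists of labels and leaves out cases that have no prediction. It is a hypothetical helper with made-up example data, not the \py{accuracy} function used later in this chapter.

```python
def accuracy(actual, predicted):
    """Fraction of valid cases where the prediction matches the actual label.

    Cases with no prediction (None) are excluded from the denominator.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    correct = sum(a == p for a, p in pairs)
    return correct / len(pairs)

# Tiny made-up example: four valid cases, three of them correct.
actual = ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie', 'Gentoo']
predicted = ['Adelie', 'Gentoo', 'Adelie', None, 'Gentoo']
print(accuracy(actual, predicted))   # 0.75

# The figures quoted in the text: 324 correct out of 342 valid cases.
print(round(324 / 342, 3))           # 0.947
```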
5593
5594
The classifier we used in this section is called ``naive'' because it
5595
ignores correlations between the features. To see why that matters, I'll
5596
make a less naive classifier: one that takes into account the joint
5597
distribution of the features.
5598
5599
\section{Joint distributions}
5600
\label{joint-distributions}
5601
5602
Let's see what the joint distribution looks like.
5603
I'll start by making a scatter plot of the data.
5604
5605
\begin{code}
def scatterplot(df, var1, var2):
    grouped = df.groupby('Species2')
    for species, group in grouped:
        plt.plot(group[var2], group[var1], 'o',
                 alpha=0.4, label=species)

    decorate(ylabel=var1, xlabel=var2)
\end{code}
5614
5615
Figure~\ref{fig01-02} shows a scatter plot of culmen length and flipper length for the three
5616
species.
5617
5618
\begin{figure}
5619
\centerline{\includegraphics[width=4in]{figs/fig10-02.pdf}}
5620
\caption{}
5621
\label{fig01-02}
5622
\end{figure}
5623
5624
Within each species, there is a clear correlation between culmen length
5625
and flipper length.
5626
5627
If we ignore these correlations, we are assuming that the features are
5628
independent. To see what that looks like, I'll make a joint distribution
5629
for each species assuming independence.
5630
5631
The following function makes a discrete \py{Pmf}
5632
that approximates a normal distribution.
5633
It takes a \py{norm} object as a parameter; \py{sigmas} is the number of standard deviations to include above and below the mean; \py{n} is the number of points in the result.
5634
5635
\begin{code}
def make_pmf(dist, sigmas=3, n=101):
    mean, std = dist.mean(), dist.std()
    low = mean - sigmas * std
    high = mean + sigmas * std
    qs = np.linspace(low, high, n)
    ps = dist.pdf(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
5646
5647
We can use it, along with \py{outer_product} from Section~\ref{outer-operations}, to make a joint distribution of culmen length and
5648
flipper length for each species.
5649
5650
\begin{code}
joint_map = {}
for species in hypos:
    pmf1 = make_pmf(culmen_map[species])
    pmf2 = make_pmf(flipper_map[species])
    joint_map[species] = outer_product(pmf1, pmf2)
\end{code}
5657
5658
And we can use the joint distribution to generate a contour plot.
5659
5660
\begin{code}
def plot_contour(joint, **options):
    plt.contour(joint.columns, joint.index, joint, **options)
\end{code}
5664
5665
Figure~\ref{fig10-03} compares the data to joint distributions that
5666
assume independence.
5667
5668
\begin{figure}
5669
\centerline{\includegraphics[width=4in]{figs/fig10-03.pdf}}
5670
\caption{}
5671
\label{fig10-03}
5672
\end{figure}
5673
5674
The contours of a joint normal distribution form ellipses.
In this example, because the model treats the features as uncorrelated,
the ellipses are aligned with the axes. But they are not well aligned
with the data.
5677
5678
We can make a better model of the data, and use it to compute better
5679
likelihoods, with a multivariate normal distribution.
5680
5681
5682
\section{Multivariate normal distribution}
5683
\label{multivariate-normal-distribution}
5684
5685
As we have seen, a univariate normal distribution is characterized by
5686
its mean and standard deviation or variance (where variance is the
5687
square of standard deviation).
5688
5689
A multivariate normal distribution is characterized by the means of the
5690
features and the \textbf{covariance matrix}, which contains the
5691
variances, which quantify the spread of the features, and the
5692
covariances, which quantify the relationships among them.
5693
5694
We can use the data to estimate the means and covariance matrix for the
5695
population of penguins. First I'll select the columns we want.
5696
5697
\begin{code}
5698
features = df[[var1, var2]]
5699
features.head()
5700
\end{code}
5701
5702
And compute the means.
5703
5704
\begin{code}
5705
mean = features.mean()
5706
mean
5707
\end{code}
5708
5709
\begin{code}
5710
# convert to a DataFrame and write as a table
5711
mean_df = pd.DataFrame(mean, columns=['mean'])
5712
write_table(mean_df, 'table10-04')
5713
\end{code}
5714
5715
The result is a \py{Series} containing the mean
5716
culmen length and flipper length.
5717
5718
We can also compute the covariance matrix:
5719
5720
\begin{code}
5721
cov = features.cov()
5722
write_table(cov, 'table10-05')
5723
cov
5724
\end{code}
5725
5726
The result is a \py{DataFrame} with one row and
one column for each feature. The elements on the diagonal are the
variances; the elements off the diagonal are covariances.
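For two features, the structure of this matrix can be written out explicitly. Using $\sigma_1$ and $\sigma_2$ for the standard deviations of the features and $\rho$ for their correlation (symbols introduced here for illustration), the covariance matrix is
\[
\Sigma =
\begin{pmatrix}
\sigma_1^2 & \rho \, \sigma_1 \sigma_2 \\
\rho \, \sigma_1 \sigma_2 & \sigma_2^2
\end{pmatrix} .
\]
When $\rho = 0$ the off-diagonal elements vanish, which is the case the naive classifier implicitly assumes.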
5729
5730
SciPy provides a \py{multivariate_normal} object
5731
we can use to represent a multivariate normal distribution. It takes a
5732
sequence of means and a covariance matrix as parameters:
5733
5734
\begin{code}
5735
from scipy.stats import multivariate_normal
5736
5737
multinorm = multivariate_normal(mean, cov)
5738
multinorm
5739
\end{code}
5740
5741
The following function makes a
5742
\py{multivariate_normal} object for each species.
5743
5744
\begin{code}
def make_multinorm_map(df, varnames):
    multinorm_map = {}
    grouped = df.groupby('Species2')
    for species, group in grouped:
        features = group[varnames]
        mean = features.mean()
        cov = features.cov()
        multinorm_map[species] = multivariate_normal(mean, cov)
    return multinorm_map
\end{code}
5755
5756
And here's how we use it.
5757
5758
\begin{code}
5759
multinorm_map = make_multinorm_map(df, [var1, var2])
5760
\end{code}
5761
5762
In the next section we'll see what these multivariate normal distributions
look like.

Then we'll use them to classify penguins, and we'll see if the results
are more accurate than the naive Bayesian classifier.
5767
5768
5769
\section{Visualizing a multivariate normal distribution}
5770
\label{visualizing-a-multivariate-normal-distribution}
5771
5772
This section uses some NumPy magic to generate contour plots for
5773
multivariate normal distributions. If that's interesting for you, great!
5774
Otherwise, feel free to skip to the results. In the next section we'll
5775
do the actual classification, which turns out to be easier than the
5776
visualization.
5777
5778
I'll start by making a contour map for the distribution of features
5779
among Adelie penguins.\\
5780
Here are the univariate distributions for the two features we'll use and
5781
the multivariate distribution we just computed.
5782
5783
\begin{code}
5784
norm1 = culmen_map['Adelie']
5785
norm2 = flipper_map['Adelie']
5786
multinorm = multinorm_map['Adelie']
5787
\end{code}
5788
5789
I'll make a discrete \py{Pmf} approximation for
5790
each of the univariate distributions.
5791
5792
\begin{code}
5793
pmf1 = make_pmf(norm1)
5794
pmf2 = make_pmf(norm2)
5795
\end{code}
5796
5797
And use them to make a mesh that contains all pairs of values.
5798
5799
\begin{code}
# 'ij' indexing puts pmf1.qs along the rows, matching the DataFrame below
X, Y = np.meshgrid(pmf1.qs, pmf2.qs, indexing='ij')
\end{code}
5802
5803
The mesh is represented by two arrays, one containing the values along
5804
the $x$ axis, the other containing the values along the $y$ axis.
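The orientation of the mesh is easy to get wrong when both grids have the same length, so here is a small sketch of the shapes involved. The grids are made up and deliberately have different lengths; the variable names are chosen for this example only.

```python
import numpy as np

# Made-up grids of different lengths, so the two axes are easy to tell apart.
qs1 = np.linspace(30, 50, 5)     # stand-in for culmen lengths
qs2 = np.linspace(170, 210, 7)   # stand-in for flipper lengths

# Default 'xy' indexing: the first input varies along the columns.
X_xy, Y_xy = np.meshgrid(qs1, qs2)
print(X_xy.shape)                # (7, 5)

# 'ij' indexing: the first input varies along the rows.
X_ij, Y_ij = np.meshgrid(qs1, qs2, indexing='ij')
print(X_ij.shape)                # (5, 7)

# Element [i, j] of the 'ij' mesh holds the pair (qs1[i], qs2[j]).
i, j = 1, 2
print(X_ij[i, j] == qs1[i], Y_ij[i, j] == qs2[j])   # True True
```

With \py{'ij'} indexing, an array evaluated on the mesh has one row per value of the first grid, which is the layout needed if the result is later labeled with the first grid as the index and the second as the columns.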
5805
5806
In order to evaluate the multivariate distribution for each pair of
5807
values, we have to ``stack'' the arrays.
5808
5809
\begin{code}
5810
pos = np.dstack((X, Y))
5811
\end{code}
5812
5813
The result is a 3-D array that you can think of as a 2-D array of pairs.
5814
When we pass this array to \py{multinorm.pdf}, it
5815
evaluates the probability density function of the distribution for each
5816
pair of values.
5817
5818
\begin{code}
5819
a = multinorm.pdf(pos)
5820
\end{code}
5821
5822
The result is an array of probability densities. If we put them in a
5823
\py{DataFrame} and normalize them, the result is a
5824
discrete approximation of the joint distribution of the two features.
5825
5826
\begin{code}
5827
joint = pd.DataFrame(a, index=pmf1.qs, columns=pmf2.qs)
5828
normalize(joint)
5829
\end{code}
5830
5831
Which we can plot with \py{plot_contour}:
5832
5833
\begin{code}
5834
plot_contour(joint)
5835
\end{code}
5836
5837
Figure~\ref{fig10-04} shows a scatter plot of the data along with the
5838
contours of the multivariate normal distribution for each species.
5839
5840
\begin{figure}
5841
% chap01soln.ipynb
5842
\centerline{\includegraphics[width=4in]{figs/fig10-04.pdf}}
5843
\caption{}
5844
\label{fig10-04}
5845
\end{figure}
5846
5847
The contours of a multivariate normal distribution are still ellipses,
5848
but now that we have taken into account the correlation between the
5849
features, the ellipses are no longer aligned with the axes.
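To see how the correlation enters the density, here is a sketch that writes out the bivariate normal PDF by hand, using only the standard library. The parameters and the function \py{bivariate_pdf} are assumptions for illustration; in the chapter itself, SciPy's \py{multivariate_normal} does this work.

```python
from math import exp, isclose, pi, sqrt
from statistics import NormalDist

def bivariate_pdf(x, y, mu1, mu2, s1, s2, rho):
    """Density of a bivariate normal distribution with correlation rho (|rho| < 1)."""
    zx = (x - mu1) / s1
    zy = (y - mu2) / s2
    q = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    return exp(-q / 2) / (2 * pi * s1 * s2 * sqrt(1 - rho**2))

# Assumed parameters, chosen only for illustration.
mu1, s1 = 38.8, 2.7     # culmen length (mm)
mu2, s2 = 190, 6.5      # flipper length (mm)

# With rho = 0 the joint density is the product of the two marginal densities.
independent = bivariate_pdf(41, 196, mu1, mu2, s1, s2, rho=0)
product = NormalDist(mu1, s1).pdf(41) * NormalDist(mu2, s2).pdf(196)
print(isclose(independent, product))     # True

# With a positive correlation, this point (both features above their means)
# gets a higher density than under the independence model.
correlated = bivariate_pdf(41, 196, mu1, mu2, s1, s2, rho=0.3)
print(correlated > independent)          # True
```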
5850
5851
Because it takes the correlations into account, the multivariate normal
5852
distribution is a better model for the data. And there is less overlap
5853
in the contours of the three distributions, which suggests that they
5854
should yield better classifications.
5855
5856
\section{A less naive classifier}
5857
\label{a-less-naive-classifier}
5858
5859
In a previous section we used \py{update_penguin}
5860
to update a prior \py{Pmf} based on observed data
5861
and a collection of \py{norm} objects that model
5862
the distribution of observations under each hypothesis. Here it is
5863
again:
5864
5865
\begin{code}
def update_penguin(prior, data, norm_map):
    hypos = prior.qs
    likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]
    posterior = prior * likelihood
    posterior.normalize()
    return posterior
\end{code}
5873
5874
I wrote this function with \py{norm} objects in
5875
mind, but it also works if the distributions in
5876
\py{norm_map} are
5877
\py{multivariate_normal} objects. So we can call
5878
it like this:
5879
5880
\begin{code}
5881
data = 38, 190
5882
update_penguin(prior, data, multinorm_map)
5883
\end{code}
5884
5885
A penguin with culmen length 38 and flipper length 190 is almost
5886
certainly an Adelie.
5887
5888
\begin{code}
5889
data = 48, 195
5890
update_penguin(prior, data, multinorm_map)
5891
\end{code}
5892
5893
A penguin with culmen length 48 and flipper length 195 is almost
5894
certainly a Chinstrap.
5895
5896
\begin{code}
5897
data = 48, 215
5898
update_penguin(prior, data, multinorm_map)
5899
\end{code}
5900
5901
And a penguin with culmen length 48 and flipper length 215 is almost
5902
certainly a Gentoo.
5903
5904
Let's see if this classifier does any better than the naive Bayesian
5905
classifier. I'll apply it to each penguin in the dataset:
5906
5907
\begin{code}
df['Classification'] = np.nan

for i, row in df.iterrows():
    data = row[varnames]
    posterior = update_penguin(prior, data, multinorm_map)
    df.loc[i, 'Classification'] = posterior.idxmax()
\end{code}
5915
5916
And compute the accuracy:
5917
5918
\begin{code}
5919
accuracy(df)
5920
\end{code}
5921
5922
It turns out to be only a little better: the accuracy is 95.3\%,
5923
compared to 94.7\% for the naive Bayesian classifier.
5924
5925
In one way, that's disappointing. After all that work, it would have
5926
been nice to see a bigger difference.
5927
5928
But in another way, it's good news. In general, a naive Bayesian
5929
classifier is easier to implement and requires less computation. If it
5930
works nearly as well as a more complex algorithm, it might be a good
5931
choice for practical purposes.
5932
5933
But speaking of practical purposes, you might have noticed that this
5934
example isn't very useful. If we want to identify the species of a
5935
penguin, there are easier ways than measuring its flippers and beak.
5936
5937
However, there are valid scientific uses for this type of
5938
classification. One of them is the subject of the research paper we
5939
started with:
5940
\url{https://en.wikipedia.org/wiki/Sexual_dimorphism}{sexual
5941
dimorphism}, that is, differences in shape between male and female
5942
animals.
5943
5944
In some species, like angler fish, males and females look very
5945
different. In other species, like mockingbirds, they are difficult to
5946
tell apart. And dimorphism is worth studying because it provides insight
5947
into social behavior, sexual selection, and evolution.
5948
5949
One way to quantify the degree of sexual dimorphism in a species is to
5950
use a classification algorithm like the one in this chapter. If you can
5951
find a set of features that makes it possible to classify individuals by
5952
sex with high accuracy, that's evidence of high dimorphism.
5953
5954
As an exercise, you can use the dataset from this chapter to classify
5955
penguins by sex and see which of the three species is the most
5956
dimorphic.
5957
5958
\section{Exercises}
5959
5960
The code for this chapter is in \py{chap10.ipynb}, which is in the repository for this book. See Section~\ref{codeinfo} for details.
5961
You can run the notebook on Colab at \url{https://colab.research.google.com/github/AllenDowney/ThinkBayes2/blob/master/code/chap10.ipynb}.
5962
5963
The notebook provides space where you can work on the following problems.
5964
5965
\begin{exercise} In my example I used culmen length and flipper length
5966
because they seemed to provide the most power to distinguish the three
5967
species. But maybe we can do better by using more features.
5968
5969
Make a naive Bayesian classifier that uses all four measurements in the
5970
dataset: culmen length and depth, flipper length, and body mass. Is it
5971
more accurate than the model with two features?
5972
5973
\end{exercise}
5974
5975
5976
\begin{exercise}
5977
5978
One of the reasons the penguin dataset was collected
5979
was to quantify sexual dimorphism in different penguin species, that is,
5980
physical differences between male and female penguins. One way to
5981
quantify dimorphism is to use measurements to classify penguins by sex.
5982
If a species is more dimorphic, we expect to be able to classify its
members more accurately.
5984
5985
As an exercise, pick a species and use a Bayesian classifier (naive or
5986
not) to classify the penguins by sex. Which features are most useful?
5987
What accuracy can you achieve?
5988
\end{exercise}
5989
5990
5991
\chapter{Inference}
5992
5993
Whenever people compare Bayesian inference with conventional
5994
approaches, one of the questions that comes up most often is something
5995
like, ``What about p-values?'' And one of the most common examples is
5996
the comparison of two groups to see if there is a difference in their
5997
means.
5998
5999
In classical statistical inference, the usual tool for this scenario is
6000
a (\url{https://en.wikipedia.org/wiki/Student\%27s_t-test}) Student's
6001
\textit{t}-test, and the result is a
6002
(\url{https://en.wikipedia.org/wiki/P-value}) p-value. This process is
6003
an example of ``null hypothesis significance testing''.
6005
6006
A Bayesian alternative is to compute the posterior distribution of the
6007
difference between the groups. Then we can use that distribution to
6008
answer whatever questions we are interested in, including the most
6009
likely size of the difference, a credible interval that's likely to
6010
contain the true difference, the probability of superiority, or the
6011
probability that the difference exceeds some threshold.
6012
6013
To demonstrate this process, I'll solve a standard problem from a
statistics textbook, comparing the effect of an educational
``treatment'' to a control.
6016
6017
\section{Improving Reading Ability}
6018
6019
We'll use data from a
6020
(\url{https://docs.lib.purdue.edu/dissertations/AAI8807671/})
6021
Ph.D.~dissertation in educational psychology written in 1987, which was used as an example
6022
in a
6023
(\url{https://books.google.com/books/about/Introduction_to_the_practice_of_statisti.html?id=pGBNhajABlUC})
6024
statistics textbook from 1989 and published on
6025
(\url{https://web.archive.org/web/20000603124754/http://lib.stat.cmu.edu/DASL/Datafiles/DRPScores.html}) DASL,
6026
a web page that collects data stories.
6027
6028
Here's the description from DASL:
6029
6030
\begin{quote}
6031
An educator conducted an experiment to test whether new directed reading
6032
activities in the classroom will help elementary school pupils improve
6033
some aspects of their reading ability. She arranged for a third grade
6034
class of 21 students to follow these activities for an 8-week period. A
6035
control classroom of 23 third graders followed the same curriculum
6036
without the activities. At the end of the 8 weeks, all students took a
6037
Degree of Reading Power (DRP) test, which measures the aspects of
6038
reading ability that the treatment is designed to improve.
6039
\end{quote}
6040
6041
The data are in the repository for this book.
6042
I'll use Pandas to load the data into a \py{DataFrame}:
6043
6044
\begin{code}
6045
import pandas as pd
6046
6047
df = pd.read_csv('drp_scores.csv', skiprows=21, delimiter='\t')
6048
\end{code}
6049
6050
And \py{groupby} to separate the data for the
6051
\py{Treated} and \py{Control}
6052
groups:
6053
6054
\begin{code}
grouped = df.groupby('Treatment')
responses = {}

for name, group in grouped:
    responses[name] = group['Response']
\end{code}
6061
6062
Figure~\ref{fig11-01} shows the cumulative distributions of the scores for the two groups, and here are their summary statistics.
6063
6064
\begin{stdout}
Group     n   mean   std
-------  --   ----  ----
Control  23   41.5  17.1
Treated  21   51.5  11.0
\end{stdout}
6070
6071
\begin{figure}
6072
\centerline{\includegraphics[width=4in]{figs/fig11-01.pdf}}
6073
\caption{CDF of test scores for treated group and control group.}
6074
\label{fig11-01}
6075
\end{figure}
6076
6077
The distribution of scores is not exactly normal for either group, but
6078
it is close enough that the normal model is a reasonable choice.
6079
6080
So I'll assume that in the entire population of students (not just the
6081
ones in the experiment), the distribution of scores is well modeled by a
6082
normal distribution with unknown mean and standard deviation. I'll use
6083
\py{mu} and \py{sigma} to
6084
denote these unknown population parameters.
6085
6086
And we'll do a Bayesian update to estimate what they are.
6087
6088
\section{Estimating parameters}
6089
6090
As always, we need a prior distribution for the parameters.
6091
Since there are two parameters, it will be a joint distribution.
6092
I'll construct it by choosing marginal distributions for each parameter
6093
and computing their outer product.
6094
6095
As a simple starting place, I'll assume that the prior distributions for
6096
\py{mu} and \py{sigma} are
6097
uniform.
6098
6099
\begin{code}
6100
mus = np.linspace(20, 80, 101)
6101
prior_mu = Pmf(1, mus, name='mean')
6102
6103
sigmas = np.linspace(5, 30, 101)
6104
prior_sigma = Pmf(1, sigmas, name='std')
6105
\end{code}
6106
6107
Assuming that the parameters are independent, we can use \py{outer_product} from Section~\ref{outer-operations} to construct the joint prior distribution.
6108
6109
\begin{code}
6110
from utils import outer_product
6111
6112
prior = outer_product(prior_mu, prior_sigma)
6113
\end{code}
6114
6115
Now, we would like to know the probability of each score in the dataset
for each hypothetical pair of values, \py{mu} and
\py{sigma}. I'll do that by making a 3-dimensional
grid with values of \py{mu} on the first axis,
values of \py{sigma} on the second axis, and the
scores from the control group on the third axis.
6121
6122
\begin{code}
data = responses['Control']

sigmas, mus, data_mesh = np.meshgrid(prior.columns,
                                     prior.index,
                                     data)
\end{code}
6129
6130
Now we can use \py{norm.pdf} to compute the
6131
probability density of each score for each hypothetical pair of
6132
parameters.
6133
6134
\begin{code}
from scipy.stats import norm

# norm.pdf takes the location (mean) before the scale (standard deviation)
densities = norm.pdf(data_mesh, mus, sigmas)
\end{code}
6139
6140
The result is a 3-D array. To compute likelihoods, I'll compute the
6141
product of these densities along the third axis, that is,
6142
\py{axis=2}:
6143
6144
\begin{code}
6145
likelihood = densities.prod(axis=2)
6146
likelihood.shape
6147
\end{code}
6148
6149
The result is a 2-D array that contains the likelihood of the entire
6150
dataset for each hypothetical pair of parameters.
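The following sketch repeats these steps on a tiny made-up dataset with coarse grids, so the array shapes are easy to inspect. To keep it self-contained it writes out the normal density in a hypothetical helper, \py{normal\_pdf}, instead of calling SciPy. The scores and grids are assumptions for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution, written out so the sketch needs only NumPy."""
    z = (x - mu) / sigma
    return np.exp(-z**2 / 2) / (sigma * np.sqrt(2 * np.pi))

# Made-up scores and coarse parameter grids, for illustration only.
data = np.array([38.0, 45.0, 51.0, 58.0])
mus = np.linspace(20, 80, 7)       # hypothetical means (rows)
sigmas = np.linspace(5, 30, 6)     # hypothetical standard deviations (columns)

# Same argument order as in the text: columns first, then index, then data.
sigma_mesh, mu_mesh, data_mesh = np.meshgrid(sigmas, mus, data)
densities = normal_pdf(data_mesh, mu_mesh, sigma_mesh)
print(densities.shape)             # (7, 6, 4)

# Multiply across the data axis to get one likelihood per (mu, sigma) pair.
likelihood = densities.prod(axis=2)
print(likelihood.shape)            # (7, 6)

# Locate the pair of parameters with the highest likelihood on this grid.
i, j = np.unravel_index(likelihood.argmax(), likelihood.shape)
print(mus[i], sigmas[j])
```

On this grid the best-supported mean is the grid value nearest the sample mean of the made-up scores, which is 48.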
6151
6152
We can use this array as part of a Bayesian update, as in this function:
6153
6154
\begin{code}
from utils import normalize

def update_norm(prior, data):
    X, Y, Z = np.meshgrid(prior.columns, prior.index, data)
    likelihood = norm.pdf(Z, Y, X).prod(axis=2)

    posterior = prior * likelihood
    normalize(posterior)
    return posterior
\end{code}
6165
6166
Here are the updates for the control and treatment groups:
6167
6168
\begin{code}
6169
data = responses['Control']
6170
posterior_control = update_norm(prior, data)
6171
6172
data = responses['Treated']
6173
posterior_treated = update_norm(prior, data)
6174
\end{code}
6175
6176
Figure~\ref{fig11-02} shows what the joint posterior distributions look like.
6177
6178
\begin{figure}
6179
\centerline{\includegraphics[width=4in]{figs/fig11-02.pdf}}
6180
\caption{Joint posterior distributions for the treated and control groups.}
6181
\label{fig11-02}
6182
\end{figure}
6183
6184
Along the vertical axis, it looks like the mean score for the treated
6185
group is higher. Along the horizontal axis, it looks like the standard
6186
deviation for the control group is higher.
6187
6188
If we think the treatment causes these differences, the data suggest
that the treatment increases the mean of the scores and decreases their spread.
6190
We can see these differences more clearly by looking at the marginal
6191
distributions for \py{mu} and
6192
\py{sigma}.
6193
6194
\section{Posterior marginal distributions}
6195
6196
I'll use \py{marginal}, which we saw in Section~\ref{marginals},
6197
to extract the posterior marginal distributions for the population means.
6198
6199
\begin{code}
6200
from utils import marginal
6201
6202
pmf_mean_control = marginal(posterior_control, 1)
6203
pmf_mean_treated = marginal(posterior_treated, 1)
6204
\end{code}
6205
6206
Figure~\ref{fig11-03} shows what they look like.
6207
It seems like we are pretty sure that the population mean in the treated
6208
group is higher.
6209
6210
\begin{figure}
6211
\centerline{\includegraphics[width=4in]{figs/fig11-03.pdf}}
6212
\caption{}
6213
\label{fig11-03}
6214
\end{figure}
6215
6216
We can use \py{prob_gt} to
6217
compute the probability of superiority:
6218
6219
\begin{code}
6220
Pmf.prob_gt(pmf_mean_treated, pmf_mean_control)
6221
\end{code}
6222
6223
There is a 98\% chance that the mean in the treated group is higher.
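The idea behind the probability of superiority can be sketched in a few lines: enumerate all pairs of quantities and add up the probability of the pairs where the first exceeds the second. This version uses plain dictionaries and made-up distributions. It is an illustration of the idea, under the assumption that the two draws are independent, not the implementation in \py{Pmf}.

```python
def prob_gt(pmf1, pmf2):
    """Probability that a draw from pmf1 exceeds an independent draw from pmf2.

    Each argument is a dict that maps quantities to probabilities.
    """
    return sum(p1 * p2
               for q1, p1 in pmf1.items()
               for q2, p2 in pmf2.items()
               if q1 > q2)

# Made-up distributions on a small grid.
treated = {40: 0.1, 50: 0.6, 60: 0.3}
control = {40: 0.5, 50: 0.4, 60: 0.1}
print(prob_gt(treated, control))
```

For these made-up values the qualifying pairs contribute $0.6 \times 0.5 + 0.3 \times 0.5 + 0.3 \times 0.4 = 0.57$.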
6224
6225
We can use \py{sub_dist} to compute the
6226
distribution of the difference.
6227
6228
\begin{code}
6229
diff = Pmf.sub_dist(pmf_mean_treated, pmf_mean_control)
6230
\end{code}
6231
6232
But there are two things to be careful about when we use methods like
6233
\py{sub_dist}.
6234
6235
The first is that the result usually contains more elements than the
6236
original \py{Pmf}.
6237
In this example, the original distributions have the same quantities, so
6238
the size increase is moderate.
6239
But in the worst case, the size of the result can be the product of the
6240
sizes of the originals.
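A small sketch shows both cases. It computes the distribution of a difference by enumerating all pairs, using plain dictionaries and made-up values; \py{sub\_dist} here is a hypothetical stand-in for the \py{Pmf} method of the same name.

```python
from collections import defaultdict

def sub_dist(pmf1, pmf2):
    """Distribution of the difference between independent draws from pmf1 and pmf2."""
    diff = defaultdict(float)
    for q1, p1 in pmf1.items():
        for q2, p2 in pmf2.items():
            diff[q1 - q2] += p1 * p2
    return dict(diff)

# Same grid for both inputs: 3 quantities each, but only 5 distinct differences.
same = sub_dist({0: 0.2, 1: 0.5, 2: 0.3}, {0: 0.3, 1: 0.4, 2: 0.3})
print(len(same))       # 5

# Grids that do not line up: every pair yields a distinct difference.
worst = sub_dist({0: 0.2, 1: 0.5, 2: 0.3}, {0: 0.3, 10: 0.4, 20: 0.3})
print(len(worst))      # 9
```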
6241
6242
The other thing to be aware of is that plotting a
6243
\py{Pmf} does not always work well. In this
6244
example, if we plot the distribution of differences, the result is
6245
pretty noisy.
6246
6247
There are two ways to work around that limitation. One is to plot the
6248
CDF, which smooths out the noise.
6249
6250
The other option is to use kernel density estimation (KDE) to make a
6251
smooth approximation of the PDF on an equally-spaced grid.
6252
The following function takes a \py{Pmf} and the number of points on the grid, and returns a smooth \py{Pmf}, ready for plotting.
6253
6254
\begin{code}
from scipy.stats import gaussian_kde

def make_kde(pmf, n=101):
    kde = gaussian_kde(pmf.qs, weights=pmf.ps)
    qs = np.linspace(pmf.qs.min(), pmf.qs.max(), n)
    ps = kde.evaluate(qs)
    pmf = Pmf(ps, qs)
    pmf.normalize()
    return pmf
\end{code}
6265
6266
Figure~\ref{fig11-04} shows what it looks like.
6267
The mean is almost 10 points, which is substantial.
6268
6269
Finally, we can use \py{credible_interval} to
6270
compute a 90\% credible interval.
6271
6272
\begin{code}
6273
diff.credible_interval(0.9)
6274
\end{code}
6275
6276
Based on the data, we are pretty sure the treatment improves test scores
6277
by 2.4 to 17.4 points.
6278
6279
\section{Using summary statistics}
6280
6281
In this example the dataset is not very big, so it doesn't take too long
6282
to compute the probability of every score under every hypothesis. But
6283
the result is a 3-D array; for larger datasets, it might be too big to
6284
compute practically.
6285
6286
Also, with larger datasets the likelihoods get very small, sometimes so
6287
small that we can't compute them with normal floating-point arithmetic.
6288
That's because we are computing the probability of a particular dataset;
6289
the number of possible datasets is astronomically big, so the
6290
probability of any of them is very small.
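The underflow is easy to reproduce. In this sketch the dataset is made up, and the comparison with a sum of logarithms is included only to show that the problem is one of floating-point range; working with log-likelihoods is a common workaround, but it is not the approach this chapter takes.

```python
from math import log
from statistics import NormalDist

dist = NormalDist(40, 17)

# A made-up "large" dataset: 1000 scores cycling through 20..60.
scores = [20 + i % 41 for i in range(1000)]
densities = [dist.pdf(x) for x in scores]

# The product of 1000 small densities underflows to zero...
product = 1.0
for d in densities:
    product *= d
print(product)            # 0.0

# ...but the sum of their logarithms is an ordinary number.
log_likelihood = sum(log(d) for d in densities)
print(log_likelihood)
```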
6291
6292
An alternative is to compute a summary of the dataset and compute the
6293
likelihood of the summary. For example, if we compute the sample mean of
6294
the data and the sample standard deviation, we could compute the
6295
likelihood of those summary statistics under each hypothesis.
6296
6297
As an example, suppose we know that the population mean is 40 and the
6298
standard deviation is 17. We can make a \py{norm}
6299
object that represents a normal distribution with these parameters:
6300
6301
\begin{code}
6302
mu = 40
6303
sigma = 17
6304
dist = norm(mu, sigma)
6305
\end{code}
6306
6307
Now suppose we draw 1000 samples from this distribution with sample size
6308
\py{n=20}. I'll use \py{rvs},
6309
which generates a random sample, to simulate this experiment.
6310
6311
\begin{code}
6312
n = 20
6313
samples = dist.rvs((1000, n))
6314
samples.shape
6315
\end{code}
6316
6317
The result is an array with 1000 rows and 20 columns; each row contains
a sample of 20 values.
6319
6320
If we compute the mean of each row, the result is an array that contains
6321
1000 sample means; that is, each value is the mean of a sample with
6322
\py{n=20}.
6323
6324
\begin{code}
6325
sample_means = samples.mean(axis=1)
6326
sample_means.shape
6327
\end{code}
6328
6329
Now, we would like to know what the distribution of these sample means
6330
is. Using the properties of the normal distribution,
6331
(\url{https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables}) we
6332
can show that their distribution is normal with mean $\mu$ and
6333
standard deviation $\sigma/\sqrt{n}$:
6334
6335
\begin{code}
6336
dist_m = norm(mu, sigma/np.sqrt(n))
6337
\end{code}
6338
6339
\py{dist_m} represents the ``sampling distribution
6340
of the mean''.
6341
In the notebook for this chapter, you'll see that the random sample means follow the theoretical
6342
distribution closely, as expected.
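A quick version of that check can be done with the standard library alone. This sketch repeats the simulation with \py{random.gauss} instead of SciPy and compares the mean and standard deviation of the sample means with the theoretical values. The seed is arbitrary and only makes the run repeatable.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(17)   # arbitrary seed, so the run is repeatable

mu, sigma, n = 40, 17, 20

# 1000 simulated experiments, each with n scores.
samples = [[random.gauss(mu, sigma) for _ in range(n)] for _ in range(1000)]
sample_means = [fmean(row) for row in samples]

# Each pair of numbers should be close, but not identical.
print(fmean(sample_means), mu)
print(pstdev(sample_means), sigma / sqrt(n))
```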
6343
6344
We can also compute standard deviations for each row in
6345
\py{samples}.
6346
6347
\begin{code}
6348
sample_stds = samples.std(axis=1)
6349
sample_stds.shape
6350
\end{code}
6351
6352
The result is an array of sample standard deviations. We might wonder
6353
what the distribution of these values is. The
6354
(\url{https://en.wikipedia.org/wiki/Normal_distribution\#Sample_variance}) derivation
6355
is not as easy, but if we transform the sample standard deviations like
6356
this:
6357
6358
$t = n s^2 / \sigma^2$
6359
6360
where $n$ is the sample size, $s$ is the sample standard deviation,
6361
and $\sigma$ is the population standard deviation, the transformed
6362
values follow a
6363
(\url{https://en.wikipedia.org/wiki/Chi-square_distribution}) chi-square
6364
distribution with $n-1$ degrees of freedom.
6365
6366
Here are the transformed values.
6367
6368
\begin{code}
6369
transformed = n * sample_stds**2 / sigma**2
6370
\end{code}
6371
6372
And I'll create a \py{chi2} object that represents
6373
a chi-square distribution.
6374
6375
\begin{code}
6376
from scipy.stats import chi2
6377
6378
dist_s = chi2(n-1)
6379
\end{code}
6380
6381
In the notebook you'll see that the distribution of transformed sample standard deviations agrees with
6382
the theoretical distribution.
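A partial version of that check needs only the standard library: a chi-square distribution with $n-1$ degrees of freedom has mean $n-1$ and variance $2(n-1)$, so the transformed values should have roughly those moments. This sketch uses \py{random.gauss} in place of SciPy and an arbitrary seed; it compares moments only, not the whole distribution.

```python
import random
from statistics import fmean, pvariance

random.seed(17)   # arbitrary seed, so the run is repeatable

mu, sigma, n = 40, 17, 20
samples = [[random.gauss(mu, sigma) for _ in range(n)] for _ in range(1000)]

# pvariance divides by n, matching NumPy's default std (ddof=0).
transformed = [n * pvariance(row) / sigma**2 for row in samples]

# Each pair of numbers should be close, but not identical.
print(fmean(transformed), n - 1)
print(pvariance(transformed), 2 * (n - 1))
```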

I think it is useful to check theoretical results like this, for a few
reasons:

\begin{itemize}
\item
It confirms that my understanding of the theory is correct,

\item
It confirms that the conditions where I am applying the theory are
conditions where the theory holds,

\item
It confirms that the implementation details are correct. For many
distributions, there is more than one way to specify the parameters.
If you use the wrong specification, this kind of testing will help you
catch the error.
\end{itemize}

Before we move on, I'll mention one other theoretical result we will
use: Basu's theorem (\url{https://en.wikipedia.org/wiki/Basu\%27s_theorem}),
which implies that, for samples from a normal distribution, the sample
mean and sample standard deviation are independent.
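
A quick simulation can make this result plausible. The sketch below uses illustrative parameters (my assumptions) to compute the correlation between sample means and sample standard deviations, first for normal samples and then, as a contrast, for exponential samples. A correlation near zero is consistent with independence, although it does not prove it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Normal samples: illustrative mean, standard deviation, and sample size
samples = rng.normal(42, 17, size=(10000, 20))
corr_normal = np.corrcoef(samples.mean(axis=1),
                          samples.std(axis=1))[0, 1]

# Exponential samples, where mean and spread are tied together
skewed = rng.exponential(17, size=(10000, 20))
corr_skewed = np.corrcoef(skewed.mean(axis=1),
                          skewed.std(axis=1))[0, 1]

print(corr_normal)   # close to 0
print(corr_skewed)   # clearly positive
```

The contrast is the point: the independence of the two statistics is a property of the normal distribution, not of samples in general.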


\section{Update with summary statistics}

Now we're ready to do an update. I'll compute summary statistics for the
two groups.

\begin{code}
summary = {}
for name, response in responses.items():
    summary[name] = (len(response),
                     response.mean(),
                     response.std())
\end{code}

The result is a dictionary that maps from group name to a tuple that
contains the sample size, \py{n}, the sample mean,
\py{m}, and the sample standard deviation
\py{s}, for each group.

I'll demonstrate the update with the summary statistics from the control
group.

\begin{code}
n, m, s = summary['Control']
\end{code}

I'll make a mesh with hypothetical values of
\py{mu} on the vertical axis and values of
\py{sigma} on the horizontal axis.

\begin{code}
sigmas, mus = np.meshgrid(prior.columns, prior.index)
sigmas.shape
\end{code}

Now we can compute the likelihood of seeing the sample mean,
\py{m}, for each pair of parameters.

\begin{code}
like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
\end{code}

And use it to update the prior.

\begin{code}
posterior1 = prior * like1
normalize(posterior1)
\end{code}

Next we compute the likelihood of seeing the sample standard deviation, \py{s}, for each pair of parameters.

\begin{code}
like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)
\end{code}

And here's the second update:

\begin{code}
posterior2 = posterior1 * like2
normalize(posterior2)
\end{code}

The following function does both updates, using the sample mean and
standard deviation.

\begin{code}
def update_norm_summary(prior, data):
    n, m, s = data
    sigmas, mus = np.meshgrid(prior.columns, prior.index)

    like1 = norm.pdf(m, mus, sigmas/np.sqrt(n))
    like2 = chi2.pdf(n * s**2 / sigmas**2, n-1)

    posterior = prior * like1 * like2
    normalize(posterior)

    return posterior
\end{code}
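
To see the function run end to end, here is a self-contained sketch. It rebuilds a uniform prior as a pandas \py{DataFrame} with \py{mu} on the index and \py{sigma} on the columns, and it normalizes by dividing by the total rather than calling the \py{normalize} helper. The grid ranges and the summary statistics are hypothetical values chosen for illustration, not the data from this chapter.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, chi2

def update_norm_summary(prior, data):
    """Update a joint prior (mu on index, sigma on columns)
    using the summary statistics (n, m, s)."""
    n, m, s = data
    sigmas, mus = np.meshgrid(prior.columns, prior.index)

    like1 = norm.pdf(m, mus, sigmas / np.sqrt(n))    # sample mean
    like2 = chi2.pdf(n * s**2 / sigmas**2, n - 1)    # sample std

    posterior = prior * like1 * like2
    return posterior / posterior.to_numpy().sum()

# Hypothetical grids for mu (rows) and sigma (columns)
mu_qs = np.linspace(20, 80, 101)
sigma_qs = np.linspace(5, 30, 101)
prior = pd.DataFrame(1.0, index=mu_qs, columns=sigma_qs)
prior /= prior.to_numpy().sum()

# Hypothetical summary statistics: n, sample mean, sample std
posterior = update_norm_summary(prior, (23, 41.5, 17.1))

# Marginal distribution of mu: sum over the sigma columns
marginal_mu = posterior.sum(axis=1)
print(marginal_mu.idxmax())
```

One detail to watch if you adapt this: the transformation $n s^2 / \sigma^2$ matches a standard deviation computed with \py{ddof=0}, which is the NumPy default, whereas the \py{std} method of a pandas \py{Series} defaults to \py{ddof=1}.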

Here are the updates for the two groups.

\begin{code}
data = summary['Control']
posterior_control2 = update_norm_summary(prior, data)

data = summary['Treated']
posterior_treated2 = update_norm_summary(prior, data)
\end{code}

You can see the results in the notebook for this chapter.
Visually, these posterior joint distributions are similar to the ones we
computed using the entire datasets, not just the summary statistics.
But they are not exactly the same, as we'll see by comparing the marginal
distributions.

\section{Comparing marginals}

Again, let's extract the marginal posterior distributions.

\begin{code}
pmf_mean_control2 = marginal(posterior_control2, 1)
pmf_mean_treated2 = marginal(posterior_treated2, 1)
\end{code}

And compare them to the results we got using the entire dataset.
Figure~\ref{fig11-05} shows the results.

\begin{figure}
\centerline{\includegraphics[width=4in]{figs/fig11-05.pdf}}
\caption{}
\label{fig11-05}
\end{figure}

For both groups, the distribution of \py{mu} is a little wider when we use only the summary statistics; that is, we are a little less certain about the values of the means.

If we compute the posterior distribution of the difference in means,
the mean difference is nearly the same, but the credible interval is a bit wider.

That's because the update we did is based on the implicit assumption
that the distribution of the data is actually normal, but it's not.
As a result, when we replace the dataset with the summary statistics, we lose some information about the true distribution of the data. With less
information, we are less certain about the parameters.

\section{Summary}

In this chapter we used a joint distribution to represent prior
probabilities for the parameters of a normal distribution,
\py{mu} and \py{sigma}.

And we updated that distribution two ways: first using the entire
dataset and the normal PDF; then using summary statistics, the normal
PDF, and the chi-square PDF.

Using summary statistics is computationally more efficient, but it loses
some information in the process.

Normal distributions appear in many domains, as do other
distributions that are well approximated by normal distributions. So the
methods in this chapter are broadly applicable. The exercises at the end
of the chapter will give you a chance to apply them.

\section{Exercises}

\begin{exercise}
Looking again at the posterior joint distribution of
\py{mu} and \py{sigma}, it
seems like the standard deviation of the treated group might be lower;
if so, that would suggest that the treatment is more effective for
students with lower scores.

But before we speculate too much, we should estimate the size of the
difference and see whether it might actually be 0.

As we did with the values of \py{mu} in the
previous section, extract the posterior marginal distributions of
\py{sigma} for the two groups. What is the
probability that the standard deviation is higher in the control group?

Compute the distribution of the difference in
\py{sigma} between the two groups. What is the mean
of this difference? What is the 90\% credible interval?

\end{exercise}


\begin{exercise}
An ``effect size'' is a statistic intended to quantify the magnitude of a phenomenon (see \url{http://en.wikipedia.org/wiki/Effect_size}).
If the phenomenon is a difference in means between two groups, a common way to quantify it is Cohen's effect size, denoted $d$.

If the parameters for Group 1 are $(\mu_1, \sigma_1)$, and the
parameters for Group 2 are $(\mu_2, \sigma_2)$, Cohen's
effect size is
%
\[ d = \frac{\mu_1 - \mu_2}{(\sigma_1 + \sigma_2)/2} \]
%
Use the joint posterior distributions for the two groups to compute the posterior distribution for Cohen's effect size.
Then compute the mean and 90\% credible interval.

Hint: if enumerating all pairs from the two distributions takes too
long, consider random sampling.
\end{exercise}


\begin{exercise}
This exercise is inspired by a question that appeared on Reddit
(\url{https://www.reddit.com/r/statistics/comments/hcvl2j/q_reverse_empirical_distribution_rule_question/}).

An instructor announces the results of an exam like this, ``The average
score on this exam was 81. Out of 25 students, 5 got more than 90, and I
am happy to report that no one failed (got less than 60).''

Based on this information, what do you think the standard deviation of
scores was?

You can assume that the distribution of scores is approximately normal.
And let's assume that the sample mean, 81, is actually the population
mean, so we only have to estimate \py{sigma}.

Hint: To compute the probability of a score greater than 90, you can use
\py{norm.sf}, which computes the survival function,
also known as the complementary CDF, or
\py{1 - cdf(x)}.

\end{exercise}


\begin{exercise}
I have a soft spot for crank science, so this
exercise is about the
\href{http://en.wikipedia.org/wiki/Variability_hypothesis}{Variability
Hypothesis}, which

\begin{quote}
``originated in the early nineteenth century with Johann Meckel, who
argued that males have a greater range of ability than females,
especially in intelligence. In other words, he believed that most
geniuses and most mentally retarded people are men. Because he
considered males to be the 'superior animal,' Meckel concluded that
females' lack of variation was a sign of inferiority.''
\end{quote}

I particularly like that last part because I suspect that if it turned
out that women were \emph{more} variable, Meckel would have taken that
as a sign of inferiority, too.

Nevertheless, the Variability Hypothesis suggests an exercise we can use
to practice the methods in this chapter. Let's look at the distribution
of heights for men and women in the U.S. and see who is more variable.

I used 2018 data from the CDC's
\href{https://www.cdc.gov/brfss/annual_data/annual_2018.html}{Behavioral
Risk Factor Surveillance System} (BRFSS), which includes self-reported
heights from 154407 men and 254722 women.

Here's what I found:

\begin{itemize}
\item
The average height for men is 178 cm; the average height for women is
163 cm. So men are taller on average; no surprise there.
\item
For men the standard deviation is 8.27 cm; for women it is 7.75 cm. So
in absolute terms, men's heights are more variable.
\end{itemize}

But to compare variability between groups, it is more meaningful to use
the coefficient of variation (CV)
(\url{https://en.wikipedia.org/wiki/Coefficient_of_variation}), which is
the standard deviation divided by the mean.
It is a dimensionless measure of variability relative to scale.

For men CV is 0.0465; for women it is 0.0475. The coefficient of
variation is higher for women, so this dataset provides evidence against
the Variability Hypothesis. But we can use Bayesian methods to make that
conclusion more precise.
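
The coefficients of variation quoted above follow directly from the summary statistics. This short check reproduces them; the variable names are mine, and the numbers are the rounded values from the list above.

```python
# Summary statistics from the list above (heights in cm)
mean_m, std_m = 178, 8.27
mean_f, std_f = 163, 7.75

# Coefficient of variation: standard deviation divided by mean
cv_m = std_m / mean_m
cv_f = std_f / mean_f

print(round(cv_m, 4), round(cv_f, 4))
print(cv_m / cv_f)   # ratio of the point estimates, a little below 1
```

These are point estimates only; the exercise asks for the posterior distribution of the ratio, which also captures how uncertain it is.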

Use these summary statistics to compute the posterior distribution of
\py{mu} and \py{sigma} for the
distributions of male and female height. Use
\py{Pmf.div_dist} to compute posterior
distributions of CV. Based on this dataset and the assumption that the
distribution of height is normal, what is the probability that the
coefficient of variation is higher for men? What is the most likely
ratio of the CVs and what is the 90\% credible interval for that ratio?

Hint: Use different prior distributions for the two groups, and choose
them so they cover all parameters with non-negligible probability.

\end{exercise}


\chapter{Observer Bias}
\label{observer}

\section{The Red Line problem}

In Massachusetts, the Red Line is a subway that connects
Cambridge and Boston. When I was working in Cambridge I took the Red
Line from Kendall Square to South Station and caught the commuter rail
to Needham. During rush hour Red Line trains run every 7--8
minutes, on average.
\index{Red Line problem}
\index{Boston}

When I arrived at the station, I could estimate the time until
the next train based on the number of passengers on the platform.
If there were only a few people, I inferred that I just missed
a train and expected to wait about 7 minutes. If there were
more passengers, I expected the train to arrive sooner. But if
there were a large number of passengers, I suspected that
trains were not running on schedule, so I would go back to the
street level and get a taxi.

While I was waiting for trains, I thought about how Bayesian
estimation could help predict my wait time and decide when I
should give up and take a taxi. This chapter presents the
analysis I came up with.

This chapter is based on a project by Brendan Ritter and
Kai Austin, who took a class with me at Olin College.
The code in this chapter is available from
\url{http://thinkbayes.com/redline.py}. The code I used
to collect data is in \url{http://thinkbayes.com/redline_data.py}.
For more information
see Section~\ref{download}.
\index{Olin College}


\section{The model}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline0.pdf}}
\caption{PMF of gaps between trains, based on collected data,
smoothed by KDE. \py{z} is the actual distribution; \py{zb}
is the biased distribution seen by passengers. }
\label{fig.redline0}
\end{figure}

Before we get to the analysis, we have to make some
modeling decisions. First, I will treat passenger arrivals as
a Poisson process, which means I assume that passengers are equally
likely to arrive at any time, and that they arrive at an unknown
rate, $\lam$, measured in passengers per minute. Since I
observe passengers during a short period of time, and at the same
time every day, I assume that $\lam$ is constant.
\index{Poisson process}

On the other hand, the arrival process for trains is not Poisson.
Trains to Boston are supposed to leave from the end of the line
(Alewife station) every 7--8 minutes during peak times, but by the time
they get to Kendall Square, the time between trains varies between 3
and 12 minutes.

To gather data on the time between trains, I wrote a script that
downloads real-time data from
\url{http://www.mbta.com/rider_tools/developers/}, selects south-bound
trains arriving at Kendall Square, and records their arrival times
in a database. I ran the script from 4pm to 6pm every weekday
for 5 days, and recorded about 15 arrivals per day. Then
I computed the time between consecutive arrivals; the distribution
of these gaps is shown in Figure~\ref{fig.redline0}, labeled \py{z}.

If you stood on the platform from 4pm to 6pm and recorded the time
between trains, this is the distribution you would see. But if you
arrive at some random time (without regard to the train schedule) you
would see a different distribution. The average time
between trains, as seen by a random passenger, is substantially
higher than the true average.

Why? Because a passenger is more likely to arrive during a
large interval than a small one. Consider a simple example:
suppose that the time between trains is either 5 minutes
or 10 minutes with equal probability. In that case
the average time between
trains is 7.5 minutes.

But a passenger is more likely to arrive during a 10 minute gap
than a 5 minute gap; in fact, twice as likely. If we surveyed
arriving passengers, we would find that 2/3 of them arrived during
a 10 minute gap, and only 1/3 during a 5 minute gap. So the
average time between trains, as seen by an arriving passenger,
is 8.33 minutes.

This kind of {\bf observer bias} appears in many contexts. Students
think that classes are bigger than they are because more of them are
in the big classes. Airline passengers think that planes are fuller
than they are because more of them are on full flights.
\index{observer bias}

In each case, values from the actual distribution are
oversampled in proportion to their value. In the Red Line example,
a gap that is twice as big is twice as likely to be observed.

So given the actual distribution of gaps, we can compute the
distribution of gaps as seen by passengers. \py{BiasPmf}
does this computation:

\begin{code}
def BiasPmf(pmf):
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
\end{code}

\py{pmf} is the actual distribution; \verb"new_pmf" is the
biased distribution. Inside the loop, we multiply the
probability of each value, \py{x}, by the likelihood it will
be observed, which is proportional to \py{x}. Then we
normalize the result.
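
To check the arithmetic in the 5 and 10 minute example, here is a minimal sketch of the same computation using plain dictionaries in place of the \py{Pmf} class. The function names \py{bias_pmf} and \py{pmf_mean} are hypothetical stand-ins of my own, not part of \py{thinkbayes}.

```python
def bias_pmf(pmf):
    """Return the length-biased version of a PMF stored as a dict
    that maps each value to its probability."""
    biased = {x: p * x for x, p in pmf.items()}   # weight by the value
    total = sum(biased.values())
    return {x: p / total for x, p in biased.items()}

def pmf_mean(pmf):
    """Mean of a PMF stored as a dict."""
    return sum(x * p for x, p in pmf.items())

# Gaps of 5 or 10 minutes with equal probability
pmf_z = {5: 0.5, 10: 0.5}
pmf_zb = bias_pmf(pmf_z)

print(pmf_mean(pmf_z))    # average gap between trains
print(pmf_zb)             # gap distribution as seen by passengers
print(pmf_mean(pmf_zb))   # average gap as seen by passengers
```

The biased distribution puts probability 1/3 on the 5 minute gap and 2/3 on the 10 minute gap, so its mean is 25/3, or about 8.33 minutes, as stated above.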

Figure~\ref{fig.redline0} shows the actual distribution of gaps,
labeled \py{z}, and the distribution of gaps seen by passengers,
labeled \py{zb} for ``z biased''.


\section{Wait times}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline2.pdf}}
\caption{CDF of \py{z}, \py{zb}, and the wait time seen
by passengers, \py{y}. }
\label{fig.redline2}
\end{figure}

Wait time, which I call \py{y}, is the time between the arrival
of a passenger and the next arrival of a train. Elapsed time, which I
call \py{x}, is the time between the arrival of the previous
train and the arrival of a passenger. I chose these definitions
so that \py{zb = x + y}.

Given the distribution of \py{zb}, we can compute the distribution of
\py{y}. I'll start with a simple case and then generalize.
Suppose, as in the previous example, that \py{zb} is either 5 minutes
with probability 1/3, or 10 minutes with probability 2/3.

If we arrive at a random time during a 5 minute gap,
\py{y} is uniform from 0 to 5 minutes. If we arrive during a 10
minute gap, \py{y} is uniform from 0 to 10. So the overall
distribution is a mixture of uniform distributions weighted
according to the probability of each gap.
\index{uniform distribution}

The following function takes the distribution of \py{zb} and
computes the distribution of \py{y}:

\begin{code}
def PmfOfWaitTime(pmf_zb):
    metapmf = thinkbayes.Pmf()
    for gap, prob in pmf_zb.Items():
        uniform = MakeUniformPmf(0, gap)
        metapmf.Set(uniform, prob)

    pmf_y = thinkbayes.MakeMixture(metapmf)
    return pmf_y
\end{code}

\py{PmfOfWaitTime} makes a meta-Pmf that maps from each uniform
distribution to its probability. Then it uses \py{MakeMixture},
which we saw in Section~\ref{mixture}, to compute the mixture.
\index{mixture}
\index{MakeMixture}
\index{meta-Pmf}

\py{PmfOfWaitTime} also uses \py{MakeUniformPmf}, defined here:

\begin{code}
def MakeUniformPmf(low, high):
    pmf = thinkbayes.Pmf()
    for x in MakeRange(low=low, high=high):
        pmf.Set(x, 1)
    pmf.Normalize()
    return pmf
\end{code}

\py{low} and \py{high} are the range of the uniform distribution
(both ends included). Finally, \py{MakeUniformPmf} uses {\tt
MakeRange}, defined here:

\begin{code}
def MakeRange(low, high, skip=10):
    return range(low, high+skip, skip)
\end{code}

\py{MakeRange} defines a set of possible values for wait time
(expressed in seconds). By default it divides the range into
10 second intervals.
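
The three functions above can be sketched without \py{thinkbayes} to see the mixture at work on the simple example. In this version, which is my own illustration, distributions are plain dictionaries, the meta-Pmf is a list of (distribution, weight) pairs because dictionaries cannot be used as keys, and the names \py{make_uniform} and \py{make_mixture} are hypothetical.

```python
def make_uniform(low, high, skip=10):
    """Uniform PMF on low, low+skip, ..., high (both ends included)."""
    xs = range(low, high + skip, skip)
    return {x: 1 / len(xs) for x in xs}

def make_mixture(metapmf):
    """Mix a list of (pmf, weight) pairs into a single PMF."""
    mix = {}
    for pmf, weight in metapmf:
        for x, p in pmf.items():
            mix[x] = mix.get(x, 0) + weight * p
    return mix

# Biased gap distribution from the example, in seconds:
# 5 minutes with probability 1/3, 10 minutes with probability 2/3
pmf_zb = {300: 1/3, 600: 2/3}

metapmf = [(make_uniform(0, gap), prob) for gap, prob in pmf_zb.items()]
pmf_y = make_mixture(metapmf)

mean_y = sum(x * p for x, p in pmf_y.items())
print(mean_y / 60)   # mean wait time in minutes
```

The mean wait comes out to 250 seconds, about 4.17 minutes, which is half the mean of the biased gap distribution (8.33 minutes).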

To encapsulate the process of computing these distributions, I
created a class called \py{WaitTimeCalculator}:

\begin{code}
class WaitTimeCalculator(object):

    def __init__(self, pmf_z):
        self.pmf_z = pmf_z
        self.pmf_zb = BiasPmf(pmf_z)

        self.pmf_y = PmfOfWaitTime(self.pmf_zb)
        self.pmf_x = self.pmf_y
\end{code}

The parameter, \verb"pmf_z", is the unbiased distribution of \py{z}.
\verb"pmf_zb" is the biased distribution of gap time, as seen by
passengers.

\verb"pmf_y" is the distribution of wait time. \verb"pmf_x" is the
distribution of elapsed time, which is the same as the distribution of
wait time. To see why, remember that for a particular value of
\py{zb}, the distribution of \py{y} is uniform from 0 to \py{zb}.
Also
%
\begin{code}
x = zb - y
\end{code}
%
So the distribution of \py{x} is also uniform from 0 to \py{zb}.

Figure~\ref{fig.redline2} shows the distribution of \py{z}, \py{zb},
and \py{y} based on the data I collected from the Red Line web site.

To present these distributions, I am switching from Pmfs to Cdfs.
Most people are more familiar with Pmfs, but I think Cdfs are easier
to interpret, once you get used to them. And if you want to plot
several distributions on the same axes, Cdfs are the way to go.
\index{Cdf}
\index{cumulative distribution function}

The mean of \py{z} is 7.8 minutes. The mean of \py{zb} is 8.8
minutes, about 13\% higher. The mean of \py{y} is 4.4 minutes, half
the mean of \py{zb}.

As an aside, the Red Line schedule reports that trains run every
9 minutes during peak times. This is close to the average of
\py{zb}, but higher than the average of \py{z}. I exchanged email
with a representative of the MBTA, who confirmed that the reported
time between trains is deliberately conservative in order to
account for variability.


\section{Predicting wait times}
\label{elapsed}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline3.pdf}}
\caption{Prior and posterior of \py{x} and predicted \py{y}. }
\label{fig.redline3}
\end{figure}

Let's get back to the motivating question: suppose that when
I arrive at the platform I see 15 people waiting.
How long should I expect to wait until the next train arrives?

As always, let's start with the easiest version of the problem
and work our way up. Suppose we are given the actual distribution of
\py{z}, and we know that the passenger arrival rate,
$\lam$, is 2 passengers per minute.

In that case we can:

\begin{enumerate}

\item Use the distribution of \py{z} to compute
the prior distribution of \py{zb}, the time between trains
as seen by a passenger.

\item Then we can use the number of passengers to estimate the distribution
of \py{x}, the elapsed time since the last train.

\item Finally, we use the relation \py{y = zb - x} to get the
distribution of \py{y}.

\end{enumerate}

The first step is to create a \py{WaitTimeCalculator} that
encapsulates the distributions of \py{zb}, \py{x},
and \py{y}, prior to taking into account the number of
passengers.

\begin{code}
wtc = WaitTimeCalculator(pmf_z)
\end{code}

\verb"pmf_z" is the given distribution of gap times.

The next step is to make an \py{ElapsedTimeEstimator} (defined
below), which encapsulates the posterior distribution of \py{x} and
the predictive distribution of \py{y}.
\index{predictive distribution}

\begin{code}
ete = ElapsedTimeEstimator(wtc,
                           lam=2.0/60,
                           num_passengers=15)
\end{code}

The parameters are the \py{WaitTimeCalculator}, the passenger
arrival rate, \py{lam} (expressed in passengers per second),
and the observed number of passengers, let's say 15.

Here is the definition of \py{ElapsedTimeEstimator}:

\begin{code}
class ElapsedTimeEstimator(object):

    def __init__(self, wtc, lam, num_passengers):
        self.prior_x = Elapsed(wtc.pmf_x)

        self.post_x = self.prior_x.Copy()
        self.post_x.Update((lam, num_passengers))

        self.pmf_y = PredictWaitTime(wtc.pmf_zb, self.post_x)
\end{code}

\verb"prior_x" and \verb"post_x" are the prior and
posterior distributions of elapsed time. \verb"pmf_y" is
the predictive distribution of wait time.

\py{ElapsedTimeEstimator} uses \py{Elapsed} and \py{PredictWaitTime},
defined below.

\py{Elapsed} is a Suite that represents the hypothetical
distribution of \py{x}. The prior distribution of \py{x}
comes straight from the \py{WaitTimeCalculator}. Then we
use the data, which consists of the arrival rate, \py{lam},
and the number of passengers on the platform, to compute
the posterior distribution.

Here's the definition of \py{Elapsed}:

\begin{code}
class Elapsed(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        x = hypo
        lam, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * x)
        return like
\end{code}

As always, \py{Likelihood} takes a hypothesis and data, and
computes the likelihood of the data under the hypothesis.
In this case \py{hypo} is the elapsed time since the last train
and \py{data} is a tuple of \py{lam} and the number of
passengers.
\index{likelihood}

The likelihood of the data is the probability of getting
\py{k} arrivals in \py{x} time, given arrival rate
\py{lam}. We compute that using the PMF of the Poisson
distribution.
\index{Poisson distribution}
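
Here is a minimal sketch of this update using NumPy and SciPy in place of the \py{Suite} class. One loud assumption: it starts from a flat prior over a hypothetical grid of elapsed times, whereas the analysis in this chapter uses the prior from the \py{WaitTimeCalculator}, so the numbers will differ from those in the figure.

```python
import numpy as np
from scipy.stats import poisson

lam = 2.0 / 60    # arrival rate, passengers per second
k = 15            # passengers observed on the platform

# Hypothetical grid of elapsed times, in seconds (10 s to 20 min)
xs = np.arange(10, 1210, 10)

# Flat prior: an assumption for illustration only
prior = np.ones(len(xs)) / len(xs)

# Likelihood of k arrivals in x seconds at rate lam
likelihood = poisson.pmf(k, lam * xs)

posterior = prior * likelihood
posterior /= posterior.sum()

mode_x = xs[np.argmax(posterior)]
mean_x = np.sum(xs * posterior)
print(mode_x / 60, mean_x / 60)   # in minutes
```

With a flat prior, the most likely elapsed time is $k / \lambda = 450$ seconds, or 7.5 minutes, which is the time it takes 15 passengers to arrive at 2 per minute.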

Finally, here's the definition of \py{PredictWaitTime}:

\begin{code}
def PredictWaitTime(pmf_zb, pmf_x):
    pmf_y = pmf_zb - pmf_x
    RemoveNegatives(pmf_y)
    return pmf_y
\end{code}

\verb"pmf_zb" is the distribution of gaps between trains;
\verb"pmf_x" is the distribution of elapsed time, based on
the observed number of passengers. Since \py{y = zb - x},
we can compute

\begin{code}
pmf_y = pmf_zb - pmf_x
\end{code}

The subtraction operator invokes \verb"Pmf.__sub__", which enumerates
all pairs of \py{zb} and \py{x}, computes the differences, and adds
the results to \verb"pmf_y".

The resulting Pmf includes some negative values, which we know are
impossible. For example, if you arrive during a gap of 5 minutes, you
can't wait more than 5 minutes. \py{RemoveNegatives} removes the
impossible values from the distribution and renormalizes.

\begin{code}
def RemoveNegatives(pmf):
    # copy the values so items can be removed while iterating
    for val in list(pmf.Values()):
        if val < 0:
            pmf.Remove(val)
    pmf.Normalize()
\end{code}
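
The subtraction and cleanup steps can be sketched with plain dictionaries. The function names \py{sub_pmfs} and \py{remove_negatives} are hypothetical stand-ins for \verb"Pmf.__sub__" and \py{RemoveNegatives}, and the distribution of \py{x} below is made up for illustration. Like the original, this sketch enumerates all pairs and multiplies their probabilities.

```python
def sub_pmfs(pmf1, pmf2):
    """Distribution of v1 - v2 over all pairs of values."""
    diff = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            diff[v1 - v2] = diff.get(v1 - v2, 0) + p1 * p2
    return diff

def remove_negatives(pmf):
    """Drop negative values and renormalize; returns a new dict."""
    kept = {v: p for v, p in pmf.items() if v >= 0}
    total = sum(kept.values())
    return {v: p / total for v, p in kept.items()}

# Gap distribution from the simple example, in minutes
pmf_zb = {5: 1/3, 10: 2/3}

# Made-up posterior distribution of elapsed time, in minutes
pmf_x = {2: 0.25, 6: 0.5, 8: 0.25}

pmf_y = remove_negatives(sub_pmfs(pmf_zb, pmf_x))
print(sorted(pmf_y.items()))
```

The pairs (5, 6) and (5, 8) give negative wait times and are dropped; they carried a quarter of the probability, so the remaining values are scaled up by 4/3.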

Figure~\ref{fig.redline3} shows the results. The prior distribution
of \py{x} is the same as the distribution of \py{y} in
Figure~\ref{fig.redline2}. The posterior distribution of \py{x}
shows that, after seeing 15 passengers on the platform, we believe
that the time since the last train is probably 5--10 minutes. The
predictive distribution of \py{y} indicates that we expect the next
train in less than 5 minutes, with about 80\% confidence.
\index{predictive distribution}


\section{Estimating the arrival rate}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline1.pdf}}
\caption{Prior and posterior distributions of \py{lam} based
on five days of passenger data. }
\label{fig.redline1}
\end{figure}

The analysis so far has been based on the assumption that we know (1)
the distribution of gaps and (2) the passenger arrival rate. Now we
are ready to relax the second assumption.

Suppose that you just moved to Boston, so you don't know much about
the passenger arrival rate on the Red Line. After a few days of
commuting, you could make a guess, at least qualitatively. With
a little more effort, you could estimate $\lam$ quantitatively.
\index{arrival rate}

Each day when you arrive at the platform, you should note the
time and the number of passengers waiting (if the platform is too
big, you could choose a sample area). Then you should record your
wait time and the
number of new arrivals while you are waiting.

After five days, you might have data like this:
%
\begin{code}
k1      y     k2
--     ---    --
17     4.6     9
22     1.0     0
23     1.4     4
18     5.4    12
 4     5.8    11
\end{code}
%
where \py{k1} is the number of passengers waiting when you arrive,
\py{y} is your wait time in minutes, and \py{k2} is the number of
passengers who arrive while you are waiting.

Over the course of one week, you waited about 18 minutes and saw 36
passengers arrive, so you would estimate that the arrival rate is
about 2 passengers per minute. For practical purposes that estimate is
good enough, but for the sake of completeness I
will compute a posterior distribution for $\lam$ and show how
to use that distribution in the rest of the analysis.

\py{ArrivalRate} is a \py{Suite} that represents hypotheses about
$\lam$. As always, \py{Likelihood} takes a hypothesis and data,
and computes the likelihood of the data under the hypothesis.

In this case the hypothesis is a value of $\lam$. The data is a
pair, \py{y, k}, where \py{y} is a wait time and \py{k} is the
number of passengers that arrived.

\begin{code}
class ArrivalRate(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        lam = hypo
        y, k = data
        like = thinkbayes.EvalPoissonPmf(k, lam * y)
        return like
\end{code}

This \py{Likelihood} might look familiar; it
is almost identical to \py{Elapsed.Likelihood} in
Section~\ref{elapsed}. The difference is that in {\tt
Elapsed.Likelihood} the hypothesis is \py{x}, the elapsed time; in
\py{ArrivalRate.Likelihood} the hypothesis is \py{lam}, the arrival
rate. But in both cases the likelihood is the probability of seeing
\py{k} arrivals in some period of time, given \py{lam}.

\py{ArrivalRateEstimator} encapsulates the process of estimating
$\lam$. The parameter, \verb"passenger_data", is a list
of \py{k1, y, k2} tuples, as in the table above.
\index{numpy}

\begin{code}
class ArrivalRateEstimator(object):

    def __init__(self, passenger_data):
        low, high = 0, 5
        n = 51
        hypos = numpy.linspace(low, high, n) / 60

        self.prior_lam = ArrivalRate(hypos)

        self.post_lam = self.prior_lam.Copy()
        for k1, y, k2 in passenger_data:
            self.post_lam.Update((y, k2))
\end{code}

\verb"__init__" builds
\py{hypos}, which is a sequence of hypothetical values for \py{lam},
then builds the prior distribution, \verb"prior_lam".
The \py{for} loop updates the prior with data, yielding the posterior
distribution, \verb"post_lam".
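
For readers who want to reproduce the estimate without \py{thinkbayes}, here is a sketch of the same grid update using NumPy and SciPy. It is my own illustration under two assumptions: it keeps \py{lam} in passengers per minute so the units match the wait times in the table, and it starts the grid at 0.1 rather than 0, since a rate of exactly 0 would get zero likelihood as soon as one passenger arrives.

```python
import numpy as np
from scipy.stats import poisson

# (k1, y, k2) tuples from the table; y is in minutes
passenger_data = [
    (17, 4.6, 9),
    (22, 1.0, 0),
    (23, 1.4, 4),
    (18, 5.4, 12),
    (4, 5.8, 11),
]

# Hypothetical grid of arrival rates, passengers per minute
lams = np.linspace(0.1, 5, 50)

# Flat prior, updated once per day of data
posterior = np.ones(len(lams))
for k1, y, k2 in passenger_data:
    posterior *= poisson.pmf(k2, lams * y)
posterior /= posterior.sum()

mean_lam = np.sum(lams * posterior)
mode_lam = lams[np.argmax(posterior)]
print(mean_lam, mode_lam)
```

The posterior mean and mode both land close to the simple estimate of 36 passengers in about 18 minutes, or roughly 2 passengers per minute.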

Figure~\ref{fig.redline1} shows
the prior and posterior distributions. As expected, the mean and
median of the posterior are near the observed rate, 2 passengers per
minute. But the spread of the posterior distribution captures our
uncertainty about $\lam$ based on a small sample.


\section{Incorporating uncertainty}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline4.pdf}}
\caption{Predictive distributions of \py{y} for possible values
of \py{lam}. }
\label{fig.redline4}
\end{figure}

Whenever there is uncertainty about one of the inputs to an analysis,
we can take it into account by a process like this:
\index{uncertainty}

\begin{enumerate}

\item Implement the analysis based on a deterministic value of the
uncertain parameter (in this case $\lam$).

\item Compute the distribution of the uncertain parameter.

\item Run the analysis for each value of the parameter, and generate a
set of predictive distributions.
\index{predictive distribution}

\item Compute a mixture of the predictive distributions, using the
weights from the distribution of the parameter.
\index{mixture}

\end{enumerate}

We have already done steps (1) and (2). I wrote a class
called \py{WaitMixtureEstimator} to handle steps (3) and (4).
7233
7234
\begin{code}
7235
class WaitMixtureEstimator(object):
7236
7237
def __init__(self, wtc, are, num_passengers=15):
7238
self.metapmf = thinkbayes.Pmf()
7239
7240
for lam, prob in sorted(are.post_lam.Items()):
7241
ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
7242
self.metapmf.Set(ete.pmf_y, prob)
7243
7244
self.mixture = thinkbayes.MakeMixture(self.metapmf)
7245
\end{code}
7246
7247
\py{wtc} is the \py{WaitTimeCalculator} that contains the
7248
distribution of \py{zb}. \py{are} is the \py{ArrivalTimeEstimator}
7249
that contains the distribution of \py{lam}.
7250
7251
The first line makes a meta-Pmf that maps from each possible
7252
distribution of \py{y} to its probability. For each value
7253
of \py{lam}, we use \py{ElapsedTimeEstimator} to
7254
compute the corresponding distribution of
7255
\py{y} and store it in the Meta-Pmf. Then
7256
we use \py{MakeMixture} to compute the mixture.
7257
\index{MakeMixture}
7258
\index{meta-Pmf}
7259
\index{mixture}
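
As a rough, self-contained illustration of steps (3) and (4), the
following sketch computes a mixture using plain dictionaries instead
of \py{thinkbayes.Pmf}. The function name \verb"make_mixture" and the
toy numbers are assumptions made for this example; they are not part
of \py{thinkbayes.py}.

```python
def make_mixture(meta):
    """Mix several PMFs into one.

    meta: list of (pmf, weight) pairs, where each pmf is a dict
          mapping a value to its probability and the weights sum to 1.
    """
    mixture = {}
    for pmf, weight in meta:
        for value, prob in pmf.items():
            # each component contributes in proportion to its weight
            mixture[value] = mixture.get(value, 0.0) + weight * prob
    return mixture


# Toy predictive distributions of y (in seconds) for two values of lam
pmf_low = {60: 0.5, 120: 0.5}
pmf_high = {120: 0.25, 180: 0.75}

# The weights play the role of the posterior probabilities of lam
mixture = make_mixture([(pmf_low, 0.4), (pmf_high, 0.6)])
```

Because the weights sum to 1 and each component is normalized, the
mixture is normalized as well.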

%For purposes of comparison, I also compute the distribution of
%\py{y} based on a single point estimate of \py{lam}, which is
%the mean of the posterior distribution.

Figure~\ref{fig.redline4} shows the results. The shaded lines
in the background are the distributions of \py{y} for each value
of \py{lam}, with line thickness that represents likelihood.
The dark line is the mixture of these distributions.

In this case we could get a very similar result using a single point
estimate of \py{lam}. So it was not necessary, for practical purposes,
to include the uncertainty of the estimate.

In general, it is important to include variability if the system
response is non-linear; that is, if small changes in the input can
cause big changes in the output. In this case, posterior variability
in \py{lam} is small and the system response is approximately
linear for small perturbations.
\index{non-linear}


\section{Decision analysis}

\begin{figure}
% redline.py
\centerline{\includegraphics[height=2.5in]{figs/redline5.pdf}}
\caption{Probability that wait time exceeds 15 minutes as
a function of the number of passengers on the platform. }
\label{fig.redline5}
\end{figure}

At this point we can use the number of passengers on the platform
to predict the distribution of wait times. Now
let's get to the second part of the question: when should I stop
waiting for the train and go catch a taxi?
\index{decision analysis}

Remember that in the original scenario, I am trying to get to
South Station to catch the commuter rail. Suppose I leave
the office with enough time that I can wait 15 minutes
and still make my connection at South Station.

In that case I would like to know the probability that \py{y} exceeds
15 minutes as a function of \verb"num_passengers". It is easy enough
to use the
analysis from Section~\ref{elapsed} and run it for a range of
\verb"num_passengers".

But there's a problem.
The analysis is sensitive to the frequency of long delays, and
because long delays are rare, it is hard to estimate
their frequency.

I only have data from one week,
and the longest delay I observed was 15 minutes. So I can't
estimate the frequency of longer delays accurately.

However, I can use previous observations to make at least a coarse
estimate. When I commuted by Red Line for a year, I saw three long
delays caused by a signaling problem, a power outage, and ``police
activity'' at another stop. So I estimate that there are about
3 major delays per year.

But remember that my observations are biased. I am more likely
to observe long delays because they affect a large number
of passengers. So we should treat my observations as a sample
of \py{zb} rather than \py{z}. Here's how we can do that.
\index{observer bias}

During my year of commuting, I took the Red Line home about 220
times. So I take the observed gap times, \verb"gap_times",
generate a sample of 220 gaps, and compute their Pmf:

\begin{code}
n = 220
cdf_z = thinkbayes.MakeCdfFromList(gap_times)
sample_z = cdf_z.Sample(n)
pmf_z = thinkbayes.MakePmfFromList(sample_z)
\end{code}

Next I bias \verb"pmf_z" to get the distribution of
\py{zb}, draw a sample, and then add in delays of
30, 40, and 50 minutes (expressed in seconds):

\begin{code}
cdf_zp = BiasPmf(pmf_z).MakeCdf()
sample_zb = cdf_zp.Sample(n) + [1800, 2400, 3000]
\end{code}

\py{Cdf.Sample} is more efficient than \py{Pmf.Sample}, so it
is usually faster to convert a Pmf to a Cdf before sampling.

Next I use the sample of \py{zb} to estimate a Pdf using
KDE, and then convert the Pdf to a Pmf:

\begin{code}
pdf_zb = thinkbayes.EstimatedPdf(sample_zb)
xs = MakeRange(low=60)
pmf_zb = pdf_zb.MakePmf(xs)
\end{code}

Finally I unbias the distribution of \py{zb} to get the
distribution of \py{z}, which I use to create the
\py{WaitTimeCalculator}:

\begin{code}
pmf_z = UnbiasPmf(pmf_zb)
wtc = WaitTimeCalculator(pmf_z)
\end{code}
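
The biasing and unbiasing steps can be sketched with plain
dictionaries. This is a simplified stand-in for \py{BiasPmf} and
\py{UnbiasPmf}, assuming that the chance of observing a gap is
proportional to its length; the names \verb"bias_pmf" and
\verb"unbias_pmf" and the toy gap times are hypothetical.

```python
def normalize(pmf):
    """Return a copy of pmf (a dict value -> weight) scaled to sum to 1."""
    total = sum(pmf.values())
    return {x: p / total for x, p in pmf.items()}


def bias_pmf(pmf):
    """Weight each gap x by x: longer gaps are more likely to be observed."""
    return normalize({x: p * x for x, p in pmf.items()})


def unbias_pmf(pmf):
    """Invert bias_pmf by weighting each gap x by 1/x."""
    return normalize({x: p / x for x, p in pmf.items()})


# Toy distribution of gaps between trains, in seconds
pmf_z = {300: 0.5, 600: 0.5}

pmf_zb = bias_pmf(pmf_z)        # gaps as seen by a random passenger
pmf_back = unbias_pmf(pmf_zb)   # recovers the original distribution
```

In this toy case the longer gap is twice as likely to be observed as
the shorter one, and unbiasing restores the equal weights.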

This process is complicated, but
all of the steps are operations we have seen before.
Now we are ready to compute the probability of a long wait.

\begin{code}
def ProbLongWait(num_passengers, minutes):
    ete = ElapsedTimeEstimator(wtc, lam, num_passengers)
    cdf_y = ete.pmf_y.MakeCdf()
    prob = 1 - cdf_y.Prob(minutes * 60)
    return prob
\end{code}

Given the number of passengers on the platform,
\py{ProbLongWait}
makes an \py{ElapsedTimeEstimator},
extracts the distribution of wait time, and
computes
the probability that wait time
exceeds \py{minutes}.

Figure~\ref{fig.redline5} shows the result. When the number of
passengers is less than 20, we infer that the system is
operating normally, so the probability of a long delay is small.
If there are 30 passengers, we estimate that it has been 15
minutes since the last train; that's longer than a normal delay,
so we infer that something is wrong and expect longer delays.

If we are willing to accept a 10\% chance of missing the connection
at South Station, we should stay and wait as long as there
are fewer than 30 passengers, and take a taxi if there are more.

Or, to take this analysis one step further, we could quantify the cost
of missing the connection and the cost of taking a taxi, then choose
the threshold that minimizes expected cost.
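
One way to make that last step concrete, as a sketch only: write
$p(n)$ for the probability of missing the connection when $n$
passengers are on the platform, $C_{\mathrm{miss}}$ for the cost of
missing it, and $C_{\mathrm{taxi}}$ for the taxi fare. These symbols
are introduced here for illustration. Assuming that a taxi always
makes the connection and that waiting costs nothing unless the
connection is missed, the expected costs are
\[
E[\mathrm{cost} \mid \mathrm{wait}] = p(n) \, C_{\mathrm{miss}},
\qquad
E[\mathrm{cost} \mid \mathrm{taxi}] = C_{\mathrm{taxi}},
\]
so taking a taxi is the better choice when
\[
p(n) > \frac{C_{\mathrm{taxi}}}{C_{\mathrm{miss}}}.
\]
Under these assumptions, accepting a 10\% chance of missing the
connection corresponds to valuing a missed connection at ten times
the taxi fare.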

\section{Discussion}

The analysis so far has been based on the assumption that the
arrival rate of passengers is the same every day. For a commuter
train during rush hour, that might not be a bad assumption, but
there are some obvious exceptions. For example, if there is a special
event nearby, a large number of people might arrive at the same time.
In that case, the estimate of \py{lam} would be too low, so the
estimates of \py{x} and \py{y} would be too high.

If special events are as common as major delays, it would
be important to include them in the model. We could do that by
extending the distribution of \py{lam} to include occasional
large values.

We started with the assumption that we know the
distribution of \py{z}.
As an alternative, a passenger could estimate \py{z}, but it would
not be easy.
As a passenger, you
observe only your own wait time, \py{y}. Unless you skip
the first train and wait for the second, you don't
observe the gap between trains, \py{z}.

However, we could make some inferences about \py{zb}. If we note
the number of passengers waiting when we arrive, we can estimate
the elapsed time since the last train, \py{x}. Then we observe
\py{y}. If we add the posterior distribution of \py{x} to
the observed \py{y}, we get a distribution that represents
our posterior belief about the observed value of \py{zb}.

We can use this distribution to update our beliefs about the
distribution of \py{zb}. Finally, we can compute the
inverse of \py{BiasPmf} to get from the distribution of \py{zb}
to the distribution of \py{z}.

I leave this analysis as an exercise for the
reader. One suggestion: you should read Chapter~\ref{species} first.
You can find the outline of
a solution in \url{http://thinkbayes.com/redline.py}.
For more information
see Section~\ref{download}.

\section{Exercises}

\begin{exercise}
This exercise is from
MacKay, {\em Information Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
Unstable particles are emitted from a source and decay at a
distance $x$, a real number that has an exponential probability
distribution with [parameter] $\lam$. Decay events can only be
observed if they occur in a window extending from $x=1$ cm to $x=20$
cm. $N$ decays are observed at locations $\{ 1.5, 2, 3, 4, 5, 12 \}$
cm. What is the posterior distribution of $\lam$?

\end{quote}

You can download a solution to this exercise from
\url{http://thinkbayes.com/decay.py}.

\end{exercise}



\chapter{Hypothesis Testing}
\label{hypotest}

\section{Back to the Euro problem}

In Section~\ref{euro} I presented a problem from MacKay's {\it Information
Theory, Inference, and Learning Algorithms}:
\index{MacKay, David}

\begin{quote}
A statistical statement appeared in ``The Guardian'' on Friday January 4, 2002:

\begin{quote}
When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. `It looks very suspicious
to me,' said Barry Blight, a statistics lecturer at the London
School of Economics. `If the coin were unbiased, the chance of
getting a result as extreme as that would be less than 7\%.'
\end{quote}

But do these data give evidence that the coin is biased rather than fair?
\end{quote}

We estimated the probability that the coin would
land face up, but we didn't really answer MacKay's question:
Do the data give evidence that the coin is biased?
\index{Euro problem}
\index{evidence}

In Chapter~\ref{more} I proposed that data are in favor of
a hypothesis if the data are more likely under the hypothesis than
under the alternative or, equivalently, if the Bayes factor is greater
than 1.
\index{hypothesis testing}
\index{Bayes factor}

In the Euro example, we have two hypotheses to consider: I'll use
$F$ for the hypothesis that the coin is fair and $B$ for the hypothesis
that it is biased.
\index{fair coin}
\index{biased coin}

If the coin is fair, it is easy to compute the likelihood of the
data, \p{D|F}. In fact, we already wrote the function
that does it.

\begin{code}
    def Likelihood(self, data, hypo):
        x = hypo / 100.0
        heads, tails = data
        like = x**heads * (1-x)**tails
        return like
\end{code}

To use it we can
create a \py{Euro} suite and invoke
\py{Likelihood}:

\begin{code}
suite = Euro()
likelihood = suite.Likelihood(data, 50)
\end{code}

\p{D|F} is $5.5 \cdot 10^{-76}$, which doesn't tell us much except
that the probability of seeing any particular dataset is very small.
It takes two likelihoods to make a ratio, so we also have to
compute \p{D|B}.

It is not obvious how to compute the likelihood of $B$, because
it's not obvious what ``biased'' means.

One possibility is to cheat and look at the data before we define
the hypothesis. In that case we would say that ``biased'' means that
the probability of heads is 140/250.

\begin{code}
actual_percent = 100.0 * 140 / 250
likelihood = suite.Likelihood(data, actual_percent)
\end{code}

This version of $B$ I call \verb"B_cheat"; the likelihood of
\verb"B_cheat" is $34 \cdot 10^{-76}$ and the likelihood ratio is
6.1. So we would say that the data are evidence in favor of this
version of $B$.
\index{evidence}

But using the data to formulate the hypothesis
is obviously bogus. By that definition, any dataset would
be evidence in favor of $B$, unless the observed percentage of heads
is exactly 50\%.
\index{bogus}

\section{Making a fair comparison}
\label{suitelike}

To make a legitimate comparison, we have to define $B$ without looking
at the data. So let's try a different definition. If you inspect
a Belgian Euro coin, you might notice that the ``heads'' side is more
prominent than the ``tails'' side. You might expect the shape to
have some effect on
$x$, but be unsure whether it makes heads more or less
likely. So you might say ``I think the coin is biased so that
$x$ is either 0.6 or 0.4, but I am not sure which.''

We can think of this version, which I'll call \verb"B_two",
as a hypothesis made up of two
sub-hypotheses. We can compute the likelihood for each
sub-hypothesis and then compute the average likelihood.

\begin{code}
like40 = suite.Likelihood(data, 40)
like60 = suite.Likelihood(data, 60)
likelihood = 0.5 * like40 + 0.5 * like60
\end{code}

The likelihood ratio (or Bayes factor) for \verb"B_two" is 1.3, which
means the data provide weak evidence in favor of \verb"B_two".
\index{evidence}
\index{likelihood ratio}
\index{Bayes factor}
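
These ratios can be checked without \py{thinkbayes.py}. The sketch
below recomputes the likelihoods for $F$, \verb"B_cheat", and
\verb"B_two" directly from the expression used in \py{Likelihood};
the function \verb"likelihood" and the variable names are assumptions
made for this example.

```python
def likelihood(heads, tails, percent):
    """Likelihood of the data if the probability of heads is percent/100."""
    x = percent / 100.0
    return x**heads * (1 - x)**tails


heads, tails = 140, 110

like_fair = likelihood(heads, tails, 50)

# B_cheat: the hypothesis is chosen after looking at the data
like_cheat = likelihood(heads, tails, 100.0 * heads / (heads + tails))

# B_two: equal weight on x = 0.4 and x = 0.6
like_two = (0.5 * likelihood(heads, tails, 40) +
            0.5 * likelihood(heads, tails, 60))

ratio_cheat = like_cheat / like_fair   # about 6.1
ratio_two = like_two / like_fair       # about 1.3
```

Almost all of the likelihood of \verb"B_two" comes from the
sub-hypothesis $x = 0.6$; the contribution from $x = 0.4$ is smaller
by several orders of magnitude.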

More generally, suppose you suspect that the coin is biased, but you
have no clue about the value of $x$. In that case you might build a
Suite, which I call \verb"b_uniform", to represent sub-hypotheses from
0 to 100.

\begin{code}
b_uniform = Euro(range(0, 101))
b_uniform.Remove(50)
b_uniform.Normalize()
\end{code}

I initialize \verb"b_uniform" with values from 0 to 100, then
remove the sub-hypothesis that $x$ is 50\%, because if
$x$ is 50\% the coin is fair. In practice, removing it has almost no
effect on the result.

To compute the likelihood of
\verb"b_uniform" we compute the likelihood of each sub-hypothesis
and accumulate a weighted average.

\begin{code}
def SuiteLikelihood(suite, data):
    total = 0
    for hypo, prob in suite.Items():
        like = suite.Likelihood(data, hypo)
        total += prob * like
    return total
\end{code}

The likelihood ratio for \verb"b_uniform" is 0.47, which means
that the data are weak evidence against \verb"b_uniform",
compared to $F$.
\index{likelihood}
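
The same computation can be reproduced with a plain list of
hypotheses. This sketch is a stand-in for \verb"SuiteLikelihood" that
does not depend on \py{thinkbayes.py}; the helper names are
assumptions made for this example.

```python
def likelihood(heads, tails, percent):
    """Likelihood of the data if the probability of heads is percent/100."""
    x = percent / 100.0
    return x**heads * (1 - x)**tails


def suite_likelihood(hypos, heads, tails):
    """Average likelihood over equally weighted sub-hypotheses."""
    prob = 1.0 / len(hypos)   # uniform prior over the sub-hypotheses
    return sum(prob * likelihood(heads, tails, h) for h in hypos)


heads, tails = 140, 110

# sub-hypotheses 0, 1, ..., 100 percent, with the fair coin removed
hypos = [h for h in range(101) if h != 50]

like_uniform = suite_likelihood(hypos, heads, tails)
like_fair = likelihood(heads, tails, 50)
ratio = like_uniform / like_fair   # about 0.47
```

The uniform suite spreads its prior over many values of $x$ that
explain the data poorly, which is why its average likelihood falls
below the likelihood of the fair coin.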

If you think about the computation performed by
\verb"SuiteLikelihood", you might notice that it is similar to an
update. To refresh your memory, here's the \py{Update} function:

\begin{code}
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        return self.Normalize()
\end{code}

And here's \py{Normalize}:

\begin{code}
    def Normalize(self):
        total = self.Total()

        factor = 1.0 / total
        for x in self.d:
            self.d[x] *= factor

        return total
\end{code}

The return value from \py{Normalize} is the total of the
probabilities in the Suite, which is the average of the likelihoods
for the sub-hypotheses, weighted by the prior probabilities. And {\tt
Update} passes this value along, so instead of using {\tt
SuiteLikelihood}, we could compute the likelihood of
\verb"b_uniform" like this:

\begin{code}
likelihood = b_uniform.Update(data)
\end{code}
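
To see why the return value of \py{Update} is the likelihood of the
suite, write $H_i$ for the sub-hypotheses and $\mathrm{p}(H_i)$ for
their prior probabilities (notation introduced here for
illustration). After the loop in \py{Update}, the unnormalized value
stored for $H_i$ is $\mathrm{p}(H_i) \, \mathrm{p}(D \mid H_i)$, so the
total that \py{Normalize} computes and returns is
\[
\sum_i \mathrm{p}(H_i) \, \mathrm{p}(D \mid H_i) = \mathrm{p}(D \mid B),
\]
which is the same weighted average that \verb"SuiteLikelihood"
accumulates. This works only if the suite is normalized before the
update, which is why \verb"b_uniform.Normalize()" is called after
the hypothesis 50 is removed.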



\section{The triangle prior}

In Chapter~\ref{more} we also considered a triangle-shaped prior that
gives higher probability to values of $x$ near 50\%. If we think of
this prior as a suite of sub-hypotheses, we can compute its likelihood
like this:
\index{triangle distribution}

\begin{code}
b_triangle = TrianglePrior()
likelihood = b_triangle.Update(data)
\end{code}

The likelihood ratio for \verb"b_triangle" is 0.84, compared to $F$, so
again we would say that the data are weak evidence against $B$.
\index{evidence}

The following table shows the priors we have considered, the
likelihood of each, and the likelihood ratio (or Bayes factor)
relative to $F$.
\index{likelihood ratio}
\index{Bayes factor}

\begin{tabular}{|l|r|r|}
\hline
Hypothesis & Likelihood & Bayes \\
 & $\times 10^{-76}$ & Factor \\
\hline
$F$ & 5.5 & -- \\
\verb"B_cheat" & 34 & 6.1 \\
\verb"B_two" & 7.4 & 1.3 \\
\verb"B_uniform" & 2.6 & 0.47 \\
\verb"B_triangle" & 4.6 & 0.84 \\
\hline
\end{tabular}

Depending on which definition we choose, the data might provide
evidence for or against the hypothesis that the coin is biased, but
in either case it is relatively weak evidence.

In summary, we can use Bayesian hypothesis testing to compare the
likelihood of $F$ and $B$, but we have to do some work to specify
precisely what $B$ means. This specification depends on background
information about coins and their behavior when spun, so people
could reasonably disagree about the right definition.

My presentation of this example follows
David MacKay's discussion, and comes to the same conclusion.
You can download the code I used in this chapter from
\url{http://thinkbayes.com/euro3.py}.
For more information
see Section~\ref{download}.

\section{Discussion}

The Bayes factor for \verb"B_uniform" is 0.47, which means
that the data provide evidence against this hypothesis, compared
to $F$. In the previous section I characterized this evidence
as ``weak,'' but didn't say why.
\index{evidence}

Part of the answer is historical. Harold Jeffreys, an early
proponent of Bayesian statistics, suggested a scale for
interpreting Bayes factors:

\begin{tabular}{|l|l|}
\hline
Bayes & Strength \\
Factor & \\
\hline
1 -- 3 & Barely worth mentioning \\
3 -- 10 & Substantial \\
10 -- 30 & Strong \\
30 -- 100 & Very strong \\
$>$ 100 & Decisive \\
\hline
\end{tabular}

In the example, the Bayes factor is 0.47 in favor of \verb"B_uniform",
so it is 2.1 in favor of $F$, which Jeffreys would consider ``barely
worth mentioning.'' Other authors have suggested variations on the
wording. To avoid arguing about adjectives, we could think about odds
instead.

If your prior odds are 1:1, and you see evidence with Bayes
factor 2, your posterior odds are 2:1. In terms of probability,
the data changed your degree of belief from 50\% to about 67\%. For
most real world problems, that change would be small relative
to modeling errors and other sources of uncertainty.

On the other hand, if you had seen evidence with Bayes
factor 100, your posterior odds would be 100:1, which corresponds
to a probability of more than 99\%.
Whether or not you agree that such evidence is ``decisive,''
it is certainly strong.
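
The conversion between odds and probability used here can be written
out explicitly. If the prior odds are $o$ and the Bayes factor is $K$
(symbols chosen here for illustration), the posterior odds are $oK$
and the corresponding probability is
\[
\frac{oK}{1 + oK}.
\]
With $o = 1$, a Bayes factor of 2 gives $2/3 \approx 0.67$, and a
Bayes factor of 100 gives $100/101 \approx 0.99$.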

%TODO: postpone this section
\section{The beta distribution}
\label{beta}

\index{beta distribution}
There is one more optimization that solves this problem
even faster.

So far we have used a Pmf object to represent a discrete set of
values for \py{x}. Now we will use a continuous
distribution, specifically the beta distribution (see
\url{http://en.wikipedia.org/wiki/Beta_distribution}).
\index{continuous distribution}

The beta distribution is defined on the interval from 0 to 1
(including both), so it is a natural choice for describing
proportions and probabilities. But wait, it gets better.

%TODO: explain the binomial distribution in the previous section

It turns out that if you do a Bayesian update with a binomial
likelihood function, which is what we did in the previous section, the beta
distribution is a {\bf conjugate prior}. That means that if the prior
distribution for \py{x} is a beta distribution, the posterior is also
a beta distribution. But wait, it gets even better.
\index{binomial likelihood function}
\index{conjugate prior}

The shape of the beta distribution depends on two parameters, written
$\alpha$ and $\beta$, or \py{alpha} and \py{beta}. If the prior
is a beta distribution with parameters \py{alpha} and \py{beta}, and
we see data with \py{h} heads and \py{t} tails, the posterior is a
beta distribution with parameters \py{alpha+h} and \py{beta+t}. In
other words, we can do an update with two additions.
\index{parameter}
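
A short derivation shows where the two additions come from. Up to a
normalizing constant, the beta prior is proportional to
$x^{\alpha-1} (1-x)^{\beta-1}$, and the likelihood of $h$ heads and
$t$ tails is proportional to $x^{h} (1-x)^{t}$, so the posterior is
proportional to
\[
x^{\alpha-1} (1-x)^{\beta-1} \cdot x^{h} (1-x)^{t}
= x^{\alpha+h-1} (1-x)^{\beta+t-1},
\]
which has the form of a beta distribution with parameters
$\alpha+h$ and $\beta+t$.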

So that's great, but it only works if we can find a beta distribution
that is a good choice for a prior. Fortunately, for many realistic
priors there is a beta distribution that is at least a good
approximation, and for a uniform prior there is a perfect match. The
beta distribution with \py{alpha=1} and \py{beta=1} is uniform from
0 to 1.

Let's see how we can take advantage of all this.
\py{thinkbayes.py} provides
a class that represents a beta distribution:
\index{Beta object}

\begin{code}
class Beta(object):

    def __init__(self, alpha=1, beta=1):
        self.alpha = alpha
        self.beta = beta
\end{code}

By default \verb"__init__" makes a uniform distribution.
\py{Update} performs a Bayesian update:

\begin{code}
    def Update(self, data):
        heads, tails = data
        self.alpha += heads
        self.beta += tails
\end{code}

\py{data} is a pair of integers representing the number of
heads and tails.

So we have yet another way to solve the Euro problem:

\begin{code}
beta = thinkbayes.Beta()
beta.Update((140, 110))
print(beta.Mean())
\end{code}

\py{Beta} provides \py{Mean}, which
computes a simple function of \py{alpha}
and \py{beta}:

\begin{code}
    def Mean(self):
        return float(self.alpha) / (self.alpha + self.beta)
\end{code}

For the Euro problem the posterior mean is 56\%, which is the
same result we got using Pmfs.

\py{Beta} also provides \py{EvalPdf}, which evaluates
the probability density
function (PDF) of the beta distribution, up to a normalizing constant:
\index{probability density function}
\index{PDF}

\begin{code}
    def EvalPdf(self, x):
        return x**(self.alpha-1) * (1-x)**(self.beta-1)
\end{code}

Finally, \py{Beta} provides \py{MakePmf}, which
uses \py{EvalPdf} to generate a discrete approximation
of the beta distribution.
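
A discrete approximation of this kind can be sketched without
\py{thinkbayes.py}. The code below is an illustration under
assumptions: the function name \verb"make_beta_pmf", the default of
101 grid points, and the use of a plain dictionary are choices made
for this example and may differ from \py{Beta.MakePmf}.

```python
def eval_pdf(x, alpha, beta):
    """Unnormalized beta density at x."""
    return x**(alpha - 1) * (1 - x)**(beta - 1)


def make_beta_pmf(alpha, beta, steps=101):
    """Discrete approximation of a beta distribution on an even grid in [0, 1]."""
    xs = [i / (steps - 1.0) for i in range(steps)]
    probs = [eval_pdf(x, alpha, beta) for x in xs]
    total = sum(probs)
    # dividing by the total makes the probabilities sum to 1
    return {x: p / total for x, p in zip(xs, probs)}


# Posterior for the Euro problem: uniform prior plus 140 heads, 110 tails
alpha, beta = 1 + 140, 1 + 110
pmf = make_beta_pmf(alpha, beta)

pmf_mean = sum(x * p for x, p in pmf.items())
exact_mean = float(alpha) / (alpha + beta)
```

Because the probabilities are normalized after evaluation, the
missing normalizing constant in the density does not affect the
result, and the mean of the discrete approximation is close to the
exact posterior mean.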

%This expression might look familiar. Here's {\tt
% thinkbayes.EvalBinomialPmf}

%\begin{code}
%def EvalBinomialPmf(x, yes, no):
% return x**yes * (1-x)**no
%\end{code}

%It's the same function, but in \py{EvalPdf}, we think of \py{x} as a
%random variable and \py{alpha} and \py{beta} as parameters; in {\tt
% EvalBinomialPmf}, \py{x} is the parameter, and \py{yes} and {\tt
% no} are random variables. Distributions like these that share the
%same PDF are called {\bf conjugate distributions}.
%\index{conjugate distribution}


\section{Exercises}

%TODO: Revisit the Poincare problem; how much evidence would
% Poincare have at the end of the year to distinguish between
% N(1000, sigma) and Max_4 N(950, sigma2)?

\begin{exercise}
Some people believe in the existence of extra-sensory
perception (ESP); for example, the ability of some people to guess
the value of an unseen playing card with probability better
than chance.
\index{ESP}
\index{extra-sensory perception}

What is your prior degree of belief in this kind of ESP?
Do you think it is as likely to exist as not? Or are you
more skeptical about it? Write down your prior odds.

Now compute the strength of the evidence it would take to
convince you that ESP is at least 50\% likely to exist.
What Bayes factor would be needed to make you 90\% sure
that ESP exists?

%TODO: figure out where to talk about Cromwell's rule
Also, notice that in a Bayesian update, we multiply
each prior probability by a likelihood, so if \p{H} is 0,
\p{H|D} is also 0, regardless of $D$. In the Euro problem,
if you are convinced that \py{x} is less than 50\%, and you assign
probability 0 to all other hypotheses, no amount of data will
convince you otherwise.
\index{Euro problem}

This observation is the basis of {\bf Cromwell's rule}, which is the
recommendation that you should avoid giving a prior probability of
0 to any hypothesis that is even remotely possible
(see \url{http://en.wikipedia.org/wiki/Cromwell's_rule}).
\index{Cromwell's rule}

Cromwell's rule is named after Oliver Cromwell, who wrote, ``I beseech
you, in the bowels of Christ, think it possible that you may be
mistaken.'' For Bayesians, this turns out to be good advice (even if
it's a little overwrought).
\index{Cromwell, Oliver}
\end{exercise}


\begin{exercise}
Suppose that your answer to the previous question is 1000;
that is, evidence with Bayes factor 1000 in favor of ESP would
be sufficient to change your mind.

Now suppose that you read a paper in a respectable peer-reviewed
scientific journal that presents evidence with Bayes factor 1000 in
favor of ESP. Would that change your mind?

If not, how do you resolve the apparent contradiction?
You might find it helpful to read about David Hume's article, ``Of
Miracles,'' at \url{http://en.wikipedia.org/wiki/Of_Miracles}.
\index{Hume, David}

\end{exercise}



\chapter{Evidence}
\label{evidence}

%TODO: Make this chapter about dynamic testing; check if it is
% optimal to choose questions where the respondent has a 50/50
% chance.

\section{Interpreting SAT scores}

Suppose you are the Dean of Admission at a small engineering
college in Massachusetts, and you are considering two candidates,
Alice and Bob, whose qualifications are similar in many ways,
with the exception that Alice got a higher score on the Math
portion of the SAT, a standardized test intended to measure
preparation for college-level work in mathematics.
\index{SAT}
\index{standardized test}

If Alice got 780 and Bob got 740 (out of a possible 800), you might
want to know whether that difference is evidence that Alice is better
prepared than Bob, and what the strength of that evidence is.
\index{evidence}

Now in reality, both scores are very good, and both
candidates are probably well prepared for college math. So
the real Dean of Admission would probably suggest that we choose
the candidate who best demonstrates the other skills and
attitudes we look for in students. But as an example of
Bayesian hypothesis testing, let's stick with a narrower question:
``How strong is the evidence that Alice is better prepared
than Bob?''

To answer that question, we need to make some modeling decisions.
I'll start with a simplification I know is wrong; then we'll come back
and improve the model. I pretend, temporarily, that
all SAT questions are equally difficult. Actually, the designers of
the SAT choose questions with a range of difficulty, because that
improves the ability to measure statistical differences between
test-takers.
\index{modeling}

But if we choose a model where all questions are equally difficult, we
can define a characteristic, \verb"p_correct", for each test-taker,
which is the probability of answering any question correctly. This
simplification makes it easy to compute the likelihood of a given
score.


\section{The scale}

In order to understand SAT scores, we have to understand the scoring
and scaling process. Each test-taker gets a raw score based on the
number of correct and incorrect questions. The raw score is converted
to a scaled score in the range 200--800.
\index{scaled score}

In 2009, there were 54 questions on the math SAT. The raw score
for each test-taker is the number of questions answered correctly
minus a penalty of $1/4$ point for each question answered incorrectly.
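
In symbols, if a test-taker answers $c$ questions correctly and $w$
incorrectly (letters chosen here for illustration), the raw score is
\[
\mathrm{raw} = c - \frac{w}{4}.
\]
For example, 50 correct answers and 4 incorrect answers, with no
questions skipped, would give a raw score of $50 - 4/4 = 49$.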

The College Board, which administers the SAT, publishes the
map from raw scores to scaled scores. I have downloaded that
data and wrapped it in an Interpolator object that provides a forward
lookup (from raw score to scaled) and a reverse lookup (from scaled
score to raw).
\index{College Board}

You can download the code for this example from
\url{http://thinkbayes.com/sat.py}.
For more information
see Section~\ref{download}.
8015
8016
\section{The prior}
8017
8018
The College Board also publishes the distribution of scaled scores
8019
for all test-takers. If we convert each scaled score to a raw score,
8020
and divide by the number of questions, the result is an estimate
8021
of \verb"p_correct".
8022
So we can use the distribution of raw scores to model the
8023
prior distribution of \verb"p_correct".
8024
8025
Here is the code that reads and processes the data:
8026
8027
\begin{code}
8028
class Exam(object):
8029
8030
def __init__(self):
8031
self.scale = ReadScale()
8032
scores = ReadRanks()
8033
score_pmf = thinkbayes.MakePmfFromDict(dict(scores))
8034
self.raw = self.ReverseScale(score_pmf)
8035
self.max_score = max(self.raw.Values())
8036
self.prior = DivideValues(self.raw, self.max_score)
8037
\end{code}

\py{Exam} encapsulates the information we have about the exam.
\py{ReadScale} and \py{ReadRanks} read files and return
objects that contain the data:
\py{self.scale} is the \py{Interpolator} that converts
from raw to scaled scores and back; \py{scores} is a list
of (score, frequency) pairs.

\verb"score_pmf" is the Pmf of
scaled scores. \py{self.raw} is the Pmf of raw scores, and
\py{self.prior} is the Pmf of \verb"p_correct".

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_prior.pdf}}
\caption{Prior distribution of \py{p_correct} for SAT test-takers.}
\label{fig.satprior}
\end{figure}

Figure~\ref{fig.satprior} shows the prior distribution of
\verb"p_correct". This distribution is approximately Gaussian, but it
is compressed at the extremes. By design, the SAT has the most power
to discriminate between test-takers within two standard deviations of
the mean, and less power outside that range.
\index{Gaussian distribution}

For each test-taker, I define a Suite called \py{Sat} that
represents the distribution of \verb"p_correct". Here's the definition:

\begin{code}
class Sat(thinkbayes.Suite):

    def __init__(self, exam, score):
        thinkbayes.Suite.__init__(self)

        self.exam = exam
        self.score = score

        # start with the prior distribution
        for p_correct, prob in exam.prior.Items():
            self.Set(p_correct, prob)

        # update based on an exam score
        self.Update(score)
\end{code}

\verb"__init__" takes an Exam object and a scaled score. It makes a
copy of the prior distribution and then updates itself based on the
exam score.

As usual, we inherit \py{Update} from \py{Suite} and provide
\py{Likelihood}:

\begin{code}
    def Likelihood(self, data, hypo):
        p_correct = hypo
        score = data

        k = self.exam.Reverse(score)
        n = self.exam.max_score
        like = thinkbayes.EvalBinomialPmf(k, n, p_correct)
        return like
\end{code}

\py{hypo} is a hypothetical
value of \verb"p_correct", and \py{data} is a scaled score.

To keep things simple, I interpret the raw score as the number of
correct answers, ignoring the penalty for wrong answers. With
this simplification, the likelihood is given by the binomial
distribution, which computes the probability of $k$ correct
responses out of $n$ questions.
\index{binomial distribution}
\index{raw score}
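To make the binomial likelihood concrete, here is a minimal, self-contained sketch.
It does not call \verb"thinkbayes.EvalBinomialPmf"; the helper \verb"binomial_pmf"
and the exam size (54 questions, 50 answered correctly) are assumptions chosen
only for illustration.

```python
from math import factorial


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    coeff = factorial(n) // (factorial(k) * factorial(n - k))
    return coeff * p**k * (1 - p)**(n - k)


# Hypothetical exam: n questions, k answered correctly (illustrative values).
n, k = 54, 50

# Likelihood of the same raw score under three hypothetical values of p_correct.
likelihoods = {p: binomial_pmf(k, n, p) for p in [0.80, 0.90, 0.95]}
for p_correct, like in sorted(likelihoods.items()):
    print(p_correct, like)
```

Each hypothesis about \verb"p_correct" assigns a different probability to the same
raw score, and those probabilities are what \py{Update} multiplies into the prior.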


\section{Posterior}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_p_corr.pdf}}
\caption{Posterior distributions of \py{p_correct} for Alice and Bob.}
\label{fig.satposterior1}
\end{figure}

Figure~\ref{fig.satposterior1} shows the posterior distributions
of \verb"p_correct" for Alice and Bob based on their exam scores.
We can see that they overlap, so it is possible that \verb"p_correct"
is actually higher for Bob, but it seems unlikely.

Which brings us back to the original question, ``How strong is the
evidence that Alice is better prepared than Bob?'' We can use the
posterior distributions of \verb"p_correct" to answer this question.

To formulate the question in terms of Bayesian hypothesis testing,
I define two hypotheses:

\begin{itemize}

\item $A$: \verb"p_correct" is higher for Alice than for Bob.

\item $B$: \verb"p_correct" is higher for Bob than for Alice.

\end{itemize}

To compute the likelihood of $A$, we can enumerate all pairs of values
from the posterior distributions and add up the total probability of
the cases where \verb"p_correct" is higher for Alice than for Bob.
And we already have a function, \verb"thinkbayes.PmfProbGreater",
that does that.

So we can define a Suite that computes the posterior probabilities
of $A$ and $B$:

\begin{code}
class TopLevel(thinkbayes.Suite):

    def Update(self, data):
        a_sat, b_sat = data

        a_like = thinkbayes.PmfProbGreater(a_sat, b_sat)
        b_like = thinkbayes.PmfProbLess(a_sat, b_sat)
        c_like = thinkbayes.PmfProbEqual(a_sat, b_sat)

        a_like += c_like / 2
        b_like += c_like / 2

        self.Mult('A', a_like)
        self.Mult('B', b_like)

        self.Normalize()
\end{code}

Usually when we define a new Suite, we inherit \py{Update}
and provide \py{Likelihood}. In this case I override \py{Update},
because it is easier to evaluate the likelihood of both
hypotheses at the same time.

The data passed to \py{Update} are Sat objects that represent
the posterior distributions of \verb"p_correct".

\verb"a_like" is the total probability that
\verb"p_correct" is higher for Alice; \verb"b_like" is the
probability that it is higher for Bob.

\verb"c_like" is the probability that they are ``equal,'' but this
equality is an artifact of the decision to model \verb"p_correct" with
a set of discrete values. If we use more values, \verb"c_like"
is smaller, and in the extreme, if \verb"p_correct" is
continuous, \verb"c_like" is zero. So I treat \verb"c_like" as
a kind of round-off error and split it evenly between \verb"a_like"
and \verb"b_like".
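The pairwise enumeration can be sketched without \py{thinkbayes} by representing
each Pmf as a plain dictionary. The helper names and the two small distributions
below are made up for illustration; they are not the SAT posteriors.

```python
def prob_greater(pmf1, pmf2):
    """Probability that a draw from pmf1 exceeds an independent draw from pmf2."""
    return sum(p1 * p2
               for v1, p1 in pmf1.items()
               for v2, p2 in pmf2.items()
               if v1 > v2)


def prob_equal(pmf1, pmf2):
    """Probability that independent draws from the two Pmfs coincide."""
    return sum(p1 * pmf2.get(v1, 0) for v1, p1 in pmf1.items())


# Made-up posteriors over three values of p_correct.
alice = {0.8: 0.2, 0.9: 0.5, 1.0: 0.3}
bob = {0.8: 0.5, 0.9: 0.4, 1.0: 0.1}

a_like = prob_greater(alice, bob)
b_like = prob_greater(bob, alice)
c_like = prob_equal(alice, bob)

# Split the ties evenly between the two hypotheses.
a_like += c_like / 2
b_like += c_like / 2
print(a_like, b_like)
```

With only three values the ties carry a lot of mass; with a finer grid of values
the share that has to be split shrinks.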

Here is the code that creates \py{TopLevel} and updates it:

\begin{code}
exam = Exam()
a_sat = Sat(exam, 780)
b_sat = Sat(exam, 740)

top = TopLevel('AB')
top.Update((a_sat, b_sat))
top.Print()
\end{code}

The likelihood of $A$ is 0.79 and the likelihood of $B$ is 0.21. The
likelihood ratio (or Bayes factor) is 3.8, which means that these test
scores are evidence that Alice is better than Bob at answering SAT
questions. If we believed, before seeing the test scores, that $A$
and $B$ were equally likely, then after seeing the scores we should
believe that the probability of $A$ is 79\%, which means there is
still a 21\% chance that Bob is actually better prepared.
\index{likelihood ratio}
\index{Bayes factor}
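The arithmetic behind these numbers is short enough to check by hand. The sketch
below uses the two likelihoods reported in the text; the helper
\verb"posterior_prob" and the second, skeptical prior are assumptions added for
illustration.

```python
a_like, b_like = 0.79, 0.21      # likelihoods reported in the text

bayes_factor = a_like / b_like   # likelihood ratio in favor of A


def posterior_prob(prior_odds, factor):
    """Posterior probability of A, given prior odds for A and a Bayes factor."""
    posterior_odds = prior_odds * factor
    return posterior_odds / (1 + posterior_odds)


# Equal prior odds reproduce the probability quoted in the text.
even_prior = posterior_prob(1.0, bayes_factor)

# A hypothetical skeptic who starts at 1:2 against A ends up less convinced.
skeptical_prior = posterior_prob(0.5, bayes_factor)

print(bayes_factor, even_prior, skeptical_prior)
```

The Bayes factor summarizes the evidence on its own; the posterior probability
also depends on where we started.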


\section{A better model}

Remember that the analysis we have done so far is based on
the simplification that all SAT questions are equally difficult.
In reality, some are easier than others, which means that the
difference between Alice and Bob might be even smaller.

But how big is the modeling error? If it is small, we conclude
that the first model---based on the simplification that all questions
are equally difficult---is good enough. If it's large,
we need a better model.
\index{modeling error}

In the next few sections, I develop a better model and
discover (spoiler alert!) that the modeling error is small. So if
you are satisfied with the simple model, you can skip to the next
chapter. If you want to see how the more realistic model works,
read on...

\begin{itemize}

\item Assume that each test-taker has some
degree of \py{efficacy}, which measures their
ability to answer SAT questions.
\index{efficacy}

\item Assume that each question has some level of
\py{difficulty}.

\item Finally, assume that the chance that a test-taker answers a
question correctly is related to \py{efficacy} and \py{difficulty}
according to this function:

\begin{code}
def ProbCorrect(efficacy, difficulty, a=1):
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))
\end{code}

\end{itemize}

This function is a simplified version of the curve used in {\bf item
response theory}, which you can read about at
\url{http://en.wikipedia.org/wiki/Item_response_theory}. {\tt
efficacy} and \py{difficulty} are considered to be on the same
scale, and the probability of getting a question right depends only on
the difference between them.
\index{item response theory}

When \py{efficacy} and \py{difficulty} are equal, the
probability of getting the question right is 50\%. As
\py{efficacy} increases, this probability approaches 100\%.
As it decreases (or as \py{difficulty} increases), the
probability approaches 0\%.

Given the distribution of \py{efficacy} across test-takers
and the distribution of \py{difficulty} across questions, we
can compute the expected distribution of raw scores. We'll do that
in two steps. First, for a person with given \py{efficacy},
we'll compute the distribution of raw scores.

\begin{code}
def PmfCorrect(efficacy, difficulties):
    pmf0 = thinkbayes.Pmf([0])

    ps = [ProbCorrect(efficacy, diff) for diff in difficulties]
    pmfs = [BinaryPmf(p) for p in ps]
    dist = sum(pmfs, pmf0)
    return dist
\end{code}

\py{difficulties} is a list of difficulties, one for each question.
\py{ps} is a list of probabilities, and \py{pmfs} is a list of
two-valued Pmf objects; here's the function that makes them:

\begin{code}
def BinaryPmf(p):
    pmf = thinkbayes.Pmf()
    pmf.Set(1, p)
    pmf.Set(0, 1-p)
    return pmf
\end{code}

\py{dist} is the sum of these Pmfs. Remember from Section~\ref{addends}
that when we add up Pmf objects, the result is the distribution
of the sums. In order to use Python's \py{sum} to add up Pmfs,
we have to provide \py{pmf0}, which is the identity for Pmfs,
so \py{pmf + pmf0} is always \py{pmf}.
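As a rough illustration of what adding Pmfs means here, the following sketch
builds the distribution of a raw score from per-question probabilities using
plain dictionaries instead of \py{thinkbayes.Pmf}. The function names and the
three probabilities are assumptions for the example.

```python
def add_pmfs(pmf1, pmf2):
    """Distribution of the sum of independent draws from two Pmfs."""
    result = {}
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            result[v1 + v2] = result.get(v1 + v2, 0) + p1 * p2
    return result


def binary_pmf(p):
    """Two-valued Pmf: 1 (correct) with probability p, otherwise 0."""
    return {1: p, 0: 1 - p}


def pmf_correct(ps):
    """Distribution of the number of correct answers."""
    dist = {0: 1.0}              # identity: adding it changes nothing
    for p in ps:
        dist = add_pmfs(dist, binary_pmf(p))
    return dist


# Hypothetical three-question exam: easy, medium, and hard.
ps = [0.9, 0.6, 0.3]
dist = pmf_correct(ps)
mean = sum(score * prob for score, prob in dist.items())
print(sorted(dist.items()), mean)
```

Because the questions have different probabilities, the result is not binomial,
but its mean is still the sum of the per-question probabilities.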

If we know a person's efficacy, we can compute their distribution
of raw scores. For a group of people with different efficacies, the
resulting distribution of raw scores is a mixture. Here's the code
that computes the mixture:

\begin{code}
# class Exam:

    def MakeRawScoreDist(self, efficacies):
        pmfs = thinkbayes.Pmf()
        for efficacy, prob in efficacies.Items():
            scores = PmfCorrect(efficacy, self.difficulties)
            pmfs.Set(scores, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return mix
\end{code}

\py{MakeRawScoreDist} takes \py{efficacies}, which is a Pmf that
represents the distribution of efficacy across test-takers. I assume
it is Gaussian with mean 0 and standard deviation 1.5. This
choice is mostly arbitrary. The probability of getting a question
correct depends on the difference between efficacy and difficulty, so
we can choose the units of efficacy and then calibrate the units of
difficulty accordingly. \index{Gaussian distribution}

\py{pmfs} is a meta-Pmf that contains one Pmf for each level of
efficacy, and maps each one to the fraction of test-takers at that
level. {\tt
MakeMixture} takes the meta-Pmf and computes the distribution of the
mixture (see Section~\ref{mixture}). \index{meta-Pmf}
\index{MakeMixture}
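The mixture computation can be sketched in a few lines. Plain dictionaries cannot
serve as dictionary keys, so this version represents the meta-Pmf as a list of
(pmf, weight) pairs; that representation, the function name, and the two groups
of test-takers are assumptions for illustration, not the \py{thinkbayes}
implementation.

```python
def make_mixture(meta):
    """Weighted mixture of Pmfs; meta is a list of (pmf, weight) pairs."""
    mix = {}
    for pmf, weight in meta:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0) + weight * prob
    return mix


# Raw-score distributions on a hypothetical two-question exam.
strong = {0: 0.04, 1: 0.32, 2: 0.64}
weak = {0: 0.36, 1: 0.48, 2: 0.16}

# Assume a quarter of test-takers are in the strong group.
mix = make_mixture([(strong, 0.25), (weak, 0.75)])
print(sorted(mix.items()))
```

Each score's probability in the mixture is the average of its probability in each
group, weighted by the size of the group.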


\section{Calibration}

If we were given the distribution of difficulty, we could use
\verb"MakeRawScoreDist" to compute the distribution of raw scores.
But for us the problem is the other way around: we are given the
distribution of raw scores and we want to infer the distribution of
difficulty.

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_calibrate.pdf}}
\caption{Actual distribution of raw scores and a model to fit it.}
\label{fig.satcalibrate}
\end{figure}

I assume that the distribution of difficulty is uniform with
parameters \py{center} and \py{width}. \py{MakeDifficulties}
makes a list of difficulties with these parameters.
\index{numpy}

\begin{code}
def MakeDifficulties(center, width, n):
    low, high = center-width, center+width
    return numpy.linspace(low, high, n)
\end{code}

By trying out a few combinations, I found that
\py{center=-0.05} and \py{width=1.8} yield a distribution
of raw scores similar to the actual data, as shown in
Figure~\ref{fig.satcalibrate}.
\index{calibration}

So, assuming that the distribution of difficulty is uniform,
its range is approximately
\py{-1.85} to \py{1.75}, given that
efficacy is Gaussian with mean 0 and standard deviation 1.5.
\index{Gaussian distribution}

The following table shows the range of \py{ProbCorrect} for
test-takers at different levels of efficacy:

\begin{tabular}{|r|r|r|r|}
\hline
 & \multicolumn{3}{|c|}{Difficulty} \\
\hline
Efficacy & $-1.85$ & $-0.05$ & 1.75 \\
\hline
3.00 & 0.99 & 0.95 & 0.78 \\
1.50 & 0.97 & 0.82 & 0.44 \\
0.00 & 0.86 & 0.51 & 0.15 \\
$-1.50$ & 0.59 & 0.19 & 0.04 \\
$-3.00$ & 0.24 & 0.05 & 0.01 \\
\hline
\end{tabular}

Someone with efficacy 3 (two standard deviations above
the mean) has a 99\% chance of correctly answering the easiest
questions on the exam, and a 78\% chance of correctly answering the
hardest. On the other end of the range, someone two standard
deviations below the mean has only a 24\% chance of correctly
answering the easiest questions.
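The entries in the table follow directly from the logistic function. Here is a
small standalone check that recomputes them; it assumes the default slope
\py{a=1} and uses lower-case names to avoid suggesting it is the code in
\py{sat.py}.

```python
import math


def prob_correct(efficacy, difficulty, a=1):
    """Logistic probability of a correct answer."""
    return 1 / (1 + math.exp(-a * (efficacy - difficulty)))


difficulties = [-1.85, -0.05, 1.75]   # easiest, middle, hardest
efficacies = [3.0, 1.5, 0.0, -1.5, -3.0]

# One row per efficacy level, rounded the same way as the table.
table = {eff: [round(prob_correct(eff, d), 2) for d in difficulties]
         for eff in efficacies}

for eff in efficacies:
    print(eff, table[eff])
```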


\section{Posterior distribution of efficacy}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_posteriors_eff.pdf}}
\caption{Posterior distributions of efficacy for Alice and Bob.}
\label{fig.satposterior2}
\end{figure}

Now that the model is calibrated, we can compute the posterior
distribution of efficacy for Alice and Bob. Here is a version of the
Sat class that uses the new model:

\begin{code}
class Sat2(thinkbayes.Suite):

    def __init__(self, exam, score):
        self.exam = exam
        self.score = score

        # start with the Gaussian prior
        efficacies = thinkbayes.MakeGaussianPmf(0, 1.5, 3)
        thinkbayes.Suite.__init__(self, efficacies)

        # update based on an exam score
        self.Update(score)
\end{code}

\verb"Update" invokes
\verb"Likelihood", which computes the likelihood of a given test score
for a hypothetical level of efficacy.

\begin{code}
    def Likelihood(self, data, hypo):
        efficacy = hypo
        score = data
        raw = self.exam.Reverse(score)

        pmf = self.exam.PmfCorrect(efficacy)
        like = pmf.Prob(raw)
        return like
\end{code}

\py{pmf} is the distribution of raw scores for a test-taker
with the given efficacy; \py{like} is the probability of
the observed score.

Figure~\ref{fig.satposterior2} shows the posterior distributions
of efficacy for Alice and Bob. As expected, the location
of Alice's distribution is farther to the right, but again there
is some overlap.

Using \py{TopLevel} again, we compare $A$, the
hypothesis that Alice's efficacy is higher, and $B$, the
hypothesis that Bob's is higher. The likelihood ratio is
3.4, a bit smaller than what we got from the simple model (3.8).
So this model indicates that the data are evidence in favor
of $A$, but a little weaker than the previous estimate.

If our prior belief is that $A$ and $B$ are equally likely,
then in light of this evidence we would give $A$ a posterior
probability of 77\%, leaving a 23\% chance that Bob's efficacy
is higher.


\section{Predictive distribution}

The analysis we have done so far generates estimates for
Alice and Bob's efficacy, but since efficacy is not directly
observable, it is hard to validate the results.
\index{predictive distribution}

To give the model predictive power, we can use it to answer
a related question: ``If Alice and Bob take the math SAT
again, what is the chance that Alice will do better again?''

We'll answer this question in two steps:

\begin{itemize}

\item We'll use the posterior distribution of efficacy to
generate a predictive distribution of raw score for each test-taker.

\item We'll compare the two predictive distributions to compute
the probability that Alice gets a higher score again.

\end{itemize}

We already have most of the code we need. To compute
the predictive distributions, we can use \verb"MakeRawScoreDist" again,
passing it the posterior distributions of efficacy:

\begin{code}
exam = Exam()
a_sat = Sat2(exam, 780)
b_sat = Sat2(exam, 740)

a_pred = exam.MakeRawScoreDist(a_sat)
b_pred = exam.MakeRawScoreDist(b_sat)
\end{code}

Then we can find the probability that Alice does better on the second
test, Bob does better, or they tie:

\begin{code}
a_like = thinkbayes.PmfProbGreater(a_pred, b_pred)
b_like = thinkbayes.PmfProbLess(a_pred, b_pred)
c_like = thinkbayes.PmfProbEqual(a_pred, b_pred)
\end{code}

The probability that Alice does better on the second exam is 63\%,
which means that Bob has a 37\% chance of doing as well or better.

Notice that we have more confidence about Alice's efficacy than we do
about the outcome of the next test. The posterior odds are roughly 3:1
that Alice's efficacy is higher, but only about 2:1 that Alice will do
better on the next exam.


\section{Discussion}

\begin{figure}
% sat.py
\centerline{\includegraphics[height=2.5in]{figs/sat_joint.pdf}}
\caption{Joint posterior distribution of \py{p_correct} for Alice and Bob.}
\label{fig.satjoint}
\end{figure}

We started this chapter with the question,
``How strong is the evidence that Alice is better prepared
than Bob?'' On the face of it, that sounds like we want to
test two hypotheses: either Alice is more prepared or Bob is.

But in order to compute likelihoods for these hypotheses, we
have to solve an estimation problem. For each test-taker
we have to find the posterior distribution of either
\verb"p_correct" or \verb"efficacy".

Values like this are called {\bf nuisance parameters} because
we don't care what they are, but we have
to estimate them to answer the question we care about.
\index{nuisance parameter}

One way to visualize the analysis we did in this chapter is
to plot the space of these parameters. \verb"thinkbayes.MakeJoint"
takes two Pmfs, computes their joint distribution, and returns
a joint Pmf that maps each possible pair of values to its probability.

\begin{code}
def MakeJoint(pmf1, pmf2):
    joint = Joint()
    for v1, p1 in pmf1.Items():
        for v2, p2 in pmf2.Items():
            joint.Set((v1, v2), p1 * p2)
    return joint
\end{code}

This function assumes that the two distributions are independent.
\index{joint distribution}
\index{independence}

Figure~\ref{fig.satjoint} shows the joint posterior distribution of
\verb"p_correct" for Alice and Bob. The diagonal line indicates the
part of the space where \verb"p_correct" is the same for Alice and
Bob. To the right of this line, Alice is more prepared; to the left,
Bob is more prepared.

In \py{TopLevel.Update}, when we compute the likelihoods of $A$ and
$B$, we add up the probability mass on each side of this line. For the
cells that fall on the line, we add up the total mass and split it
between $A$ and $B$.

The process we used in this chapter---estimating nuisance
parameters in order to evaluate the likelihood of competing
hypotheses---is a common Bayesian approach to problems like this.



\chapter{Simulation}

In this chapter I describe my solution to a problem posed
by a patient with a kidney tumor. I think the problem is
important and relevant to patients with these tumors
and doctors treating them.

And I think the solution is interesting because, although it
is a Bayesian approach to the problem, the use of Bayes's theorem
is implicit. I present the solution and my code; at the end
of the chapter I will explain the Bayesian part.

If you want more technical detail than I present here, you can
read my paper on this work at \url{http://arxiv.org/abs/1203.6890}.


\section{The Kidney Tumor problem}

\index{Kidney tumor problem}
\index{Reddit}
I am a frequent reader and occasional contributor to the online statistics
forum at \url{http://reddit.com/r/statistics}. In November 2011, I read
the following message:

\begin{quote}
``I have Stage IV Kidney Cancer and am trying to determine if the
cancer formed before I retired from the military. ... Given the
dates of retirement and detection is it possible to determine when
there was a 50/50 chance that I developed the disease? Is it
possible to determine the probability on the retirement date? My
tumor was 15.5 cm x 15 cm at detection. Grade II.''
\end{quote}

I contacted the author of the message and got more information; I learned
that veterans get different benefits if it is ``more likely than not''
that a tumor formed while they were in military service (among other
considerations).

Because renal tumors grow slowly, and often do not cause symptoms,
they are sometimes left untreated. As a result, doctors can observe
the rate of growth for untreated tumors by comparing scans from the
same patient at different times. Several papers have reported these
growth rates.

I collected data from a paper by Zhang et al\footnote{Zhang et al,
Distribution of Renal Tumor Growth Rates Determined by Using Serial
Volumetric CT Measurements, January 2009 {\it Radiology}, 250,
137--144.}. I contacted the authors to see if I could get raw data,
but they refused on grounds of medical privacy. Nevertheless, I was
able to extract the data I needed by printing one of their graphs and
measuring it with a ruler.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney2.pdf}}
\caption{CDF of RDT in doublings per year.}
\label{fig.kidney2}
\end{figure}

They report growth rates in reciprocal doubling time (RDT),
which is in units of doublings per year. So a tumor with $RDT=1$
doubles in volume each year; with $RDT=2$ it quadruples in the same
time, and with $RDT=-1$, it halves. Figure~\ref{fig.kidney2} shows the
distribution of RDT for 53 patients.
\index{doubling time}

The squares are the data points from the paper; the line is a model I
fit to the data. The positive tail fits an exponential distribution
well, so I used a mixture of two exponentials.
\index{exponential distribution}
\index{mixture}


\section{A simple model}

It is usually a good idea to start with a simple model before
trying something more challenging. Sometimes the simple model is
sufficient for the problem at hand, and if not, you can use it
to validate the more complex model.
\index{modeling}

For my simple model, I assume that tumors grow with a constant
doubling time, and that they are three-dimensional in the sense that
if the maximum linear measurement doubles, the volume is multiplied by
eight.

I learned from my correspondent that the time between his discharge
from the military and his diagnosis was 3291 days (about 9 years).
So my first calculation was, ``If this tumor grew at the median
rate, how big would it have been at the date of discharge?''

The median volume doubling time reported by Zhang et al is 811 days.
Assuming 3-dimensional geometry, the doubling time for a linear
measure is three times as long.

\begin{code}
# time between discharge and diagnosis, in days
interval = 3291.0

# doubling time in linear measure is doubling time in volume * 3
dt = 811.0 * 3

# number of doublings since discharge
doublings = interval / dt

# how big was the tumor at time of discharge (diameter in cm)
d1 = 15.5
d0 = d1 / 2.0 ** doublings
\end{code}
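The factor of three follows from the assumed geometry. The short derivation
below is a sketch that introduces its own symbols ($V$ for volume, $d$ for the
linear measure, $T_V$ and $T_d$ for the two doubling times); none of them appear
in the code. If volume scales as the cube of the linear measure and doubles
every $T_V$ days, then

\begin{align*}
V(t) &= V_0 \, 2^{t / T_V}, \qquad V \propto d^3, \\
d(t) &= d_0 \, 2^{t / (3 T_V)} \quad\Longrightarrow\quad T_d = 3 \, T_V .
\end{align*}

With $T_V = 811$ days we get $T_d = 2433$ days, so the 3291-day interval
corresponds to $3291 / 2433 \approx 1.35$ linear doublings, and
$15.5 / 2^{1.35} \approx 6.1$~cm.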

You can download the code in this chapter from
\url{http://thinkbayes.com/kidney.py}. For more information
see Section~\ref{download}.

The result, \py{d0}, is about 6 cm. So if this tumor formed after
the date of discharge, it must have grown substantially faster than
the median rate. Therefore I concluded that it is ``more likely than
not'' that this tumor formed before the date of discharge.

In addition, I computed the growth rate that would be implied
if this tumor had formed after the date of discharge. If we
assume an initial size of 0.1 cm, we can compute the number of
doublings to get to a final size of 15.5 cm:

\begin{code}
from math import log2

# assume an initial linear measure of 0.1 cm
d0 = 0.1
d1 = 15.5

# how many doublings would it take to get from d0 to d1
doublings = log2(d1 / d0)

# what linear doubling time does that imply?
dt = interval / doublings

# compute the volumetric doubling time and RDT
vdt = dt / 3
rdt = 365 / vdt
\end{code}

\py{dt} is the linear doubling time, \py{vdt} is the volumetric
doubling time, and \py{rdt} is the reciprocal doubling
time.

The number of doublings, in linear measure, is 7.3, which implies
an RDT of 2.4. In the data from Zhang et al, only 20\% of tumors
grew this fast during a period of observation. So again,
I concluded that it is ``more likely than not'' that the tumor
formed prior to the date of discharge.

These calculations are sufficient to answer the question as
posed, and on behalf of my correspondent, I wrote a letter explaining
my conclusions to the Veterans' Benefit Administration.
\index{Veterans' Benefit Administration}

Later I told a friend, who is an oncologist, about my results. He was
surprised by the growth rates observed by Zhang et al, and by what
they imply about the ages of these tumors. He suggested that the
results might be interesting to researchers and doctors.

But in order to make them useful, I wanted a more general model
of the relationship between age and size.


\section{A more general model}

Given the size of a tumor at time of diagnosis, it would be most
useful to know the probability that the tumor formed before
any given date; in other words, the distribution of ages.
\index{modeling}
\index{simulation}

To find it, I run simulations of tumor growth to get the
distribution of size conditioned on age. Then we can use
a Bayesian approach to get the
distribution of age conditioned on size.
\index{conditional distribution}

The simulation starts with a small tumor and runs these steps:

\begin{enumerate}

\item Choose a growth rate from the distribution of RDT.

\item Compute the size of the tumor at the end of an interval.

\item Record the size of the tumor at each interval.

\item Repeat until the tumor exceeds the maximum relevant size.

\end{enumerate}

For the initial size I chose 0.3 cm, because carcinomas smaller than
that are less likely to be invasive and less likely to have the blood
supply needed for rapid growth (see
\url{http://en.wikipedia.org/wiki/Carcinoma_in_situ}).
\index{carcinoma}

I chose an interval of 245 days (about 8 months) because that is the
median time between measurements in the data source.

For the maximum size I chose 20 cm. In the data source, the range of
observed sizes is 1.0 to 12.0 cm, so we are extrapolating beyond
the observed range at each end, but not by far, and not in a way
likely to have a strong effect on the results.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney4.pdf}}
\caption{Simulations of tumor growth, size vs. time.}
\label{fig.kidney4}
\end{figure}

The simulation is based on one big simplification:
the growth rate is chosen independently during each interval,
so it does not depend on age, size, or growth rate during
previous intervals.
\index{independence}

In Section~\ref{serial} I review these assumptions and
consider more detailed models. But first let's look at some
examples.

Figure~\ref{fig.kidney4} shows
the size of simulated tumors as a function of
age. The dashed line at 10 cm shows the range of ages for tumors at
that size: the fastest-growing tumor gets there in 8 years; the
slowest takes more than 35.

I am presenting results in terms of linear measurements, but the
calculations are in terms of volume. To convert from one to the
other, again, I use the volume of a sphere with the given
diameter.
\index{volume}
\index{sphere}


\section{Implementation}

Here is the kernel of the simulation:
\index{simulation}

\begin{code}
def MakeSequence(rdt_seq, v0=0.01, interval=0.67, vmax=Volume(20.0)):
    seq = v0,
    age = 0

    for rdt in rdt_seq:
        age += interval
        final, seq = ExtendSequence(age, seq, rdt, interval)
        if final > vmax:
            break

    return seq
\end{code}

\verb"rdt_seq" is an iterator that yields
random values from the CDF of growth rate.
\py{v0} is the initial volume in mL. \py{interval} is the time step
in years. \py{vmax} is the final volume corresponding to a linear
measurement of 20 cm.
\index{iterator}

\py{Volume} converts from linear measurement in cm to volume
in mL, based on the simplification that the tumor is a sphere:

\begin{code}
def Volume(diameter, factor=4*math.pi/3):
    return factor * (diameter/2.0)**3
\end{code}

\py{ExtendSequence} computes the volume of the tumor at the
end of the interval.

\begin{code}
def ExtendSequence(age, seq, rdt, interval):
    initial = seq[-1]
    doublings = rdt * interval
    final = initial * 2**doublings
    new_seq = seq + (final,)
    cache.Add(age, new_seq, rdt)

    return final, new_seq
\end{code}

\py{age} is the age of the tumor at the end of the interval.
\py{seq} is a tuple that contains the volumes so far. \py{rdt} is
the growth rate during the interval, in doublings per year.
\py{interval} is the size of the time step in years.

The return values are \py{final}, the volume of the
tumor at the end of the interval, and \verb"new_seq", a new
tuple containing the volumes in \py{seq} plus the new volume
\py{final}.

\py{Cache.Add} records the age and size of each tumor at the end
of each interval, as explained in the next section.
\index{cache}
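To see the kernel run end to end, here is a standalone sketch that grows one
simulated tumor. It leaves out the cache, and it draws growth rates from a
simple exponential distribution with mean 0.5 doublings per year; that
generator (\verb"hypothetical_rdts") is a stand-in I made up for the example,
not the mixture model fitted to the data.

```python
import math
import random


def volume(diameter):
    """Volume in mL of a sphere with the given diameter in cm."""
    return 4 * math.pi / 3 * (diameter / 2.0) ** 3


def simulate(rdt_seq, v0=0.01, interval=0.67, vmax=volume(20.0)):
    """Grow one tumor; return the tuple of volumes after each interval."""
    seq = (v0,)
    for rdt in rdt_seq:
        final = seq[-1] * 2 ** (rdt * interval)   # rdt is doublings per year
        seq = seq + (final,)
        if final > vmax:
            break
    return seq


def hypothetical_rdts(rng, mean=0.5):
    """Stand-in growth-rate generator, NOT the fitted distribution of RDT."""
    while True:
        yield rng.expovariate(1.0 / mean)


rng = random.Random(17)
seq = simulate(hypothetical_rdts(rng))
age = (len(seq) - 1) * 0.67    # years needed to exceed 20 cm
print(len(seq), age)
```

Running this many times with different random streams gives the family of growth
curves, one per simulated tumor.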


\section{Caching the joint distribution}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney8.pdf}}
\caption{Joint distribution of age and tumor size.}
\label{fig.kidney8}
\end{figure}

Here's how the cache works.

\begin{code}
class Cache(object):

    def __init__(self):
        self.joint = thinkbayes.Joint()
\end{code}

\py{joint} is a joint Pmf that records the
frequency of each age-size pair, so it approximates the
joint distribution of age and size.
\index{joint distribution}

At the end of each simulated interval, \py{ExtendSequence} calls
\py{Add}:

\begin{code}
# class Cache

    def Add(self, age, seq, rdt):
        # rdt is passed by ExtendSequence but is not used here
        final = seq[-1]
        cm = Diameter(final)
        bucket = round(CmToBucket(cm))
        self.joint.Incr((age, bucket))
\end{code}

Again, \py{age} is the age of the tumor, and \py{seq} is the
sequence of volumes so far.

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney6.pdf}}
\caption{Distributions of age, conditioned on size.}
\label{fig.kidney6}
\end{figure}

Before adding the new data to the joint distribution, we use {\tt
Diameter} to convert from volume to diameter in centimeters:

\begin{code}
def Diameter(volume, factor=3/math.pi/4, exp=1/3.0):
    return 2 * (factor * volume) ** exp
\end{code}

Then we use
\py{CmToBucket} to convert from centimeters to a discrete bucket
number:

\begin{code}
def CmToBucket(x, factor=10):
    return factor * math.log(x)
\end{code}

The buckets are equally spaced on a log scale. Using \py{factor=10}
yields a reasonable number of buckets; for example,
1 cm maps to bucket 0 and 10 cm maps to bucket 23.
\index{log scale}
\index{bucket}
8944
8945
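As a quick check on the two conversions, the following sketch restates
them in standalone form and confirms the bucket numbers mentioned above.
It is an illustration written for Python 3, not part of \py{kidney.py};
the lowercase function names and the inverse helper
\verb"bucket_to_cm" are assumptions introduced here.

\begin{code}
import math

def diameter(volume, factor=3 / math.pi / 4, exp=1 / 3.0):
    """Diameter in cm of a sphere with the given volume in cubic cm."""
    return 2 * (factor * volume) ** exp

def cm_to_bucket(x, factor=10):
    """Map a size in cm to a position on a log scale."""
    return factor * math.log(x)

def bucket_to_cm(y, factor=10):
    """Hypothetical inverse of cm_to_bucket."""
    return math.exp(y / factor)

# a sphere with radius 5 cm has diameter 10 cm
volume = 4 / 3 * math.pi * 5**3
cm = diameter(volume)

bucket_1cm = round(cm_to_bucket(1))
bucket_10cm = round(cm_to_bucket(cm))
print(cm, bucket_1cm, bucket_10cm)
\end{code}

With \py{factor=10}, each bucket spans a factor of
$e^{0.1} \approx 1.105$ in diameter, so the size resolution of the
cache is roughly 10\%.
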
After running the simulations, we can plot the joint distribution
as a pseudocolor plot, where each cell represents the number of
tumors observed at a given size-age pair.
Figure~\ref{fig.kidney8} shows the joint distribution after 1000
simulations.
\index{pseudocolor plot}


\section{Conditional distributions}

\begin{figure}
% kidney.py
\centerline{\includegraphics[height=2.5in]{figs/kidney7.pdf}}
\caption{Percentiles of tumor age as a function of size.}
\label{fig.kidney7}
\end{figure}

By taking a vertical slice from the joint distribution, we can get the
distribution of sizes for any given age. By taking a horizontal
slice, we can get the distribution of ages conditioned on size.
\index{conditional distribution}

Here's the code that reads the joint distribution and builds
the conditional distribution for a given size.
\index{joint distribution}

\begin{code}
# class Cache

    def ConditionalCdf(self, bucket):
        pmf = self.joint.Conditional(0, 1, bucket)
        cdf = pmf.MakeCdf()
        return cdf
\end{code}

\verb"bucket" is the integer bucket number corresponding to
tumor size. \py{Joint.Conditional} computes the
PMF of age conditioned on \py{bucket}.
The result is the CDF of age conditioned on \py{bucket}.

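To make the slicing concrete, here is a minimal sketch that uses a
\py{Counter} in place of \py{thinkbayes.Joint}. The function names and
the toy counts are assumptions for illustration, and the code is
written for Python 3; it is not how \py{Joint.Conditional} is
implemented.

\begin{code}
from collections import Counter

def conditional_pmf(joint, bucket):
    """Distribution of age given bucket, from counts of (age, bucket)."""
    counts = Counter()
    for (age, b), count in joint.items():
        if b == bucket:
            counts[age] += count
    total = sum(counts.values())
    return {age: count / total for age, count in sorted(counts.items())}

def make_cdf(pmf):
    """List of (value, cumulative probability) pairs, sorted by value."""
    cdf = []
    total = 0.0
    for value, prob in sorted(pmf.items()):
        total += prob
        cdf.append((value, total))
    return cdf

# made-up counts of (age, bucket) pairs
joint = Counter({(1, 5): 2, (2, 5): 1, (3, 5): 1, (2, 6): 4})
pmf = conditional_pmf(joint, 5)
cdf = make_cdf(pmf)
print(pmf)
print(cdf)
\end{code}

Only the pairs with the requested bucket contribute, and dividing by
their total count turns the slice into a distribution. This sketch
does not handle a bucket that never appears in the cache.
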
Figure~\ref{fig.kidney6} shows several of these CDFs, for
a range of sizes. To summarize these distributions, we can
compute percentiles as a function of size.
\index{percentile}

\begin{code}
percentiles = [95, 75, 50, 25, 5]

for bucket in cache.GetBuckets():
    cdf = cache.ConditionalCdf(bucket)
    ps = [cdf.Percentile(p) for p in percentiles]
\end{code}

Figure~\ref{fig.kidney7} shows these percentiles for each
size bucket. The data points are computed from the estimated
joint distribution. In the model, size and time are discrete,
which contributes numerical errors, so I also show a least
squares fit for each sequence of percentiles.
\index{least squares fit}


\section{Serial correlation}
\label{serial}

The results so far are based on a number of modeling decisions;
let's review them and consider which ones are the most
likely sources of error:
\index{modeling error}

\begin{itemize}

\item To convert from linear measure to volume, we assume that
tumors are approximately spherical. This assumption is probably
fine for tumors up to a few centimeters, but not for very
large tumors.
\index{sphere}

\item The distribution of growth rates in the simulations is based on
a continuous model we chose to fit the data reported by Zhang et al.,
which is based on 53 patients. The fit is only approximate and, more
importantly, a larger sample would yield a
different distribution.
\index{growth rate}

\item The growth model does not take into account tumor subtype or
grade; this assumption is consistent with the conclusion of Zhang et al.:
``Growth rates in renal tumors of different sizes, subtypes and
grades represent a wide range and overlap substantially.''
But with a larger sample, a difference might become apparent.
\index{tumor type}

\item The distribution of growth rate does not depend on the size of
the tumor. This assumption would not be realistic for very
small and very large tumors, whose growth is limited by blood supply.

But tumors observed by Zhang et al.\ ranged from 1 to 12 cm, and they
found no statistically significant relationship between
size and growth rate. So if there is a relationship, it is
likely to be weak, at least in this size range.

\item In the simulations, growth rate during each interval is
independent of previous growth rates. In reality it is plausible
that tumors that have grown quickly in the past are more likely
to grow quickly. In other words, there is probably
a serial correlation in growth rate.
\index{serial correlation}

\end{itemize}

Of these, the first and last seem the most problematic. I'll
investigate serial correlation first, then come back to
spherical geometry.

To simulate correlated growth, I wrote a generator\footnote{If you are
not familiar with Python generators, see
\url{http://wiki.python.org/moin/Generators}.} that yields a
correlated series from a given Cdf. Here's how the algorithm works:
\index{generator}

\begin{enumerate}

\item Generate correlated values from a Gaussian distribution.
This is easy to do because we can compute the distribution
of the next value conditioned on the previous value.
\index{Gaussian distribution}

\item Transform each value to its cumulative probability using
the Gaussian CDF.
\index{cumulative probability}

\item Transform each cumulative probability to the corresponding value
using the given Cdf.

\end{enumerate}

Here's what that looks like in code:

\begin{code}
def CorrelatedGenerator(cdf, rho):
    x = random.gauss(0, 1)
    yield Transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield Transform(x)
\end{code}

\py{cdf} is the desired Cdf; \py{rho} is the desired correlation.
The values of \py{x} are Gaussian; \py{Transform} converts them
to the desired distribution.

The first value of \py{x} is Gaussian with mean 0 and standard
deviation 1. For subsequent values, the mean and standard deviation
depend on the previous value. Given the previous \py{x}, the mean of the
next value is \py{x * rho}, and the variance is \py{1 - rho**2}.
\index{correlated random value}

\py{Transform} maps from each
Gaussian value, \py{x}, to a value from the given Cdf, \py{y}.
Because \py{Transform} uses \py{cdf}, it has to be defined where
\py{cdf} is in scope, for example inside \py{CorrelatedGenerator}.

\begin{code}
def Transform(x):
    p = thinkbayes.GaussianCdf(x)
    y = cdf.Value(p)
    return y
\end{code}

\py{GaussianCdf} computes the CDF of the standard Gaussian
distribution at \py{x}, returning a cumulative probability.
\py{Cdf.Value} maps from a cumulative probability to the
corresponding value in \py{cdf}.

Depending on the shape of \py{cdf}, information can
be lost in transformation, so the actual correlation might be
lower than \py{rho}. For example, when I generate
10000 values from the distribution of growth rates with
\py{rho=0.4}, the actual correlation is 0.37.
But since we are guessing at the right correlation anyway,
that's close enough.

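The following standalone sketch shows one way to measure that loss of
correlation. It assumes Python 3.8 or later for
\py{statistics.NormalDist}, and it uses an exponential distribution as
a stand-in for the growth-rate Cdf, which is not reproduced here; the
names \verb"correlated_generator", \verb"serial_corr" and
\verb"exponential_inverse_cdf" are hypothetical.

\begin{code}
import itertools
import math
import random
from statistics import NormalDist

def correlated_generator(inverse_cdf, rho):
    """Yield values from a target distribution with serial correlation."""
    std_normal = NormalDist()

    def transform(x):
        p = std_normal.cdf(x)
        p = min(p, 1 - 1e-12)    # keep p strictly below 1
        return inverse_cdf(p)

    x = random.gauss(0, 1)
    yield transform(x)

    sigma = math.sqrt(1 - rho**2)
    while True:
        x = random.gauss(x * rho, sigma)
        yield transform(x)

def serial_corr(seq):
    """Pearson correlation between successive elements of seq."""
    xs, ys = seq[:-1], seq[1:]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx)**2 for x in xs)
    vary = sum((y - my)**2 for y in ys)
    return cov / math.sqrt(varx * vary)

def exponential_inverse_cdf(p, mean=1.0):
    """Inverse CDF of an exponential distribution (stand-in target)."""
    return -mean * math.log(1 - p)

random.seed(17)
gen = correlated_generator(exponential_inverse_cdf, rho=0.4)
seq = list(itertools.islice(gen, 10000))
corr = serial_corr(seq)
print(corr)
\end{code}

With a skewed target like this one, the measured correlation tends to
come out somewhat below \py{rho}, which is the effect described above.
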
Remember that \py{MakeSequence} takes an iterator as an argument.
That interface allows it to work with different generators:
\index{generator}

\begin{code}
iterator = UncorrelatedGenerator(cdf)
seq1 = MakeSequence(iterator)

iterator = CorrelatedGenerator(cdf, rho)
seq2 = MakeSequence(iterator)
\end{code}

In this example, \py{seq1} and \py{seq2} are
drawn from the same distribution, but the values in \py{seq1}
are uncorrelated and the values in \py{seq2} are correlated
with a coefficient of approximately \py{rho}.
\index{serial correlation}

Now we can see what effect serial correlation has on the results;
the following table shows percentiles of age for a 6 cm tumor,
using the uncorrelated generator and a correlated generator
with target $\rho = 0.4$.
\index{percentile}

\begin{table}
\input{tables/kidney_table2}
\caption{Percentiles of tumor age conditioned on size.}
\end{table}

Correlation makes the fastest growing tumors faster and the slowest
slower, so the range of ages is wider. The difference is modest for
low percentiles, but for the 95th percentile it is more than 6 years.
To compute these percentiles precisely, we would need a better
estimate of the actual serial correlation.

However, this model is sufficient to answer the question
we started with: given a tumor with a linear dimension of
15.5 cm, what is the probability that it formed more than
8 years ago?

Here's the code:

\begin{code}
# class Cache

    def ProbOlder(self, cm, age):
        bucket = round(CmToBucket(cm))
        cdf = self.ConditionalCdf(bucket)
        p = cdf.Prob(age)
        return 1-p
\end{code}

\py{cm} is the size of the tumor; \py{age} is the age threshold
in years. \py{ProbOlder} converts size to a bucket number,
rounding it the same way \py{Add} does,
gets the Cdf of age conditioned on bucket, and computes the
probability that age exceeds the given value.

With no serial correlation, the probability that a
15.5 cm tumor is older than 8 years is 0.999, or almost certain.
With correlation 0.4, faster-growing tumors are more likely, but
the probability is still 0.995. Even with correlation 0.8, the
probability is 0.978.

Another likely source of error is the assumption that tumors are
approximately spherical. For a tumor with linear dimensions
$15.5 \times 15$ cm, this assumption is probably not valid. If, as
seems likely, a tumor this size
is relatively flat, it might have the same volume as a 6 cm sphere.
With this smaller volume and correlation 0.8, the probability of age
greater than 8 years is still 95\%.

So even taking into account modeling errors, it is unlikely that such
a large tumor could have formed less than 8 years prior to the date of
diagnosis.
\index{modeling error}


\section{Discussion}

Well, we got through a whole chapter without using Bayes's theorem or
the \py{Suite} class that encapsulates Bayesian updates. What
happened?

One way to think about Bayes's theorem is as an algorithm for
inverting conditional probabilities. Given \p{B|A}, we can compute
\p{A|B}, provided we know \p{A} and \p{B}. Of course this algorithm
is only useful if, for some reason, it is easier to compute \p{B|A}
than \p{A|B}.

In this example, it is. By running simulations, we can estimate the
distribution of size conditioned on age, or \p{size|age}. But it is
harder to get the distribution of age conditioned on size, or
\p{age|size}. So this seems like a perfect opportunity to use Bayes's
theorem.

The reason I didn't is computational efficiency. To estimate
\p{size|age} for any given size, you have to run a lot of simulations.
Along the way, you end up computing \p{size|age} for a lot of sizes.
In fact, you end up computing the entire joint distribution of size
and age, \p{size, age}.
\index{joint distribution}

And once you have the joint distribution, you don't really need
Bayes's theorem; you can extract \p{age|size} by taking slices from
the joint distribution, as demonstrated in \py{ConditionalCdf}.
\index{conditional distribution}

So we side-stepped Bayes, but he was with us in spirit.


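To see that slicing the joint distribution and applying Bayes's
theorem give the same answer, here is a small worked sketch. The joint
probabilities are made-up numbers for illustration, not results from
the simulation, and the function names are hypothetical. The code is
written for Python 3.

\begin{code}
# hypothetical joint distribution p(size, age)
joint = {
    ('small', 1): 0.30, ('small', 2): 0.15, ('small', 3): 0.05,
    ('large', 1): 0.05, ('large', 2): 0.15, ('large', 3): 0.30,
}

def marginal(joint, index):
    """Marginal distribution of the variable at position index."""
    result = {}
    for key, prob in joint.items():
        result[key[index]] = result.get(key[index], 0) + prob
    return result

p_size = marginal(joint, 0)
p_age = marginal(joint, 1)

def age_given_size_by_slicing(size):
    """Take the slice for one size and normalize it."""
    return {age: prob / p_size[size]
            for (s, age), prob in joint.items() if s == size}

def age_given_size_by_bayes(size):
    """p(age|size) = p(size|age) p(age) / p(size)."""
    result = {}
    for age in p_age:
        p_size_given_age = joint[size, age] / p_age[age]
        result[age] = p_size_given_age * p_age[age] / p_size[size]
    return result

sliced = age_given_size_by_slicing('large')
bayes = age_given_size_by_bayes('large')
for age in sorted(sliced):
    print(age, sliced[age], bayes[age])
\end{code}

The two columns agree up to floating-point error, because
\p{size|age} times \p{age} is just the joint probability again.

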
\chapter{A Hierarchical Model}
\label{hierarchical}


\section{The Geiger counter problem}

I got the idea for the following problem from Tom Campbell-Ricketts,
author of the Maximum Entropy blog at
\url{http://maximum-entropy-blog.blogspot.com}. And he got the idea
from E.~T.~Jaynes, author of the classic {\em Probability Theory: The
Logic of Science}:
\index{Jaynes, E.~T.}
\index{Campbell-Ricketts, Tom}
\index{Geiger counter problem}

\begin{quote}
Suppose that a radioactive source emits particles toward
a Geiger counter at an average rate of $r$ particles per second,
but the counter only registers a fraction, $f$, of the particles
that hit it. If $f$ is 10\% and
the counter registers 15 particles in a one second
interval, what is the posterior distribution of $n$, the actual
number of particles that hit the counter, and $r$, the average
rate particles are emitted?
\end{quote}

To get started on a problem like this, think about the chain of
causation that starts with the parameters of the system and ends
with the observed data:
\index{causation}

\begin{enumerate}

\item The source emits particles at an average rate, $r$.

\item During any given second, the source emits $n$ particles
toward the counter.

\item Out of those $n$ particles, some number, $k$, get counted.

\end{enumerate}

The probability that an atom decays is the same at any point in time,
so radioactive decay is well modeled by a Poisson process. Given $r$,
the distribution of $n$ is a Poisson distribution with parameter $r$.
\index{radioactive decay}
\index{Poisson process}

And if we assume that the probability of detection for each particle
is independent of the others, the distribution of $k$ is the binomial
distribution with parameters $n$ and $f$.
\index{binomial distribution}

Given the parameters of the system, we can find the distribution of
the data. So we can solve what is called the {\bf forward problem}.
\index{forward problem}

Now we want to go the other way: given the data, we
want the distribution of the parameters. This is called
the {\bf inverse problem}. And if you can solve the forward
problem, you can use Bayesian methods to solve the inverse problem.
\index{inverse problem}


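The forward problem can be sketched as a simulation. The following
code is an illustration in Python 3 using only the standard library;
the function names are hypothetical, and the Poisson draw is
implemented by counting exponentially distributed gaps between
arrivals, which is one of several ways to do it.

\begin{code}
import random

def sample_poisson(r):
    """Number of arrivals in one second of a Poisson process, rate r."""
    n = 0
    t = random.expovariate(r)
    while t < 1:
        n += 1
        t += random.expovariate(r)
    return n

def sample_binomial(n, f):
    """Number of successes in n independent trials with probability f."""
    return sum(random.random() < f for _ in range(n))

def simulate_count(r, f):
    """One pass down the causal chain: r leads to n, and n leads to k."""
    n = sample_poisson(r)
    k = sample_binomial(n, f)
    return n, k

random.seed(18)
results = [simulate_count(r=150, f=0.1) for _ in range(2000)]
mean_n = sum(n for n, k in results) / len(results)
mean_k = sum(k for n, k in results) / len(results)
print(mean_n, mean_k)
\end{code}

With $r=150$ and $f=0.1$, the average of $n$ should be close to 150
and the average of $k$ close to 15.

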
\section{Start simple}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes1.pdf}}
\caption{Posterior distribution of $n$ for three values of $r$.}
\label{fig.jaynes1}
\end{figure}

Let's start with a simple version of the problem where we know
the value of $r$. We are given the value of $f$, so all we
have to do is estimate $n$.

I define a Suite called \py{Detector} that models the behavior
of the detector and estimates $n$.

\begin{code}
class Detector(thinkbayes.Suite):

    def __init__(self, r, f, high=500, step=1):
        pmf = thinkbayes.MakePoissonPmf(r, high, step=step)
        thinkbayes.Suite.__init__(self, pmf, name=r)
        self.r = r
        self.f = f
\end{code}

If the average emission rate is $r$ particles per second, the
distribution of $n$ is Poisson with parameter $r$.
\py{high} and \py{step} determine the upper bound for $n$
and the step size between hypothetical values.
\index{Poisson distribution}

Now we need a likelihood function:
\index{likelihood}

\begin{code}
# class Detector

    def Likelihood(self, data, hypo):
        k = data
        n = hypo
        p = self.f

        return thinkbayes.EvalBinomialPmf(k, n, p)
\end{code}

\py{data} is the number of particles detected, and \py{hypo} is
the hypothetical number of particles emitted, $n$.

If there are actually $n$ particles, and the probability of detecting
any one of them is $f$, the probability of detecting $k$ particles is
given by the binomial distribution.
\index{binomial distribution}

That's it for the Detector. We can try it out for a range
of values of $r$:

\begin{code}
f = 0.1
k = 15

for r in [100, 250, 400]:
    suite = Detector(r, f, step=1)
    suite.Update(k)
    print suite.MaximumLikelihood()
\end{code}

Figure~\ref{fig.jaynes1} shows the posterior distribution of $n$ for
several given values of $r$.


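For readers who want to check the update without \py{thinkbayes}, here
is a self-contained sketch of the same computation. It assumes Python
3.8 or later for \py{math.comb}; the function names are hypothetical,
and the prior is truncated at \py{high} like the one in \py{Detector}.

\begin{code}
import math

def poisson_pmf(n, r):
    """Probability of n events when the expected number is r."""
    return math.exp(-r + n * math.log(r) - math.lgamma(n + 1))

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with probability p."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_n(r, f, k, high=500):
    """Posterior distribution of n, as a dictionary, for a known r."""
    unnorm = {n: poisson_pmf(n, r) * binomial_pmf(k, n, f)
              for n in range(high + 1)}
    total = sum(unnorm.values())
    return {n: prob / total for n, prob in unnorm.items()}

f = 0.1
k = 15
means = {}
for r in [100, 250, 400]:
    post = posterior_n(r, f, k)
    means[r] = sum(n * prob for n, prob in post.items())
    print(r, means[r])
\end{code}

Under this model the undetected particles, $n-k$, follow a Poisson
distribution with parameter $r(1-f)$, so the posterior means should
come out close to 105, 240, and 375.

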
\section{Make it hierarchical}

In the previous section, we assumed $r$ is known. Now let's
relax that assumption. I define another Suite, called \py{Emitter},
that models the behavior of the emitter and estimates $r$:

\begin{code}
class Emitter(thinkbayes.Suite):

    def __init__(self, rs, f=0.1):
        detectors = [Detector(r, f) for r in rs]
        thinkbayes.Suite.__init__(self, detectors)
\end{code}

\py{rs} is a sequence of hypothetical values for $r$. \py{detectors}
is a sequence of Detector objects, one for each value of $r$. The
values in the Suite are Detectors, so Emitter is a {\bf meta-Suite};
that is, a Suite that contains other Suites as values.
\index{meta-Suite}

To update the Emitter, we have to compute the likelihood of the data
under each hypothetical value of $r$. But each value of $r$ is
represented by a Detector that contains a range of values for $n$.

To compute the likelihood of the data for a given Detector, we loop
through the values of $n$ and add up the total probability of $k$.
That's what \py{SuiteLikelihood} does:

\begin{code}
# class Detector

    def SuiteLikelihood(self, data):
        total = 0
        for hypo, prob in self.Items():
            like = self.Likelihood(data, hypo)
            total += prob * like
        return total
\end{code}

Now we can write the Likelihood function for the Emitter:

\begin{code}
# class Emitter

    def Likelihood(self, data, hypo):
        detector = hypo
        like = detector.SuiteLikelihood(data)
        return like
\end{code}

Each \py{hypo} is a Detector, so we can invoke
\py{SuiteLikelihood} to get the likelihood of the data under
the hypothesis.

After we update the Emitter, we have to update each of the
Detectors, too.

\begin{code}
# class Emitter

    def Update(self, data):
        thinkbayes.Suite.Update(self, data)

        for detector in self.Values():
            detector.Update(data)
\end{code}

A model like this, with multiple levels of Suites, is called {\bf
hierarchical}. \index{hierarchical model}


\section{A little optimization}

You might recognize \py{SuiteLikelihood}; we saw it
in Section~\ref{suitelike}. At the time, I pointed out that
we didn't really need it, because the total probability
computed by \py{SuiteLikelihood} is exactly the normalizing
constant computed and returned by \py{Update}.
\index{normalizing constant}

So instead of updating the Emitter and then updating the
Detectors, we can do both steps at the same time, using
the result from \py{Detector.Update} as the likelihood
of Emitter.

Here's the streamlined version of \py{Emitter.Likelihood}:

\begin{code}
# class Emitter

    def Likelihood(self, data, hypo):
        return hypo.Update(data)
\end{code}

And with this version of \py{Likelihood} we can use the
default version of \py{Update}. So this version has fewer
lines of code, and it runs faster because it does not compute
the normalizing constant twice.
\index{optimization}


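The following sketch illustrates the same idea outside the \py{Suite}
framework: the normalizing constant from each inner update serves as
the likelihood for the corresponding value of $r$. It assumes Python
3.8 or later, a uniform prior over a hypothetical grid of rates, and
hypothetical function names.

\begin{code}
import math

def poisson_pmf(n, r):
    """Probability of n events when the expected number is r."""
    return math.exp(-r + n * math.log(r) - math.lgamma(n + 1))

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials with probability p."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def update_detector(r, f, k, high=1000):
    """Return the posterior of n for one r, and the normalizing constant."""
    unnorm = {n: poisson_pmf(n, r) * binomial_pmf(k, n, f)
              for n in range(high + 1)}
    total = sum(unnorm.values())
    posterior = {n: prob / total for n, prob in unnorm.items()}
    return posterior, total

def update_emitter(rs, f, k):
    """Posterior of r under a uniform prior, reusing each constant."""
    likes = {}
    detectors = {}
    for r in rs:
        detectors[r], likes[r] = update_detector(r, f, k)
    total = sum(likes.values())
    posterior_r = {r: like / total for r, like in likes.items()}
    return posterior_r, detectors

rs = range(10, 501, 10)    # hypothetical grid of emission rates
posterior_r, detectors = update_emitter(rs, f=0.1, k=15)
best_r = max(posterior_r, key=posterior_r.get)
print(best_r, posterior_r[best_r])
\end{code}

Each call to \verb"update_detector" does the work once and returns both
results, so nothing is computed twice. With these assumptions the most
probable rate on the grid is 150, which matches $k/f$.

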
\section{Extracting the posteriors}

\begin{figure}
% jaynes.py
\centerline{\includegraphics[height=2.5in]{figs/jaynes2.pdf}}
\caption{Posterior distributions of $n$ and $r$.}
\label{fig.jaynes2}
\end{figure}

After we update the Emitter, we can get the posterior distribution
of $r$ by looping through the Detectors and their probabilities:

\begin{code}
# class Emitter

    def DistOfR(self):
        items = [(detector.r, prob) for detector, prob in self.Items()]
        return thinkbayes.MakePmfFromItems(items)
\end{code}

\py{items} is a list of values of $r$ and their probabilities.
The result is the Pmf of $r$.

To get the posterior distribution of $n$, we have to compute
the mixture of the Detectors. We can use
\py{thinkbayes.MakeMixture}, which takes a meta-Pmf that maps
from each distribution to its probability. And that's exactly
what the Emitter is:

\begin{code}
# class Emitter

    def DistOfN(self):
        return thinkbayes.MakeMixture(self)
\end{code}

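As a rough sketch of what computing a mixture involves, the following
function combines several distributions, each represented as a
dictionary, weighted by their probabilities. It is an illustration
under those assumptions and not the implementation of
\py{thinkbayes.MakeMixture}; the toy distributions are made up.

\begin{code}
def make_mixture(weighted_pmfs):
    """Mix (pmf, weight) pairs; each pmf maps values to probabilities."""
    mix = {}
    for pmf, weight in weighted_pmfs:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0) + weight * prob
    return mix

# two hypothetical distributions of n and the probability of each
pmf_a = {1: 0.5, 2: 0.5}
pmf_b = {2: 0.25, 3: 0.75}
mix = make_mixture([(pmf_a, 0.4), (pmf_b, 0.6)])
print(mix)
\end{code}

If the weights add up to 1 and each distribution is normalized, the
mixture is normalized too.
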
Figure~\ref{fig.jaynes2} shows the results. Not surprisingly, the
most likely value for $n$ is 150. Given $f$ and $n$, the expected
count is $k = f n$, so given $f$ and $k$, the expected value of $n$ is
$k / f$, which is 150.

And if 150 particles are emitted in one second, the most likely value
of $r$ is 150 particles per second. So the posterior distribution of
$r$ is also centered on 150.

The posterior distributions of $r$ and $n$ are similar;
the only difference is that we are slightly less certain about $n$.
In general, we can be more certain about the long-range emission rate,
$r$, than about the number of particles emitted in any particular second,
$n$.

You can download the code in this chapter from
\url{http://thinkbayes.com/jaynes.py}. For more information see
Section~\ref{download}.


\section{Discussion}

The Geiger counter problem demonstrates the connection between
causation and hierarchical modeling. In the example, the
emission rate $r$ has a causal effect on the number of particles,
$n$, which has a causal effect on the particle count, $k$.
\index{Geiger counter problem}
\index{causation}

The hierarchical model reflects the structure of the
system, with causes at the top and effects at the bottom.
\index{hierarchical model}

\begin{enumerate}

\item At the top level, we start with a range of hypothetical
values for $r$.

\item For each value of $r$, we have a range of values for $n$,
and the prior distribution of $n$ depends on $r$.

\item When we update the model, we go bottom-up. We compute
a posterior distribution of $n$ for each value of $r$, then
compute the posterior distribution of $r$.

\end{enumerate}

So causal information flows down the hierarchy, and inference flows
up.


\section{Exercises}

\begin{exercise}
This exercise is also inspired by an example in Jaynes, {\em
Probability Theory}.

Suppose you buy a mosquito trap that is supposed to reduce the
population of mosquitoes near your house. Each
week, you empty the trap and count the number of mosquitoes
captured. After the first week, you count 30 mosquitoes.
After the second week, you count 20 mosquitoes. Estimate the
percentage change in the number of mosquitoes in your yard.

To answer this question, you have to make some modeling
decisions. Here are some suggestions:

\begin{itemize}

\item Suppose that each week a large number of mosquitoes, $N$, is bred
in a wetland near your home.

\item During the week, some fraction of
them, $f_1$, wander into your yard, and of those some fraction, $f_2$,
are caught in the trap.

\item Your solution should take into account your prior belief
about how much $N$ is likely to change from one week to the next.
You can do that by adding a level to the hierarchy to
model the percent change in $N$.

\end{itemize}

\end{exercise}


\chapter{Dealing with Dimensions}
\label{species}

\section{Belly button bacteria}

Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen
science project with the goal of identifying bacterial species that
can be found in human navels (\url{http://bbdata.yourwildlife.org}).
The project might seem whimsical, but it is part of an increasing
interest in the human microbiome, the set of microorganisms that live
on human skin and parts of the body.
\index{biodiversity}
\index{belly button}
\index{bacteria}
\index{microbiome}

In their pilot study, BBB2 researchers collected swabs from the navels
of 60 volunteers, used multiplex pyrosequencing to extract and sequence
fragments of 16S rDNA, then identified the species or genus the
fragments came from. Each identified fragment is called a ``read.''
\index{navel}
\index{rDNA}
\index{pyrosequencing}

We can use these data to answer several related questions:

\begin{itemize}

\item Based on the number of species observed, can we estimate
the total number of species in the environment?
\index{species}

\item Can we estimate the prevalence of each species; that is, the
fraction of the total population belonging to each species?
\index{prevalence}

\item If we are planning to collect additional samples, can we predict
how many new species we are likely to discover?

\item How many additional reads are needed to increase the
fraction of observed species to a given threshold?

\end{itemize}

These questions make up what is called the {\bf Unseen Species problem}.
\index{Unseen Species problem}


\section{Lions and tigers and bears}

I'll start with a simplified version of the problem where we know that
there are exactly three species. Let's call them lions, tigers and
bears. Suppose we visit a wild animal preserve and see 3 lions, 2
tigers and one bear.
\index{lions and tigers and bears}

If we have an equal chance of observing any animal in the preserve,
the number of each species we see is governed by the multinomial
distribution. If the prevalence of lions and tigers and bears is
\verb"p_lion" and \verb"p_tiger" and \verb"p_bear", the likelihood of
seeing 3 lions, 2 tigers and one bear is proportional to
\index{multinomial distribution}

\begin{code}
p_lion**3 * p_tiger**2 * p_bear**1
\end{code}

An approach that is tempting, but not correct, is to use beta
distributions, as in Section~\ref{beta}, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 ``heads'' and 3 ``tails,'' then the posterior
distribution of \verb"p_lion" is:
\index{beta distribution}

\begin{code}
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
\end{code}

The maximum likelihood estimate for \verb"p_lion" is the observed
rate, 50\%. Similarly the MLEs for \verb"p_tiger" and \verb"p_bear"
are 33\% and 17\%.
\index{maximum likelihood}

But there are two problems:

\begin{enumerate}

\item We have implicitly used a prior for each species that is uniform
from 0 to 1, but since we know that there are three species, that
prior is not correct. The right prior should have a mean of 1/3,
and it should assign zero probability to any species having a
prevalence of 100\%.

\item The distributions for each species are not independent, because
the prevalences have to add up to 1. To capture this dependence, we
need a joint distribution for the three prevalences.
\index{independence}
\index{joint distribution}

\end{enumerate}

We can use a Dirichlet distribution to solve both of these problems
(see \url{http://en.wikipedia.org/wiki/Dirichlet_distribution}). In
the same way we used the beta distribution to describe the
distribution of bias for a coin, we can use a Dirichlet
distribution to describe the joint distribution of \verb"p_lion",
\verb"p_tiger" and \verb"p_bear".
\index{beta distribution}
\index{Dirichlet distribution}

The Dirichlet distribution is the multi-dimensional generalization
of the beta distribution. Instead of two possible outcomes, like
heads and tails, the Dirichlet distribution handles any number of
outcomes: in this example, three species.

If there are \py{n} outcomes, the Dirichlet distribution is
described by \py{n} parameters, written $\alpha_1$ through $\alpha_n$.

Here's the definition, from \py{thinkbayes.py}, of a class that
represents a Dirichlet distribution:
\index{numpy}

\begin{code}
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=int)
\end{code}

\py{n} is the number of dimensions; initially the parameters
are all 1. I use a \py{numpy} array to store the parameters
so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution
for each prevalence is a beta distribution, which we can
compute like this:

\begin{code}
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
\end{code}

\py{i} is the index of the marginal distribution we want.
\py{alpha0} is the sum of the parameters; \py{alpha} is the
parameter for the given species.
\index{marginal distribution}

In the example, the prior marginal distribution for each species
is \py{Beta(1, 2)}. We can compute the prior means like
this:

\begin{code}
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
\end{code}

As expected, the prior mean prevalence for each species is 1/3.

To update the Dirichlet distribution, we add the
observations to the parameters like this:

\begin{code}
    def Update(self, data):
        m = len(data)
        self.params[:m] += data
\end{code}

Here \py{data} is a sequence of counts in the same order as {\tt
params}, so in this example, it should be the number of lions,
tigers and bears.

\py{data} can be shorter than \py{params}; in that
case there are some species that have not been
observed.

Here's code that updates \py{dirichlet} with the observed data and
computes the posterior marginal distributions.

\begin{code}
data = [3, 2, 1]
dirichlet.Update(data)

for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print i, pmf.Mean()
\end{code}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species1.pdf}}
\caption{Distribution of prevalences for three species.}
\label{fig.species1}
\end{figure}

Figure~\ref{fig.species1} shows the results. The posterior
mean prevalences are 44\%, 33\%, and 22\%.


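The arithmetic behind these means is simple enough to check with a few
lines of standalone Python. The sketch below uses a plain list in
place of the \py{Dirichlet} class and exact fractions in place of
floats; the function name is hypothetical.

\begin{code}
from fractions import Fraction

def marginal_beta_params(params, i):
    """Parameters (alpha, beta) of the marginal for species i."""
    alpha0 = sum(params)
    alpha = params[i]
    return alpha, alpha0 - alpha

prior = [1, 1, 1]        # one parameter per species
data = [3, 2, 1]         # lions, tigers, bears
posterior = [a + d for a, d in zip(prior, data)]

means = []
for i in range(3):
    alpha, beta = marginal_beta_params(posterior, i)
    # the mean of a beta distribution is alpha / (alpha + beta)
    means.append(Fraction(alpha, alpha + beta))
print(means)
\end{code}

The results are 4/9, 3/9 and 2/9, which round to the percentages
reported above.

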
\section{The hierarchical version}
9802
9803
We have solved a simplified version of the problem: if we
9804
know how many species there are, we can estimate the prevalence
9805
of each.
9806
\index{prevalence}
9807
9808
Now let's get back to the original problem, estimating the total
9809
number of species. To solve this problem I'll define a meta-Suite,
9810
which is a Suite that contains other Suites as hypotheses. In this
9811
case, the top-level Suite contains hypotheses about the number of
9812
species; the bottom level contains hypotheses about prevalences.
9813
\index{hierarchical model}
9814
\index{meta-Suite}
9815
9816
Here's the class definition:
9817
9818
\begin{code}
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
\end{code}
9825
9826
\verb"__init__" takes a list of possible values for \py{n} and
9827
makes a list of Dirichlet objects.
9828
9829
Here's the code that creates the top-level suite:
9830
9831
\begin{code}
ns = range(3, 30)
suite = Species(ns)
\end{code}
9835
9836
\py{ns} is the list of possible values for \py{n}. We have seen 3
9837
species, so there have to be at least that many. I chose an upper
9838
bound that seems reasonable, but we will check later that the
9839
probability of exceeding this bound is low. And at least initially
9840
we assume that any value in this range is equally likely.
9841
9842
To update a hierarchical model, you have to update all levels.
9843
Usually you have to update the bottom
9844
level first and work up, but in this case we can
9845
update the top level first:
9846
9847
\begin{code}
# class Species

def Update(self, data):
    thinkbayes.Suite.Update(self, data)
    for hypo in self.Values():
        hypo.Update(data)
\end{code}
9855
9856
\py{Species.Update} invokes \py{Update} in the parent class,
9857
then loops through the sub-hypotheses and updates them.
9858
9859
Now all we need is a likelihood function:
9860
9861
\begin{code}
# class Species

def Likelihood(self, data, hypo):
    dirichlet = hypo
    like = 0
    for i in range(1000):
        like += dirichlet.Likelihood(data)

    return like
\end{code}
9872
9873
\py{data} is a sequence of
9874
observed counts; \py{hypo} is a Dirichlet object.
9875
\py{Species.Likelihood} calls
9876
\py{Dirichlet.Likelihood} 1000 times and returns the total.
9877
9878
Why call it 1000 times? Because
{\tt Dirichlet.Likelihood} doesn't actually compute the likelihood of the
data under the whole Dirichlet distribution. Instead, it draws one
sample from the hypothetical distribution and computes the likelihood
of the data under the sampled set of prevalences. Adding up the
results from many samples approximates, up to a constant factor,
the likelihood averaged over the distribution.
9883
9884
Here's what it looks like:
9885
9886
\begin{code}
# class Dirichlet

def Likelihood(self, data):
    m = len(data)
    if self.n < m:
        return 0

    x = data
    p = self.Random()
    q = p[:m]**x
    return q.prod()
\end{code}
9899
9900
The length of \py{data} is the number of species observed. If
9901
we see more species than we thought existed, the likelihood is 0.
9902
9903
\index{multinomial distribution}
9904
Otherwise we select a random set of prevalences, \py{p}, and
9905
compute the multinomial PMF, which is
9906
%
9907
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
9908
%
9909
$p_i$ is the prevalence of the $i$th species, and $x_i$ is the
9910
observed number. The first term, $c_x$, is the multinomial
9911
coefficient; I leave it out of the computation because it is
9912
a multiplicative factor that depends only
9913
on the data, not the hypothesis, so it gets normalized away
9914
(see \url{http://en.wikipedia.org/wiki/Multinomial_distribution}).
9915
\index{multinomial coefficient}
9916
9917
\py{m} is the number of observed species.
9918
We only need the first \py{m} elements of \py{p};
9919
for the others, $x_i$ is 0, so
9920
$p_i^{x_i}$ is 1, and we can leave them out of the product.
9921
9922
9923
\section{Random sampling}
9924
\label{randomdir}
9925
9926
There are two ways to generate a random sample from a Dirichlet
9927
distribution. One is to use the marginal beta distributions, but in
9928
that case you have to select one at a time and scale the rest so they
9929
add up to 1 (see
9930
\url{http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation}).
9931
\index{random sample}
9932
9933
A less obvious, but faster, way is to select values from \py{n} gamma
9934
distributions, then normalize by dividing through by the total.
9935
Here's the code:
9936
\index{numpy}
9937
\index{gamma distribution}
9938
9939
\begin{code}
# class Dirichlet

def Random(self):
    p = numpy.random.gamma(self.params)
    return p / p.sum()
\end{code}
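
To experiment with this method outside the class, here is a minimal
standalone sketch. It uses \py{random.gammavariate} from the Python
standard library in place of \py{numpy.random.gamma}; the function
name \verb"sample_dirichlet", the seed, and the parameters are
chosen only for illustration.

\begin{code}
import random

def sample_dirichlet(params):
    """Draw one set of prevalences from a Dirichlet distribution.

    Each component is a gamma variate with shape params[i] and
    scale 1; dividing by the total makes the values add up to 1.
    """
    gammas = [random.gammavariate(alpha, 1.0) for alpha in params]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(17)
ps = sample_dirichlet([4, 3, 2])
print(ps, sum(ps))

# Averaging many draws should approach the means 4/9, 3/9 and 2/9.
draws = [sample_dirichlet([4, 3, 2]) for _ in range(10000)]
averages = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
print(averages)
\end{code}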
9946
9947
Now we're ready to look at some results. Here is the code that extracts
9948
the posterior distribution of \py{n}:
9949
9950
\begin{code}
# class Species

def DistOfN(self):
    pmf = thinkbayes.Pmf()
    for hypo, prob in self.Items():
        pmf.Set(hypo.n, prob)
    return pmf
\end{code}
9957
9958
\py{DistOfN} iterates
9959
through the top-level hypotheses and accumulates the probability
9960
of each \py{n}.
9961
9962
\begin{figure}
9963
% species.py
9964
\centerline{\includegraphics[height=2.5in]{figs/species2.pdf}}
9965
\caption{Posterior distribution of \py{n}.}
9966
\label{fig.species2}
9967
\end{figure}
9968
9969
Figure~\ref{fig.species2} shows the result. The most likely value is 4.
9970
Values from 3 to 7 are reasonably likely; after that the probabilities
9971
drop off quickly. The probability that there are 29 species is
9972
low enough to be negligible; if we chose a higher bound,
9973
we would get nearly the same result.
9974
9975
Remember that this result is based on a uniform prior for \py{n}. If
we have background information about the number of species in the
environment, we might choose a different prior.
\index{uniform distribution}
9979
9980
9981
\section{Optimization}
9982
9983
I have to admit that I am proud of this example. The Unseen Species
9984
problem is not easy, and I think this solution is simple and clear,
9985
and takes surprisingly few lines of code (about 50 so far).
9986
9987
The only problem is that it is slow. It's good enough for the example
9988
with only 3 observed species, but not good enough for the belly button
9989
data, with more than 100 species in some samples.
9990
9991
The next few sections present a series of optimizations we need to
9992
make this solution scale. Before we get into the details, here's
9993
a road map.
9994
\index{optimization}
9995
9996
\begin{itemize}
9997
9998
\item The first step is to recognize that if we update the Dirichlet
9999
distributions with the same data, the first \py{m} parameters are
10000
the same for all of them. The only difference is the number of
10001
hypothetical unseen species. So we don't really need \py{n}
10002
Dirichlet objects; we can store the parameters in the top level of
10003
the hierarchy. \py{Species2} implements this optimization.
10004
10005
\item \py{Species2} also uses the same set of random values for all
10006
of the hypotheses. This saves time generating random values, but it
10007
has a second benefit that turns out to be more important: by giving
10008
all hypotheses the same selection from the sample space, we make
10009
the comparison between the hypotheses more fair, so it takes
10010
fewer iterations to converge.
10011
10012
\item Even with these changes there is a major performance problem.
10013
As the number of observed species increases, the array of random
10014
prevalences gets bigger, and the chance of choosing one that is
10015
approximately right becomes small. So the vast majority of
10016
iterations yield small likelihoods that don't contribute much to the
10017
total, and don't discriminate between hypotheses.
10018
10019
The solution is to do the updates one species at a time.
{\tt Species4} is a simple implementation of this strategy using
Dirichlet objects to represent the sub-hypotheses.
10022
10023
\item Finally, \py{Species5} combines the sub-hypotheses into the top
10024
level and uses \py{numpy} array operations to speed things up.
10025
\index{numpy}
10026
10027
\end{itemize}
10028
10029
If you are not interested in the details, feel free to skip to
10030
Section~\ref{belly} where we look at results from the belly
10031
button data.
10032
10033
10034
\section{Collapsing the hierarchy}
10035
\label{collapsing}
10036
10037
All of the bottom-level Dirichlet distributions are updated
10038
with the same data, so the first \py{m} parameters are the same for
10039
all of them.
10040
We can eliminate them and merge the parameters into
10041
the top-level suite. \py{Species2} implements this optimization:
10042
\index{numpy}
10043
10044
\begin{code}
class Species2(object):

    def __init__(self, ns):
        self.ns = ns
        self.probs = numpy.ones(len(ns), dtype=numpy.double)
        # one parameter per species, up to the largest value of n
        self.params = numpy.ones(max(ns), dtype=int)
\end{code}
10052
10053
\py{ns} is the list of hypothetical values for \py{n};
10054
\py{probs} is the list of corresponding probabilities. And
10055
\py{params} is the sequence of Dirichlet parameters, initially
10056
all 1.
10057
10058
\py{Species2.Update} updates both levels of
10059
the hierarchy: first the probability for each value of \py{n},
10060
then the Dirichlet parameters:
10061
\index{numpy}
10062
10063
\begin{code}
# class Species2

def Update(self, data):
    like = numpy.zeros(len(self.ns), dtype=numpy.double)
    for i in range(1000):
        like += self.SampleLikelihood(data)

    self.probs *= like
    self.probs /= self.probs.sum()

    m = len(data)
    self.params[:m] += data
\end{code}
10077
10078
\py{SampleLikelihood} returns an array of likelihoods, one for each
10079
value of \py{n}. \py{like} accumulates the total likelihood for
10080
1000 samples. \py{self.probs} is multiplied by the total likelihood,
10081
then normalized. The last two lines, which update the parameters,
10082
are the same as in \py{Dirichlet.Update}.
10083
10084
Now let's look at \py{SampleLikelihood}. There are two
10085
opportunities for optimization here:
10086
10087
\begin{itemize}
10088
10089
\item When the hypothetical number of species, \py{n},
10090
exceeds the observed number, \py{m}, we only need the first \py{m}
10091
terms of the multinomial PMF; the rest are 1.
10092
10093
\item If the number of species is large, the likelihood of the data
might be too small to represent in floating-point
(see Section~\ref{underflow}). So it
is safer to compute log-likelihoods.
10096
\index{log-likelihood} \index{underflow}
10097
10098
\end{itemize}
10099
10100
\index{multinomial distribution}
10101
Again, the multinomial PMF is
10102
%
10103
\[ c_x p_1^{x_1} \cdots p_n^{x_n} \]
10104
%
10105
So the log-likelihood is
10106
%
10107
\[ \log c_x + x_1 \log p_1 + \cdots + x_n \log p_n \]
10108
%
10109
which is fast and easy to compute. Again, $c_x$
is the same for all hypotheses, so we can drop it.
Here's the code:
10112
\index{numpy}
10113
10114
\begin{code}
# class Species2

def SampleLikelihood(self, data):
    gammas = numpy.random.gamma(self.params)

    m = len(data)
    row = gammas[:m]
    col = numpy.cumsum(gammas)

    log_likes = []
    for n in self.ns:
        ps = row / col[n-1]
        terms = data * numpy.log(ps)
        log_like = terms.sum()
        log_likes.append(log_like)

    log_likes -= numpy.max(log_likes)
    likes = numpy.exp(log_likes)

    coefs = [thinkbayes.BinomialCoef(n, m) for n in self.ns]
    likes *= coefs

    return likes
\end{code}
10139
10140
\py{gammas} is an array of values from a gamma distribution; its
10141
length is the largest hypothetical value of \py{n}. \py{row} is
10142
just the first \py{m} elements of \py{gammas}; since these are the
10143
only elements that depend on the data, they are the only ones we need.
10144
\index{gamma distribution}
10145
10146
For each value of \py{n} we need to divide \py{row} by the
total of the first \py{n} values from \py{gammas}. \py{cumsum}
computes these cumulative sums and stores them in \py{col}.
10149
\index{cumulative sum}
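
To see why the cumulative sums give the right normalization, here is
a small sketch with made-up sizes (six parameters, three observed
species). It checks that dividing \py{row} by \py{col[n-1]} gives the
same values as normalizing the first \py{n} gammas directly; the seed
and the name \py{direct} are only for illustration.

\begin{code}
import numpy

numpy.random.seed(1)
params = numpy.ones(6)          # parameters for up to 6 species
gammas = numpy.random.gamma(params)

m = 3                           # number of observed species
row = gammas[:m]
col = numpy.cumsum(gammas)      # col[n-1] is the sum of the first n

for n in [3, 4, 5, 6]:
    ps = row / col[n-1]
    # normalize the first n gammas, then keep the first m
    direct = (gammas[:n] / gammas[:n].sum())[:m]
    assert numpy.allclose(ps, direct)
    print(n, ps)
\end{code}

Computing \py{col} once and indexing into it avoids summing the
first \py{n} gammas again for every hypothesis.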
10150
10151
The loop iterates through the values of \py{n} and accumulates
10152
a list of log-likelihoods.
10153
\index{log-likelihood}
10154
10155
Inside the loop, \py{ps} contains the row of probabilities, normalized
10156
with the appropriate cumulative sum. \py{terms} contains the
10157
terms of the summation, $x_i \log p_i$, and \verb"log_like" contains
10158
their sum.
10159
10160
After the loop, we want to convert the log-likelihoods to linear
likelihoods, but first it's a good idea to shift them so the largest
log-likelihood is 0; that way the linear likelihoods are not too
small (see Section~\ref{underflow}).
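
The following sketch, with made-up log-likelihoods, shows what the
shift accomplishes. It uses only the standard \py{math} module, and
the variable names are mine.

\begin{code}
import math

# Log-likelihoods this negative cannot be exponentiated directly:
# math.exp(-1000) underflows to 0.0 in double precision.
log_likes = [-1000.0, -1001.0, -1003.0]
naive = [math.exp(x) for x in log_likes]

# Shifting by the maximum multiplies every likelihood by the same
# constant, which disappears when the posterior is normalized.
biggest = max(log_likes)
likes = [math.exp(x - biggest) for x in log_likes]

total = sum(likes)
probs = [like / total for like in likes]
print(naive)
print(likes)
print(probs)
\end{code}

Without the shift, all three likelihoods are zero and the
normalization fails; with it, the relative sizes are preserved.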
10164
10165
Finally, before we return the likelihood, we have to apply a correction
factor, which is the number of ways we could have observed these \py{m}
species, if the total number of species is \py{n}.
\py{BinomialCoef} computes ``n choose m'', which is written
$\binom{n}{m}$.
10170
\index{binomial coefficient}
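
For the running example with three observed species, the correction
factors are easy to tabulate. This sketch uses \py{math.comb} from
the Python standard library (version 3.8 or later) as a stand-in, on
the assumption that \py{BinomialCoef} computes the same quantity.

\begin{code}
from math import comb

m = 3                           # observed species
factors = {n: comb(n, m) for n in range(3, 8)}
for n, factor in factors.items():
    print(n, factor)
\end{code}

The factor grows with \py{n}, so hypotheses with more unseen species
get more credit for the same sampled prevalences.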
10171
10172
As often happens, the optimized version is less readable and more
10173
error-prone than the original. But that's one reason I think it is
10174
a good idea to start with the simple version; we can use it for
10175
regression testing. I plotted results from both versions and confirmed
10176
that they are approximately equal, and that they converge as the
10177
number of iterations increases.
10178
\index{regression testing}
10179
10180
10181
\section{One more problem}
10182
10183
There's more we could do to optimize this code, but there's another
10184
problem we need to fix first. As the number of observed
10185
species increases, this version gets noisier and takes more
10186
iterations to converge on a good answer.
10187
10188
The problem is that if the prevalences we choose from the Dirichlet
10189
distribution, the \py{ps}, are not at least approximately right,
10190
the likelihood of the observed data is close to zero and almost
10191
equally bad for all values of \py{n}. So most iterations don't
10192
provide any useful contribution to the total likelihood. And as the
10193
number of observed species, \py{m}, gets large, the probability of
10194
choosing \py{ps} with non-negligible likelihood gets small. Really
10195
small.
10196
10197
Fortunately, there is a solution. Remember that if you observe
10198
a set of data, you can update the prior distribution with the
10199
entire dataset, or you can break it up into a series of updates
10200
with subsets of the data, and the result is the same either way.
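
Here is a small illustration of that equivalence, using a made-up
coin example rather than the species model. It assumes the outcomes
are independent given the hypothesis, so the likelihood of the whole
dataset is the product of the individual likelihoods. The function
names are hypothetical.

\begin{code}
def update(prior, likelihoods):
    """Multiply priors by likelihoods and renormalize."""
    unnorm = [p * like for p, like in zip(prior, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def likelihood(outcome, p):
    return p if outcome == 'H' else 1 - p

# Three hypotheses about the probability of heads.
hypos = [0.25, 0.5, 0.75]
prior = [1/3, 1/3, 1/3]
data = 'HHT'

# One update with the whole dataset.
batch_likes = []
for p in hypos:
    like = 1.0
    for outcome in data:
        like *= likelihood(outcome, p)
    batch_likes.append(like)
batch = update(prior, batch_likes)

# A series of updates, one outcome at a time.
sequential = prior
for outcome in data:
    likes = [likelihood(outcome, p) for p in hypos]
    sequential = update(sequential, likes)

print(batch)
print(sequential)
\end{code}

The two posteriors agree up to floating-point rounding.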
10201
10202
For this example, the key is to perform the updates one species at
10203
a time. That way when we generate a random set of \py{ps}, only
10204
one of them affects the computed likelihood, so the chance of choosing
10205
a good one is much better.
10206
10207
Here's a new version that updates one species at a time:
10208
\index{numpy}
10209
10210
\begin{code}
class Species4(Species):

    def Update(self, data):
        m = len(data)

        for i in range(m):
            one = numpy.zeros(i+1)
            one[i] = data[i]
            Species.Update(self, one)
\end{code}
10221
10222
This version inherits \verb"__init__" from \py{Species}, so it
10223
represents the hypotheses as a list of Dirichlet objects (unlike
10224
\py{Species2}).
10225
10226
\py{Update} loops through the observed species and makes an
10227
array, \py{one}, with all zeros and one species count. Then
10228
it calls \py{Update} in the parent class, which computes
10229
the likelihoods and updates the sub-hypotheses.
10230
10231
So in the running example, we do three updates. The first
10232
is something like ``I have seen three lions.'' The second is
10233
``I have seen two tigers and no additional lions.'' And the third
10234
is ``I have seen one bear and no more lions and tigers.''
10235
10236
Here's the new version of \py{Likelihood}:
10237
10238
\begin{code}
# class Species4

def Likelihood(self, data, hypo):
    dirichlet = hypo
    like = 0
    for i in range(self.iterations):
        like += dirichlet.Likelihood(data)

    # correct for the number of unseen species the new one
    # could have been
    m = len(data)
    num_unseen = dirichlet.n - m + 1
    like *= num_unseen

    return like
\end{code}
10255
10256
This is almost the same as \py{Species.Likelihood}. The difference
10257
is the factor, \verb"num_unseen". This correction is necessary
10258
because each time we see a species for the first time, we have to
10259
consider that there were some number of other unseen species that
10260
we might have seen. For larger values of \py{n} there are more
10261
unseen species that we could have seen, which increases the likelihood
10262
of the data.
10263
10264
This is a subtle point and I have to admit that I did not get it right
10265
the first time. But again I was able to validate this version
10266
by comparing it to the previous versions.
10267
\index{regression testing}
10268
10269
10270
\section{We're not done yet}
10271
10272
\newcommand{\BigO}[1]{\mathcal{O}(#1)}
10273
10274
Performing the updates one species at a time solves one problem, but
10275
it creates another. Each update takes time proportional to $k m$,
10276
where $k$ is the number of hypotheses and $m$ is the number of observed
10277
species. So if we do $m$ updates, the total run time is
10278
proportional to $k m^2$.
10279
10280
But we can speed things up using the same trick we used in
10281
Section~\ref{collapsing}: we'll get rid of the Dirichlet objects and
10282
collapse the two levels of the hierarchy into a single object. So
10283
here's yet another version of \py{Species}:
10284
10285
\begin{code}
class Species5(Species2):

    def Update(self, data):
        m = len(data)
        for i in range(m):
            self.UpdateOne(i+1, data[i])
            self.params[i] += data[i]
\end{code}
10294
10295
This version inherits \verb"__init__" from \py{Species2}, so
10296
it uses \py{ns} and \py{probs} to represent the distribution
10297
of \py{n}, and \py{params} to represent the parameters of
10298
the Dirichlet distribution.
10299
10300
\py{Update} is similar to what we saw in the previous section.
10301
It loops through the observed species and calls \py{UpdateOne}:
10302
\index{numpy}
10303
10304
\begin{code}
# class Species5

def UpdateOne(self, i, count):
    likes = numpy.zeros(len(self.ns), dtype=numpy.double)
    # add up the likelihoods from many random samples
    for _ in range(self.iterations):
        likes += self.SampleLikelihood(i, count)

    unseen_species = [n-i+1 for n in self.ns]
    likes *= unseen_species

    self.probs *= likes
    self.probs /= self.probs.sum()
\end{code}
10318
10319
This function is similar to \py{Species2.Update}, with two changes:
10320
10321
\begin{itemize}
10322
10323
\item The interface is different. Instead of the whole dataset, we
10324
get \py{i}, the index of the observed species, and \py{count},
10325
how many of that species we've seen.
10326
10327
\item We have to apply a correction factor for the number of unseen
10328
species, as in \py{Species4.Likelihood}. The difference here is
10329
that we update all of the likelihoods at once with array
10330
multiplication.
10331
10332
\end{itemize}
10333
10334
Finally, here's \py{SampleLikelihood}:
10335
\index{numpy}
10336
10337
\begin{code}
# class Species5

def SampleLikelihood(self, i, count):
    gammas = numpy.random.gamma(self.params)

    sums = numpy.cumsum(gammas)[self.ns[0]-1:]

    ps = gammas[i-1] / sums
    log_likes = numpy.log(ps) * count

    log_likes -= numpy.max(log_likes)
    likes = numpy.exp(log_likes)

    return likes
\end{code}
10353
10354
This is similar to \py{Species2.SampleLikelihood}; the
10355
difference is that each update only includes a single species,
10356
so we don't need a loop.
10357
10358
The run time of this function is proportional to the number
of hypotheses, $k$. It runs $m$ times, so the run time of
the update is proportional to $k m$.
And the number of iterations we
need to get an accurate result is usually small.
10363
10364
10365
\section{The belly button data}
10366
\label{belly}
10367
10368
That's enough about lions and tigers and bears.
10369
Let's get back to belly buttons. To get a sense of what the
10370
data look like, consider subject B1242,
10371
whose sample of 400 reads yielded 61 species with the following
10372
counts:
10373
10374
\begin{code}
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5,
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
\end{code}
10380
10381
There are a few dominant species that make up a large
10382
fraction of the whole, but many species that yielded only
10383
a single read. The number of these ``singletons'' suggests
10384
that there are likely to be at least a few unseen species.
10385
\index{species}
10386
10387
In the example with lions and tigers, we assume that each
10388
animal in the preserve is equally likely to be observed.
10389
Similarly, for the belly button data, we assume that each
10390
bacterium is equally likely to yield a read.
10391
10392
In reality, each step in the data-collection
10393
process might introduce biases. Some species might
10394
be more likely to be picked up by a swab, or to yield identifiable
10395
amplicons. So when we talk about the prevalence of each species,
10396
we should remember this source of error.
10397
\index{sample bias}
10398
10399
I should also acknowledge that I am using the term ``species''
10400
loosely. First, bacterial species are not well defined. Second,
10401
some reads identify a particular species, others only identify
10402
a genus. To be more precise, I should say ``operational
10403
taxonomic unit'', or OTU.
10404
\index{operational taxonomic unit}
10405
\index{OTU}
10406
10407
Now let's process some of the belly button data. I define
10408
a class called \py{Subject} to represent information about
10409
each subject in the study:
10410
10411
\begin{code}
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
\end{code}
10418
10419
Each subject has a string code, like ``B1242'', and a list of
10420
(count, species name) pairs, sorted in increasing order by count.
10421
\py{Subject} provides several methods to make it
10422
easy to access these counts and species names. You can see the details
10423
in \url{http://thinkbayes.com/species.py}.
10424
For more information
10425
see Section~\ref{download}.
10426
10427
\begin{figure}
10428
% species.py
10429
\centerline{\includegraphics[height=2.5in]{figs/species-ndist-B1242.pdf}}
10430
\caption{Distribution of \py{n} for subject B1242.}
10431
\label{species-ndist}
10432
\end{figure}
10433
10434
\py{Subject} provides a method named \py{Process} that creates and
10435
updates a \py{Species5} suite,
10436
which represents the distributions of \py{n} and the prevalences.
10437
\index{prevalence}
10438
10439
And \py{Species2} provides \py{DistOfN}, which returns the posterior
distribution of \py{n}.

\begin{code}
# class Species2

def DistOfN(self):
    items = zip(self.ns, self.probs)
    pmf = thinkbayes.MakePmfFromItems(items)
    return pmf
\end{code}
10450
10451
Figure~\ref{species-ndist} shows the distribution of \py{n} for
10452
subject B1242. The probability that there are exactly 61 species, and
10453
no unseen species, is nearly zero. The most likely value is 72, with
10454
90\% credible interval 66 to 79. At the high end, it is unlikely that
10455
there are as many as 87 species.
10456
10457
Next we compute the posterior distribution of prevalence for
10458
each species. \py{Species2} provides \py{DistOfPrevalence}:
10459
10460
\begin{code}
# class Species2

def DistOfPrevalence(self, index):
    metapmf = thinkbayes.Pmf()

    for n, prob in zip(self.ns, self.probs):
        beta = self.MarginalBeta(n, index)
        pmf = beta.MakePmf()
        metapmf.Set(pmf, prob)

    mix = thinkbayes.MakeMixture(metapmf)
    return metapmf, mix
\end{code}
10474
10475
\py{index} indicates which species we want. For each
10476
\py{n}, we have a different posterior distribution
10477
of prevalence.
10478
10479
\begin{figure}
10480
% species.py
10481
\centerline{\includegraphics[height=2.5in]{figs/species-prev-B1242.pdf}}
10482
\caption{Distribution of prevalences for subject B1242.}
10483
\label{species-prev}
10484
\end{figure}
10485
10486
The loop iterates through the possible values of \py{n}
10487
and their probabilities. For each value of \py{n} it gets
10488
a Beta object representing the marginal distribution for the
10489
indicated species. Remember that Beta objects contain the
10490
parameters \py{alpha} and \py{beta}; they don't have
10491
values and probabilities like a Pmf, but they provide \py{MakePmf},
10492
which generates a discrete approximation to the continuous
10493
beta distribution.
10494
\index{Beta object}
10495
10496
\py{metapmf} is a meta-Pmf that contains the distributions
10497
of prevalence, conditioned on \py{n}. \py{MakeMixture}
10498
combines the meta-Pmf into \py{mix}, which combines the
10499
conditional distributions into a single distribution
10500
of prevalence.
10501
\index{meta-Pmf}
10502
\index{mixture}
10503
\index{MakeMixture}
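
To make the mixing step concrete, here is a sketch of the computation
I would expect \py{MakeMixture} to perform, written with plain
dictionaries instead of Pmf objects. The function
\verb"make_mixture" and the numbers are invented for illustration;
this is not the \py{thinkbayes} implementation.

\begin{code}
def make_mixture(metapmf):
    """Combine conditional distributions into one.

    metapmf: list of (pmf, weight) pairs, where each pmf is a
             dict that maps values to probabilities
    """
    mix = {}
    for pmf, weight in metapmf:
        for value, prob in pmf.items():
            mix[value] = mix.get(value, 0) + weight * prob
    return mix

# Two made-up conditional distributions of prevalence, one per
# value of n, weighted by made-up posterior probabilities of n.
pmf_n3 = {0.2: 0.5, 0.4: 0.5}
pmf_n4 = {0.2: 0.8, 0.4: 0.2}
mix = make_mixture([(pmf_n3, 0.25), (pmf_n4, 0.75)])
print(mix)
\end{code}

Each probability in the mixture is a weighted sum over the values of
\py{n}, so the result is normalized whenever the weights and the
conditional distributions are.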
10504
10505
Figure~\ref{species-prev} shows results for the five
10506
species with the most reads. The most prevalent species accounts for
10507
23\% of the 400 reads, but since there are almost certainly unseen
10508
species, the most likely estimate for its prevalence is 20\%,
10509
with 90\% credible interval between 17\% and 23\%.
10510
10511
10512
\section{Predictive distributions}
10513
10514
\begin{figure}
10515
% species.py
10516
\centerline{\includegraphics[height=2.5in]{figs/species-rare-B1242.pdf}}
10517
\caption{Simulated rarefaction curves for subject B1242.}
10518
\label{species-rare}
10519
\end{figure}
10520
10521
I introduced the Unseen Species problem in the form of four related
questions. We have answered the first two by computing the posterior
distribution for \py{n} and the prevalence of each species.
10524
\index{predictive distribution}
10525
10526
The other two questions are:
10527
10528
\begin{itemize}
10529
10530
\item If we are planning to collect additional reads, can we predict
10531
how many new species we are likely to discover?
10532
10533
\item How many additional reads are needed to increase the
10534
fraction of observed species to a given threshold?
10535
10536
\end{itemize}
10537
10538
To answer predictive questions like this we can use the posterior
10539
distributions to simulate possible future events and compute
10540
predictive distributions for the number of species, and fraction of
10541
the total, we are likely to see.
10542
10543
The kernel of these simulations looks like this:
10544
\index{simulation}
10545
10546
\begin{enumerate}
10547
10548
\item Choose \py{n} from its posterior distribution.
10549
10550
\item Choose a prevalence for each species, including possible unseen
10551
species, using the Dirichlet distribution.
10552
\index{Dirichlet distribution}
10553
10554
\item Generate a random sequence of future observations.
10555
10556
\item Compute the number of new species, \verb"num_new", as a function
10557
of the number of additional reads, \py{k}.
10558
10559
\item Repeat the previous steps and accumulate the joint distribution
10560
of \verb"num_new" and \py{k}.
10561
\index{joint distribution}
10562
10563
\end{enumerate}
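
Before looking at the methods in \py{Subject}, it might help to see
the five steps in one self-contained sketch. It uses only the
standard \py{random} module, identifies species by index rather than
by name, and takes a made-up posterior for \py{n}; the function
\verb"run_simulation" and its arguments are hypothetical and are not
part of \py{species.py}.

\begin{code}
import random

def run_simulation(n_pmf, counts, num_reads):
    """Simulate one curve of (k, num_new) pairs.

    n_pmf: dict that maps each hypothetical n to its probability
    counts: observed counts for the seen species
    num_reads: number of additional reads to simulate
    """
    m = len(counts)

    # 1. Choose n from its posterior distribution.
    ns = list(n_pmf)
    weights = [n_pmf[x] for x in ns]
    n = random.choices(ns, weights=weights)[0]

    # 2. Choose prevalences from the Dirichlet distribution;
    #    unseen species keep the prior parameter 1.
    params = [1 + c for c in counts] + [1] * (n - m)
    gammas = [random.gammavariate(a, 1.0) for a in params]

    # 3. Generate future observations (species indices);
    #    random.choices does not need normalized weights.
    obs_seq = random.choices(range(n), weights=gammas, k=num_reads)

    # 4. Track the number of new species after each read.
    seen = set(range(m))
    curve = []
    for k, obs in enumerate(obs_seq):
        seen.add(obs)
        curve.append((k + 1, len(seen) - m))
    return curve

random.seed(3)
n_pmf = {3: 0.2, 4: 0.5, 5: 0.3}      # made-up posterior for n

# 5. Repeat to collect several curves.
curves = [run_simulation(n_pmf, [3, 2, 1], 20) for _ in range(5)]
print(curves[0][-1])
\end{code}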
10564
10565
And here's the code. \py{RunSimulation} runs a single simulation:
10566
10567
\begin{code}
# class Subject

def RunSimulation(self, num_reads):
    m, seen = self.GetSeenSpecies()
    n, observations = self.GenerateObservations(num_reads)

    curve = []
    for k, obs in enumerate(observations):
        seen.add(obs)

        num_new = len(seen) - m
        curve.append((k+1, num_new))

    return curve
\end{code}
10583
10584
\verb"num_reads" is the number of additional reads to simulate.
10585
\py{m} is the number of seen species, and \py{seen} is a set of
10586
strings with a unique name for each species.
10587
\py{n} is a random value from the posterior distribution, and
10588
\py{observations} is a random sequence of species names.
10589
10590
Each time through the loop, we add the new observation to
10591
\py{seen} and record the number of reads and the number of
10592
new species so far.
10593
10594
The result of \py{RunSimulation} is a {\bf rarefaction curve},
10595
represented as a list of pairs with the number of reads and
10596
the number of new species.
10597
\index{rarefaction curve}
10598
10599
Before we see the results, let's look at \py{GetSeenSpecies} and
10600
\py{GenerateObservations}.
10601
10602
\begin{code}
# class Subject

def GetSeenSpecies(self):
    names = self.GetNames()
    m = len(names)
    seen = set(SpeciesGenerator(names, m))
    return m, seen
\end{code}
10611
10612
\py{GetNames} returns the list of species names that appear in
10613
the data files, but for many subjects these names are not unique.
10614
So I use \py{SpeciesGenerator} to extend each name with a serial
10615
number:
10616
\index{generator}
10617
10618
\begin{code}
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
\end{code}
10629
10630
Given a name like \py{Corynebacterium}, \py{SpeciesGenerator} yields
10631
\py{Corynebacterium-1}. When the list of names is exhausted, it
10632
yields names like \py{unseen-62}.
10633
10634
Here is \py{GenerateObservations}:
10635
10636
\begin{code}
# class Subject

def GenerateObservations(self, num_reads):
    n, prevalences = self.suite.SamplePosterior()

    names = self.GetNames()
    name_iter = SpeciesGenerator(names, n)

    d = dict(zip(name_iter, prevalences))
    cdf = thinkbayes.MakeCdfFromDict(d)
    observations = cdf.Sample(num_reads)

    return n, observations
\end{code}
10651
10652
Again, \verb"num_reads" is the number of additional reads
10653
to generate. \py{n} and \py{prevalences} are samples from
10654
the posterior distribution.
10655
10656
\py{cdf} is a Cdf object that maps species names, including the
10657
unseen, to cumulative probabilities. Using a Cdf makes it efficient
10658
to generate a random sequence of species names.
10659
\index{Cdf}
10660
\index{cumulative probability}
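
The reason a Cdf is efficient is that each random draw becomes a
binary search in a sorted list of cumulative probabilities. Here is a
minimal sketch of that idea using \py{bisect} and \py{itertools} from
the standard library; the helper functions and the three prevalences
are made up, and the Cdf class in \py{thinkbayes} may differ in its
details.

\begin{code}
import bisect
import itertools
import random

def make_cdf(d):
    """Return lists of names and cumulative probabilities."""
    names = list(d)
    total = sum(d.values())
    probs = (d[name] / total for name in names)
    cumulative = list(itertools.accumulate(probs))
    return names, cumulative

def sample(names, cumulative, k):
    """Draw k names, each with one binary search."""
    result = []
    for _ in range(k):
        u = random.random()
        index = bisect.bisect_left(cumulative, u)
        # guard against rounding that leaves the last
        # cumulative value slightly below 1
        index = min(index, len(names) - 1)
        result.append(names[index])
    return result

random.seed(5)
d = {'lion-0': 0.5, 'tiger-1': 0.3, 'unseen-2': 0.2}
names, cumulative = make_cdf(d)
observations = sample(names, cumulative, 10)
print(observations)
\end{code}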
10661
10662
Finally, here is \py{Species2.SamplePosterior}:
10663
10664
\begin{code}
# class Species2

def SamplePosterior(self):
    pmf = self.DistOfN()
    n = pmf.Random()
    prevalences = self.SamplePrevalences(n)
    return n, prevalences
\end{code}
10671
10672
And \py{SamplePrevalences}, which generates a sample of
10673
prevalences conditioned on \py{n}:
10674
\index{numpy}
10675
\index{random sample}
10676
10677
\begin{code}
# class Species2

def SamplePrevalences(self, n):
    params = self.params[:n]
    gammas = numpy.random.gamma(params)
    gammas /= gammas.sum()
    return gammas
\end{code}
10686
10687
We saw this algorithm for generating random values from a Dirichlet
10688
distribution in Section~\ref{randomdir}.
10689
10690
Figure~\ref{species-rare} shows 100 simulated rarefaction curves
10691
for subject B1242. The curves are ``jittered;''
10692
that is, I shifted each curve by a random offset so they
10693
would not all overlap. By inspection we can estimate that after
10694
400 more reads we are likely to find 2--6 new species.
10695
10696
10697
\section{Joint posterior}
10698
10699
\begin{figure}
10700
% species.py
10701
\centerline{\includegraphics[height=2.5in]{figs/species-cond-B1242.pdf}}
10702
\caption{Distributions of the number of new species conditioned on the number of additional reads.}
10704
\label{species-cond}
10705
\end{figure}
10706
10707
We can use these simulations to estimate the
10708
joint distribution of \verb"num_new" and \py{k}, and from that
10709
we can get the distribution of \verb"num_new" conditioned on any
10710
value of \py{k}.
10711
\index{joint distribution}
10712
10713
\begin{code}
10714
def MakeJointPredictive(curves):
10715
joint = thinkbayes.Joint()
10716
for curve in curves:
10717
for k, num_new in curve:
10718
joint.Incr((k, num_new))
10719
joint.Normalize()
10720
return joint
10721
\end{code}
10722
10723
\py{MakeJointPredictive} makes a Joint object, which is a
10724
Pmf whose values are tuples.
10725
\index{Joint object}
10726
10727
\py{curves} is a list of rarefaction curves created by
10728
\py{RunSimulation}. Each curve contains a list of pairs of
10729
\py{k} and \verb"num_new".
10730
\index{rarefaction curve}
10731
10732
The resulting joint distribution is a map from each pair to
10733
its probability of occurring. Given the joint distribution, we
10734
can use \py{Joint.Conditional}
10735
get the distribution of \verb"num_new" conditioned on \py{k}
10736
(see Section~\ref{conditional}).
10737
\index{conditional distribution}
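
To make the conditioning step concrete, here is a small sketch that
uses plain dictionaries in place of the Joint object. It is not how
\py{thinkbayes} implements \py{Joint.Conditional}; the function names
\verb"make_joint" and \verb"conditional_on_k" and the three toy curves
are hypothetical.

```python
from collections import Counter

def make_joint(curves):
    """Map each (k, num_new) pair to its relative frequency."""
    counts = Counter()
    for curve in curves:
        for k, num_new in curve:
            counts[(k, num_new)] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def conditional_on_k(joint, k):
    """Distribution of num_new given k: select matching pairs, renormalize.

    Assumes k appears in at least one pair of the joint distribution.
    """
    selected = {num_new: p for (kk, num_new), p in joint.items() if kk == k}
    total = sum(selected.values())
    return {num_new: p / total for num_new, p in selected.items()}

# Three toy curves, each with two additional reads.
curves = [[(1, 0), (2, 1)], [(1, 0), (2, 0)], [(1, 1), (2, 1)]]
joint = make_joint(curves)
cond = conditional_on_k(joint, 2)
```

In this toy example, two of the three curves have found one new species
after two reads, so the conditional probability of \verb"num_new" being 1
given \py{k} equal to 2 is 2/3.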

\py{MakeConditionals} takes a list of rarefaction curves and a list of \py{ks}
and computes the conditional distribution of \verb"num_new"
for each \py{k}. The result is a list of Cdf objects.

\begin{code}
def MakeConditionals(curves, ks):
    joint = MakeJointPredictive(curves)

    cdfs = []
    for k in ks:
        pmf = joint.Conditional(1, 0, k)
        pmf.name = 'k=%d' % k
        cdf = pmf.MakeCdf()
        cdfs.append(cdf)

    return cdfs
\end{code}

Figure~\ref{species-cond} shows the results. After 100 reads, the
median predicted number of new species is 2; the 90\% credible
interval is 0 to 5. After 800 reads, we expect to see 3 to 12 new
species.
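
Summaries like these can be read off a conditional distribution by
accumulating probability in sorted order. The sketch below shows one way
to do that for a distribution stored as a dictionary; the helper names
\py{percentile} and \py{summarize} are hypothetical, and the
probabilities are made up rather than taken from subject B1242.

```python
def percentile(pmf, p):
    """Smallest value whose cumulative probability is at least p."""
    total = 0.0
    for value in sorted(pmf):
        total += pmf[value]
        if total >= p:
            return value
    return max(pmf)  # reached only if rounding keeps total below p

def summarize(pmf):
    """Return the median and the central 90% credible interval."""
    median = percentile(pmf, 0.5)
    interval = (percentile(pmf, 0.05), percentile(pmf, 0.95))
    return median, interval

# Made-up distribution of num_new for a single value of k.
pmf = {0: 0.15, 1: 0.3, 2: 0.3, 3: 0.15, 4: 0.1}
median, interval = summarize(pmf)
```

For this made-up distribution the median is 2 and the 90\% credible
interval runs from 0 to 4.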


\section{Coverage}

\begin{figure}
% species.py
\centerline{\includegraphics[height=2.5in]{figs/species-frac-B1242.pdf}}
\caption{Complementary CDF of coverage for a range of additional reads.}
\label{species-frac}
\end{figure}

The last question we want to answer is, ``How many additional reads
are needed to increase the fraction of observed species to a given
threshold?''
\index{coverage}

To answer this question, we need a version of \py{RunSimulation}
that computes the fraction of observed species rather than the
number of new species.

\begin{code}
# class Subject

    def RunSimulation(self, num_reads):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_reads)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
\end{code}

Next we loop through each curve and make a dictionary, \py{d},
that maps from the number of additional reads, \py{k}, to
a list of \py{fracs}; that is, a list of values for the
coverage achieved after \py{k} reads.

\begin{code}
def MakeFracCdfs(self, curves):
    d = {}
    for curve in curves:
        for k, frac in curve:
            d.setdefault(k, []).append(frac)

    cdfs = {}
    for k, fracs in d.items():
        cdf = thinkbayes.MakeCdfFromList(fracs)
        cdfs[k] = cdf

    return cdfs
\end{code}

Then for each value of \py{k} we make a Cdf of \py{fracs}; this Cdf
represents the distribution of coverage after \py{k} reads.

Remember that the CDF tells you the probability of falling at or below a
given threshold, so the {\em complementary} CDF tells you the
probability of exceeding it. Figure~\ref{species-frac} shows
complementary CDFs for a range of values of \py{k}.
\index{complementary CDF}
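
For a single value of \py{k}, the complementary CDF at a given
threshold can be estimated directly from the simulated coverage values
as the fraction of them that exceed the threshold. The following sketch
assumes a plain list of fractions; the function name
\verb"prob_exceeds" and the five example values are hypothetical.

```python
def prob_exceeds(fracs, threshold):
    """Estimate the complementary CDF at threshold.

    Returns the fraction of simulated coverage values that are
    strictly greater than threshold.
    """
    return sum(1 for frac in fracs if frac > threshold) / len(fracs)

# Made-up coverage values from five simulations at one value of k.
fracs = [0.85, 0.88, 0.91, 0.93, 0.95]
p = prob_exceeds(fracs, 0.9)
```

Here three of the five simulated values exceed 0.9, so the estimated
probability of exceeding 90\% coverage is 0.6.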

To read this figure, select the level of coverage you want to achieve
along the $x$-axis. As an example, choose 90\%.
\index{coverage}

Now you can read up the chart to find the probability of achieving
90\% coverage after \py{k} reads. For example, with 200 reads,
you have about a 40\% chance of getting 90\% coverage. With 1000 reads, you
have a 90\% chance of getting 90\% coverage.

With that, we have answered the four questions that make up the unseen
species problem. To validate the algorithms in this chapter with
real data, I had to deal with a few more details. But
this chapter is already too long, so I won't discuss them here.

You can read about the problems, and how I addressed them, at
\url{http://allendowney.blogspot.com/2013/05/belly-button-biodiversity-end-game.html}.

You can download the code in this chapter from
\url{http://thinkbayes.com/species.py}.
For more information
see Section~\ref{download}.


\section{Discussion}

The Unseen Species problem is an area of active research, and I
believe the algorithm in this chapter is a novel contribution. So in
fewer than 200 pages we have made it from the basics of probability to
the research frontier. I'm very happy about that.

My goal for this book is to present three related ideas:

\begin{itemize}

\item {\bf Bayesian thinking}: The foundation of Bayesian analysis is
the idea of using probability distributions to represent uncertain
beliefs, using data to update those distributions, and using the
results to make predictions and inform decisions.

\item {\bf A computational approach}: The premise of this book is that
it is easier to understand Bayesian analysis using computation
rather than math, and easier to implement Bayesian methods with
reusable building blocks that can be rearranged to solve real-world
problems quickly.

\item {\bf Iterative modeling}: Most real-world problems involve
modeling decisions and trade-offs between realism and complexity.
It is often impossible to know ahead of time what factors should be
included in the model and which can be abstracted away. The best
approach is to iterate, starting with simple models and adding
complexity gradually, using each model to validate the others.

\end{itemize}

These ideas are versatile and powerful; they are applicable to
problems in every area of science and engineering, from simple
examples to topics of current research.

If you made it this far, you should be prepared to apply these
tools to new problems relevant to your work. I hope you find
them useful; let me know how it goes!



%\chapter{Future chapters}

%Bayesian regression (hybrid version with resampling?)
%\url{http://www.reddit.com/r/statistics/comments/1647yj/which_regression_technique/}

%Change point detection:

%Deconvolution: Estimating round trip times

%Bayesian search

%Extension of the Euro problem: evaluating reddit items and redditors
%\url{http://www.reddit.com/r/statistics/comments/15rurz/question_about_continuous_bayesian_inference/}

%Charles Darwin problem (capture-tag-recapture)
%\url{http://maximum-entropy-blog.blogspot.com/2012/04/capture-recapture-and-charles-darwin.html}

% http://camdp.com/blogs/how-solve-price-rights-showdown

% https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

% http://blog.yhathq.com/posts/estimating-user-lifetimes-with-pymc.html

\printindex

\end{document}