Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Project: Paul Jasper Torres - LS 40 Section 1E

Path: Week 10/Lab 10 - Bayesian.ipynb

Views: ¹⁰⁶
Image: ubuntu2004

Kernel: Python 3 (system-wide)

Lab 10: Bayesian Statistics

In order to understand how Bayesian statistics works, we're going to start with a simple scenario. Imagine you are studying coastal birds of southern California. Two models describing the abundance of seagulls and cormorants have been proposed. Model 1 predicts that you will observe 75% cormorants and 25% seagulls, while model 2 predicts that each species will make up 50% of your observations. Heading out to Ballona Creek to observe birds, you want to know the probability that model 1 is true.

Let's call "model 1 being true" event A and "observing a bird (either seagull or cormorant)" event B. We can easily find $p(B|A)$ , but using this information to determine the probability that model 1 is correct requires an equation known as Bayes' theorem.

p(A|B) = \frac{p(B|A)p(A)}{p(B)}

In this lab, you will first compute $p(A|B)$ and then update this probability based on new evidence (observing more birds).

Import Numpy and Seaborn.

In [1]:

import numpy as np
import seaborn as sns

Set up an array for each model with the probabilities of observing cormorants and seagulls.

In [3]:

prob_model1 = np.array([0.75,0.25])
prob_model2 = np.array([0.50,0.50])

If you know nothing that would favor one model over the other, what is the probability that model 1 is best? Set up another array that has the probabilities of each model.

In [4]:

prob_model1_best = np.array([0.50,0.50])

We need one more piece of information: $p(B)$ . This is the overall probability of observing a cormorant, regardless of model. It is the sum of the probabilities of observing a cormorant predicted by each model, weighted by the relative probabilities of the models. The general formula is $p(B) = p(B|A)p(A) + p(B|\lnot A)p(\lnot A)$ , where $\lnot A$ means "not A" -- in this case, "model 2 being true".

Find p(B).

In [8]:

pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1])
pS = 1-pC

pC

0.625

Now, use Bayes' theorem to find the probability that model 1 is true given that you observed a cormorant.

In [11]:

prob_model1_givenC = prob_model1[0] * prob_model1_best[0] / pC
prob_model1_givenC

0.6

Update the array holding the probability of each model to reflect your new knowledge. Make sure to change both numbers!

In [13]:

prob_model1_best[0] = prob_model1_givenC #redefining p(M1) with p(M1|C)
prob_model1_best[1] = 1 - prob_model1_givenC
prob_model1_best

array([0.6, 0.4])

The next bird you observe is also a cormorant. Now, what is the probability of model 1 being true? Do this calculation and update the probability array using arrays and indexing rather than typing in numbers. (This is a step toward automating the process.)

In [14]:

pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1])
pS = 1-pC

prob_model1_givenC = prob_model1[0] * prob_model1_best[0] / pC
prob_model1_givenC

0.6923076923076923

You now observe a seagull. Update the probabilities again, also without hard-coding numbers.

In [19]:

prob_model1_best[0] = prob_model1_givenC #redefining p(M1) with p(M1|C)
prob_model1_best[1] = 1 - prob_model1_givenC
prob_model1_best

array([0.69230769, 0.30769231])

In [20]:

pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1])
pS = 1-pC

prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS
prob_model1_givenS

0.5294117647058822

In [21]:

prob_model1_best[0] = prob_model1_givenS
prob_model1_best[1] = 1 - prob_model1_givenS
prob_model1_best

array([0.52941176, 0.47058824])

Write a script or function that will perform the updating automatically for a sequence of birds. Test it on a sequence of 3 (you can use a list of Cs and Ss).

In [30]:

prob_M1 = np.array( [ 0.75, 0.25 ] )
prob_M2 = np.array( [ 0.50, 0.50 ] )
prior = np.array( [ 0.5, 0.5 ] )

ppv_value = []

test = ["C","C","S"]

for bird in test:
    if bird == 'C':
        pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1]
        pS = 1 - pC
        prob_model1_givenC = prob_M1[0] * prior[0] / pC
        prior[0] = prob_model1_givenC
        prior[1] = 1 - prob_model1_givenC
    else:
        pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1]
        pS = 1 - pC
        prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS
        prior[0] = prob_model1_givenS
        prior[1] = 1 - prob_model1_givenS
    ppv_value.append(prior[0])
print(ppv_value)

[0.5930232558139535, 0.7033531638723635, 0.3599999999999999]

Starting with a prior of 0.5 for each model, use the list given below to update your array of probabilities for the models. Make a list of $p(A|B)$ values.

In [32]:

test = ['C', 'S', 'C', 'C', 'C', 'C', 'C', 'C', 'S', 'C', 'C', 'C', 'S', 'C', 'S', 'C', 'C', 'C', 'C', 'C']

prob_M1 = np.array( [ 0.75, 0.25 ] )
prob_M2 = np.array( [ 0.50, 0.50 ] )
prior = np.array( [ 0.5, 0.5 ] )

ppv_value = []

for bird in test:
    if bird == 'C':
        pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1]
        pS = 1 - pC
        prob_model1_givenC = prob_M1[0] * prior[0] / pC
        prior[0] = prob_model1_givenC
        prior[1] = 1 - prob_model1_givenC
    else:
        pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1]
        pS = 1 - pC
        prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS
        prior[0] = prob_model1_givenS
        prior[1] = 1 - prob_model1_givenS
    ppv_value.append(prior[0])
print(ppv_value)

[0.5930232558139535, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.7123761693061398, 0.8449112705723985, 1.002104065097496, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.3599999999999999, 0.42697674418604636, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.7123761693061398, 0.8449112705723985]

Plot your list using the sns.lineplot command. The basic syntax is sns.lineplot(x,y). You can use range to generate a list of x-values and use your list of $p(A|B)$ values as y-values.

In [24]:

#TODO

Describe what you see in the graph. How does the probability of model 1 change as we collect new evidence?

In [0]:

#TODO

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Lab 10: Bayesian Statistics

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Lab 10: Bayesian Statistics

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.