Contact Us!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

| Download
Views: 106
Image: ubuntu2004
Kernel: Python 3 (system-wide)

Lab 10: Bayesian Statistics

In order to understand how Bayesian statistics works, we're going to start with a simple scenario. Imagine you are studying coastal birds of southern California. Two models describing the abundance of seagulls and cormorants have been proposed. Model 1 predicts that you will observe 75% cormorants and 25% seagulls, while model 2 predicts that each species will make up 50% of your observations. Heading out to Ballona Creek to observe birds, you want to know the probability that model 1 is true.

Let's call "model 1 being true" event A and "observing a bird (either seagull or cormorant)" event B. We can easily find p(BA)p(B|A), but using this information to determine the probability that model 1 is correct requires an equation known as Bayes' theorem.

p(AB)=p(BA)p(A)p(B)p(A|B) = \frac{p(B|A)p(A)}{p(B)}

In this lab, you will first compute p(AB)p(A|B) and then update this probability based on new evidence (observing more birds).

  1. Import Numpy and Seaborn.

import numpy as np import seaborn as sns
  1. Set up an array for each model with the probabilities of observing cormorants and seagulls.

prob_model1 = np.array([0.75,0.25]) prob_model2 = np.array([0.50,0.50])
  1. If you know nothing that would favor one model over the other, what is the probability that model 1 is best? Set up another array that has the probabilities of each model.

prob_model1_best = np.array([0.50,0.50])

We need one more piece of information: p(B)p(B). This is the overall probability of observing a cormorant, regardless of model. It is the sum of the probabilities of observing a cormorant predicted by each model, weighted by the relative probabilities of the models. The general formula is p(B)=p(BA)p(A)+p(B¬A)p(¬A)p(B) = p(B|A)p(A) + p(B|\lnot A)p(\lnot A), where ¬A\lnot A means "not A" -- in this case, "model 2 being true".

  1. Find p(B).

pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1]) pS = 1-pC pC
0.625
  1. Now, use Bayes' theorem to find the probability that model 1 is true given that you observed a cormorant.

prob_model1_givenC = prob_model1[0] * prob_model1_best[0] / pC prob_model1_givenC
0.6
  1. Update the array holding the probability of each model to reflect your new knowledge. Make sure to change both numbers!

prob_model1_best[0] = prob_model1_givenC #redefining p(M1) with p(M1|C) prob_model1_best[1] = 1 - prob_model1_givenC prob_model1_best
array([0.6, 0.4])
  1. The next bird you observe is also a cormorant. Now, what is the probability of model 1 being true? Do this calculation and update the probability array using arrays and indexing rather than typing in numbers. (This is a step toward automating the process.)

pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1]) pS = 1-pC prob_model1_givenC = prob_model1[0] * prob_model1_best[0] / pC prob_model1_givenC
0.6923076923076923
  1. You now observe a seagull. Update the probabilities again, also without hard-coding numbers.

prob_model1_best[0] = prob_model1_givenC #redefining p(M1) with p(M1|C) prob_model1_best[1] = 1 - prob_model1_givenC prob_model1_best
array([0.69230769, 0.30769231])
pC = (prob_model1[0] * prob_model1_best[0]) + (prob_model2[0] * prob_model1_best[1]) pS = 1-pC prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS prob_model1_givenS
0.5294117647058822
prob_model1_best[0] = prob_model1_givenS prob_model1_best[1] = 1 - prob_model1_givenS prob_model1_best
array([0.52941176, 0.47058824])
  1. Write a script or function that will perform the updating automatically for a sequence of birds. Test it on a sequence of 3 (you can use a list of Cs and Ss).

prob_M1 = np.array( [ 0.75, 0.25 ] ) prob_M2 = np.array( [ 0.50, 0.50 ] ) prior = np.array( [ 0.5, 0.5 ] ) ppv_value = [] test = ["C","C","S"] for bird in test: if bird == 'C': pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1] pS = 1 - pC prob_model1_givenC = prob_M1[0] * prior[0] / pC prior[0] = prob_model1_givenC prior[1] = 1 - prob_model1_givenC else: pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1] pS = 1 - pC prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS prior[0] = prob_model1_givenS prior[1] = 1 - prob_model1_givenS ppv_value.append(prior[0]) print(ppv_value)
[0.5930232558139535, 0.7033531638723635, 0.3599999999999999]
  1. Starting with a prior of 0.5 for each model, use the list given below to update your array of probabilities for the models. Make a list of p(AB)p(A|B) values.

test = ['C', 'S', 'C', 'C', 'C', 'C', 'C', 'C', 'S', 'C', 'C', 'C', 'S', 'C', 'S', 'C', 'C', 'C', 'C', 'C'] prob_M1 = np.array( [ 0.75, 0.25 ] ) prob_M2 = np.array( [ 0.50, 0.50 ] ) prior = np.array( [ 0.5, 0.5 ] ) ppv_value = [] for bird in test: if bird == 'C': pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1] pS = 1 - pC prob_model1_givenC = prob_M1[0] * prior[0] / pC prior[0] = prob_model1_givenC prior[1] = 1 - prob_model1_givenC else: pC = prob_model1[0] * prob_model1_best[0] + prob_model2[0] * prob_model1_best[1] pS = 1 - pC prob_model1_givenS = prob_model1[1] * prob_model1_best[0] / pS prior[0] = prob_model1_givenS prior[1] = 1 - prob_model1_givenS ppv_value.append(prior[0]) print(ppv_value)
[0.5930232558139535, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.7123761693061398, 0.8449112705723985, 1.002104065097496, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.3599999999999999, 0.42697674418604636, 0.3599999999999999, 0.42697674418604636, 0.5064142779881016, 0.6006308878463531, 0.7123761693061398, 0.8449112705723985]
  1. Plot your list using the sns.lineplot command. The basic syntax is sns.lineplot(x,y). You can use range to generate a list of x-values and use your list of p(AB)p(A|B) values as y-values.

#TODO
  1. Describe what you see in the graph. How does the probability of model 1 change as we collect new evidence?

#TODO