
Lab 4: Bayesian Networks, Markov Chain Monte Carlo (MCMC) and PyMC

In this lab we will focus on expressing probability distributions in the form of Bayesian Networks and using PyMC to (approximately) sample from these distributions and perform inference.

The last set of exercises in this lab will require the use of the pymc package so, if you are using your own machine, make sure to have this installed now (recommended) or before you get started on that section. (If you are using lab machines then you just need to ensure that the coms30035 virtual environment is active.) The PyMC installation instructions suggest that you install PyMC into a new conda environment. If you do that (and I would recommend it) you need to make sure any jupyter notebooks are run inside that conda environment. You can make this happen by first activating the conda environment and then installing jupyter in it. For example, if you chose to call your PyMC conda environment pymc_env then this should work (assuming you have conda set up correctly):

conda activate pymc_env
conda install jupyter

OK, let's now get on with the exercises. First, we will import the required packages for the initial exercises:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline

1) Ancestral Sampling

We will first recap the concept of ancestral sampling from lectures. Familiarise yourself with the example of a simple Bayesian network shown below.

(C=Cloudy, R=Rain, S=Sprinkler, W=Wet Grass and T/F refer to True/False)

This Bayesian network models the joint distribution over 4 variables, P(C, R, S, W). To save time, the probability of each variable being True given its parents is provided as a Python function below:

def P_C():
    return 0.5  # P(C = T)

def P_R(C):
    return 0.2 if C == False else 0.8  # P(R = T | C)

def P_S(C):
    return 0.5 if C == False else 0.1  # P(S = T | C)

def P_W(R, S):
    # P(W = T | R, S)
    if R == False:
        return 0.0 if S == False else 0.9
    else:
        return 0.9 if S == False else 0.99

The concept of ancestral sampling is very simple and refers to a method of sampling from a distribution over multiple variables by first sampling from nodes in the graph that have no parents and then sampling from their child nodes conditioned on those sampled values. This process of sampling the child node variables conditioned on their parents is repeated until all nodes in the graph have a sampled value (remember that a valid graph must be acyclic so this process will always terminate with a finite number of nodes).

1.1) Perform Ancestral Sampling

Use the above functions to:

  1. Perform ancestral sampling to generate a number of samples (~100-1000 samples) from the joint distribution. (hint np.random.rand or np.random.choice are helpful to do this)

  2. Store the generated samples in a list of the form: [[$\mathbf{C}_1$, $\mathbf{R}_1$, $\mathbf{S}_1$, $\mathbf{W}_1$], [$\mathbf{C}_2$, $\mathbf{R}_2$, $\mathbf{S}_2$, $\mathbf{W}_2$], ...] where 1 and 2 refer to sample indices.

# write your code here
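If you would like to check your approach, here is a minimal sketch of one possible implementation. It assumes (consistent with the functions above) that each function returns the probability that its variable is True given its parents, and uses np.random.rand as suggested in the hint:

n_samples = 1000
samples = []
for _ in range(n_samples):
    # Sample each node given its already-sampled parents (C before R and S, which come before W)
    C = np.random.rand() < P_C()
    R = np.random.rand() < P_R(C)
    S = np.random.rand() < P_S(C)
    W = np.random.rand() < P_W(R, S)
    samples.append([C, R, S, W])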

The ability to obtain a large number of samples from the joint distribution like this is very powerful because it allows many different quantities relating to the distribution to be estimated. Use the list of samples to compute and print out estimates of the following:

  • Marginal distributions of each variable: P(C), P(R), P(S) and P(W).

  • Conditional distributions of each variable where W=T: P(C | W=T), P(R | W=T) and P(S | W=T) (hint discard samples where W=F).

# write your code here
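A possible sketch, reusing the samples list generated above; each probability is estimated as the fraction of (retained) samples in which the variable is True:

samples_arr = np.array(samples)  # shape (n_samples, 4); columns are C, R, S, W

# Marginals: P(X=T) estimated as the fraction of samples where X is True
for i, name in enumerate(['C', 'R', 'S', 'W']):
    print(f"P({name}=T) ≈ {samples_arr[:, i].mean():.3f}")

# Conditionals given W=T: discard samples where W=F, then average as before
w_true = samples_arr[samples_arr[:, 3]]
for i, name in enumerate(['C', 'R', 'S']):
    print(f"P({name}=T | W=T) ≈ {w_true[:, i].mean():.3f}")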

For this simple example distribution you may have noticed that exact values for all of these quantities could have been computed directly without the need for sampling. However, as we will see in the following exercises there are many cases where sampling is still feasible but exact or direct computation is not.

2) Markov Chain Monte Carlo

Markov Chain Monte Carlo methods are a set of algorithms with the purpose of generating samples from a distribution. Let's break down the meaning of the individual terms:

  • Monte Carlo simply refers to the idea of approximating a complicated system with a statistical sample.

  • A Markov chain refers to a stochastic process involving a number of probabilistic state transitions from one state to another. The Markov property states that any given state transition probability is determined by only the current state, not any of the preceding states.

Together, Markov Chain Monte Carlo methods are a set of methods that use a Markov chain to (approximately) generate samples from some desired distribution. The Markov chain transition probabilities are set up in such a way that the sequence of probability distributions from which sampled states of the chain are drawn will eventually converge to this desired distribution. Note that the initial states of the chain may be sampled from distributions far from the desired distribution, which is why they are typically discarded.

2.1) Defining a Gaussian probability density function

We will first define a simple distribution that we intend to generate samples from. In the cell below, create a function gaussian_pdf that has parameters mean ($\mu$) and standard deviation ($\sigma$) and returns a function for the Gaussian probability density function corresponding to those parameters:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

You are free to use norm from scipy.stats or define this function directly from the equation using numpy. Note that gaussian_pdf should take just two parameters (mu and sigma) and should return a function $p(x)$ that takes a single parameter.

For this implementation it may be useful to use a lambda function, a Python feature that allows the creation of anonymous functions. If this concept is unfamiliar to you then please see the official Python tutorial.

# write your code here
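For reference, a minimal sketch using a lambda (assuming the two-parameter signature described above); the scipy.stats variant is shown as a comment:

def gaussian_pdf(mu, sigma):
    # Return the density function p(x) for a Gaussian with mean mu and
    # standard deviation sigma
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma)**2) / np.sqrt(2 * np.pi * sigma**2)
    # Equivalently, using scipy: return lambda x: norm.pdf(x, loc=mu, scale=sigma)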

The code below uses this function to create and plot a Gaussian pdf with mu=0.0 and sigma=1.0.

true_mean = 0.0
true_variance = 1.0

z = np.linspace(-5.0, 5.0, 500)
true_pdf = gaussian_pdf(mu=true_mean, sigma=np.sqrt(true_variance))

plt.figure(figsize=(10,5))
plt.plot(z, true_pdf(z))
plt.xlabel('$x$')
plt.ylabel('$p(x)$')
plt.show()

2.2) Metropolis Algorithm

The Metropolis Algorithm is one of the simplest instances of a Markov Chain Monte Carlo method. The goal of the algorithm is to generate samples from some distribution $p(z)$, which in our case is a univariate Gaussian distribution. Let us assume that we do not have an easy way to sample from this distribution (in reality this distribution is very easy to sample from) and that we can only evaluate the unnormalised density, $\tilde{p}(z)$, where $p(z) = \frac{\tilde{p}(z)}{Z_p}$ and $Z_p$ may be unknown or computationally intractable.

To generate samples, $z_1, z_2, \ldots, z_N$, from $p(z)$ using an MCMC method such as the Metropolis algorithm we must first define a proposal distribution, $q(z_{t+1}^\star \vert z_t)$, that uses the current state in the chain, $z_t$, to propose new states, $z_{t+1}^\star$. The only requirements for this proposal distribution are that it can easily be sampled directly and that it is symmetric (although this method can be extended for non-symmetric proposal distributions). For simplicity, in this case we will choose to use a Gaussian proposal distribution with a mean of $z_t$ and a fixed variance $\sigma_\star^2$:

$$z_{t+1}^\star \sim \mathcal{N}(z_t, \sigma_\star^2)$$

An initial value for the first state, $z_1$, is chosen at the beginning and is used to propose a value for the next state. The newly proposed value is either accepted as the next state in the chain, $z_2 = z_2^\star$, or it is rejected, $z_2 = z_1$. Acceptance occurs probabilistically with an acceptance probability of:

$$A(z_{t+1}^\star, z_t) = \min\left(1, \frac{\tilde{p}(z_{t+1}^\star)}{\tilde{p}(z_t)}\right)$$

Note that when $\tilde{p}(z_{t+1}^\star) \geq \tilde{p}(z_t)$ the new sample is always accepted. It turns out that if we take enough samples in this way then the distribution of $z_t$ (typically) converges to $p(z)$.

Implement the Metropolis algorithm described above and use it to generate samples $z_t$ such that $p(z_t)$ converges to gaussian_pdf with mu=0.0 and sigma=1.0. Set your initial sample value $z_1 = 0$ and the proposal distribution variance $\sigma_\star^2 = 0.25$. (hint sampling from the proposal distribution can be achieved using np.random.normal)

The plotting functions plot_samples and plot_samples_histogram have been provided to help with visualising your implementation:

  • Both functions require the true_pdf function and a list of generated samples samples_list as arguments.

  • plot_samples generates a single plot per sample so this should be used with a small number of samples ($N \leq 20$).

  • plot_samples_histogram creates a single histogram plot and is a better visualisation for a larger number of samples ($N \geq 5000$).

def plot_samples(true_pdf, samples_list):
    assert len(samples_list) <= 20, "Number of samples too high! Please call this function with a maximum of 20 samples."
    for i in range(1, len(samples_list)):
        plt.figure(figsize=(10,5))
        plt.plot(z, true_pdf(z))
        for sample in samples_list[:i-1]:
            plt.axvline(sample, c='green')
        plt.axvline(samples_list[i-1], c='red')
        plt.show()

def plot_samples_histogram(true_pdf, samples_list):
    plt.figure(figsize=(10,5))
    plt.plot(z, true_pdf(z))
    plt.hist(samples_list, density=True, histtype='step', bins=50)
    plt.show()
# write your code here
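If you get stuck, here is a minimal sketch of one possible implementation; the function name metropolis and its argument names are our own choices:

def metropolis(p_tilde, n_samples, z_init=0.0, proposal_var=0.25):
    proposal_std = np.sqrt(proposal_var)
    z = z_init
    samples = [z]
    for _ in range(n_samples - 1):
        # Propose a new state from q(z* | z_t) = N(z_t, sigma_star^2)
        z_star = np.random.normal(loc=z, scale=proposal_std)
        # Accept with probability min(1, p~(z*) / p~(z_t));
        # on rejection the chain stays at the current state
        if np.random.rand() < p_tilde(z_star) / p_tilde(z):
            z = z_star
        samples.append(z)
    return samples

samples_list = metropolis(true_pdf, n_samples=5000)
plot_samples_histogram(true_pdf, samples_list)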

Exercises:

  1. Estimate the mean and variance of the true distribution from your samples. How does the accuracy of these estimates change if you generate more samples?

  2. The current implementation uses a normalised Gaussian pdf. Think about what would happen if the density function was unnormalised. Change gaussian_pdf to multiply all of its outputs by some constant value and check that you were correct.

  3. Go back to Section 2.1 and experiment with different density functions:

    • Create a function gaussian_mixture_pdf that returns the density function for a mixture of two Gaussians with parameters: $\mu_1 = -2$, $\sigma_1 = 0.5$, $\mu_2 = 2$, $\sigma_2 = 0.5$. This can be achieved simply by creating the two density functions separately using gaussian_pdf and summing the result (see the sketch after this list). Set this density to true_pdf and run the rest of your code again to generate and plot the distribution of samples from the Gaussian mixture. What do you notice about the distribution of your samples?

    • (Optional) Implement the density function for a Uniform distribution, uniform_pdf, over the range $(-2, 2)$ and run the code again. (hint: you can use scipy.stats.uniform or implement it yourself using numpy).
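For the first bullet, a possible sketch of gaussian_mixture_pdf; note that the plain sum is an unnormalised density, which the Metropolis algorithm handles without modification:

def gaussian_mixture_pdf(mu1, sigma1, mu2, sigma2):
    p1 = gaussian_pdf(mu1, sigma1)
    p2 = gaussian_pdf(mu2, sigma2)
    # The sum integrates to 2 rather than 1, i.e. it is unnormalised,
    # but that is exactly the setting the Metropolis algorithm supports
    return lambda x: p1(x) + p2(x)

true_pdf = gaussian_mixture_pdf(-2, 0.5, 2, 0.5)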

3) Bayesian Linear Regression with PyMC

You should be familiar with the concept of linear regression. Previously we have looked at using Maximum Likelihood Estimation (equivalently, minimising the sum of squared errors) to obtain a single value (often known as a point estimate) for the parameters. Here we take a Bayesian approach to linear regression, where the goal is to obtain a posterior probability distribution over the parameters.

Introducing PyMC

PyMC is a library that provides a lot of useful functionality for working with probabilistic models in Python. Importantly, it allows for Bayesian networks to be programmatically defined and it provides efficient implementations of a number of different MCMC methods including the Metropolis algorithm.

If you are using your own machine and haven't done so already then please make sure you have installed pymc before proceeding. PyMC uses ArviZ for plotting and diagnostics, so we start by importing both pymc and arviz:

import pymc as pm
import arviz as az

3.1) Generating some example data for our linear model

We will first generate some example data $\mathcal{D} = \{(x_i, y_i)\}_N$ where $y_i = w_0 + w_1 x_i + \epsilon$ with $w_0 = 6$, $w_1 = 2$ and $\epsilon \sim \mathcal{N}(\mu=0, \sigma=1)$.

n = 50
true_w0 = 6
true_w1 = 2
true_sigma = 1

x = np.linspace(0, 1, n)
y = true_w0 + true_w1*x + np.random.normal(scale=true_sigma, size=n)

Make a scatter plot of the data points and plot the line corresponding to the mean of $y$ using the known parameters.

# write your code here
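One possible version of the requested plot:

plt.figure(figsize=(10, 5))
plt.scatter(x, y, label='data')
plt.plot(x, true_w0 + true_w1*x, c='red', label='true mean of $y$')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend()
plt.show()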

3.2) Bayesian Linear Regression model in PyMC

Without knowledge of the true model parameters, the goal of Bayesian linear regression is to obtain a distribution (posterior) over the model parameters from the data, $P(w_0, w_1, \sigma \vert \mathcal{D})$. First we need to define a prior over each of the three parameters. Let's choose fairly 'flat' priors (high variance) representing a lack of prior knowledge as to the true values of the parameters:

  • $p(w_0) = \mathcal{N}(0, 20)$

  • $p(w_1) = \mathcal{N}(0, 20)$

  • $p(\sigma) = U(0, 20)$

Below is the code to define this model in pymc. Note that this code performs MCMC using a No U-Turn Sampler (NUTS) which operates using the same principles as the Metropolis algorithm but is much more efficient.

num_samples = 1000

model = pm.Model()
with model:
    # Defining our priors
    w0 = pm.Normal('w0', mu=0, sigma=20)
    w1 = pm.Normal('w1', mu=0, sigma=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    # auxiliary variables
    y_est = w0 + w1*x

    likelihood = pm.Normal('y', mu=y_est, sigma=sigma, observed=y)

    # inference
    sampler = pm.NUTS()  # Hamiltonian MCMC with No U-Turn Sampler
    # or alternatively
    # sampler = pm.Metropolis()
    idata = pm.sample(num_samples, step=sampler, progressbar=True)

The idata variable is an InferenceData object which now contains values sampled from distributions close to the posterior distribution for each of the model parameters (along with other useful information). Information on each variable can be accessed from the idata.posterior attribute using dictionary syntax, for example: w1_trace = idata.posterior['w1'].

When using PyMC one almost always runs multiple chains: if the estimates (e.g. of posterior means) we get from each chain are close, that is evidence that each chain has converged (i.e. ended up sampling from the same distribution, which will be the 'target' distribution, namely the posterior distribution). In the PyMC run we have just done, 4 Markov chains were run (the default number), so idata.posterior['w1'] has information on all chains. The sampled values from the first chain are contained in idata.posterior['w1'][0], which is an object of type xarray.DataArray. If you want the sampled values of the first chain for w1 as a numpy array then you need idata.posterior['w1'][0].values.

Make the following plots:

  • Histogram of the samples of $p(w_0 \vert \mathcal{D})$, $p(w_1 \vert \mathcal{D})$ and $p(\sigma \vert \mathcal{D})$ from the first chain.

  • A two-dimensional histogram of $p(w_0, w_1 \vert \mathcal{D})$ (see hist2d) as estimated from values in the first chain. What is the relationship between these two parameters and how do you interpret this relationship in terms of the model?

# write your code here
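A sketch of one way to make these plots, extracting the first chain as described above; whether the correlation mentioned in the comment appears depends on your generated data, but it is typical when all $x$ values are positive:

w0_samples = idata.posterior['w0'][0].values
w1_samples = idata.posterior['w1'][0].values
sigma_samples = idata.posterior['sigma'][0].values

# Histograms of the (approximate) marginal posteriors from the first chain
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, vals, name in zip(axes, [w0_samples, w1_samples, sigma_samples],
                          [r'$w_0$', r'$w_1$', r'$\sigma$']):
    ax.hist(vals, bins=30, density=True)
    ax.set_xlabel(name)
plt.show()

# 2D histogram of the joint posterior over (w0, w1); with x >= 0 these are
# typically negatively correlated, since a larger intercept can be traded
# off against a smaller slope
plt.figure(figsize=(6, 5))
plt.hist2d(w0_samples, w1_samples, bins=30)
plt.xlabel(r'$w_0$')
plt.ylabel(r'$w_1$')
plt.show()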

Now that you have analysed the output of PyMC 'by hand', let's do it the easy way using ArviZ's plot_trace function! Instead of generating a histogram we'll get a smoothed estimate of the posterior distributions. We'll do this first combining samples from the four chains, and then with estimates of the posterior distributions from the four chains separately.

az.plot_trace(idata, combined=True)
az.plot_trace(idata, legend=True)

The traces on the right side are the sampled values used to construct the estimated posterior densities on the left. There are 4 different traces but they are rather hard to distinguish!

We can also get a useful textual summary of the MCMC run:

az.summary(idata, round_to=2)

The $\hat{R}$ quantity (last column) is the Gelman-Rubin statistic which measures (roughly speaking) how similar our 4 Markov chains are. A value close to 1 (say, less than 1.1) is evidence that all chains have converged. In the table above all $\hat{R}$ values are exactly 1, so all is good.

Exercises:

  1. Experiment with changing the prior distributions within the model:

    • How does changing the prior distributions' mean or variance affect the posterior belief?

    • What happens if the uniform distribution prior over $\sigma$ is changed to exclude the true value? (e.g. $p(\sigma) = U(5, 20)$)

  2. Introduce a new parameter, $w_2$, to the data generation code such that $y_i = w_0 + w_1 x_i + w_2 x_i^2 + \epsilon$ and adjust the model to perform inference for this parameter from the data (see the sketch below).
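A sketch of one way to approach the second exercise, reusing the variable names from the code above; the value of true_w2 is just an illustrative choice, not part of the lab:

true_w2 = 3  # hypothetical true value, chosen for illustration
y = true_w0 + true_w1*x + true_w2*x**2 + np.random.normal(scale=true_sigma, size=n)

with pm.Model() as quadratic_model:
    w0 = pm.Normal('w0', mu=0, sigma=20)
    w1 = pm.Normal('w1', mu=0, sigma=20)
    w2 = pm.Normal('w2', mu=0, sigma=20)  # prior over the new parameter
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    y_est = w0 + w1*x + w2*x**2
    likelihood = pm.Normal('y', mu=y_est, sigma=sigma, observed=y)
    idata_quad = pm.sample(num_samples, progressbar=True)

az.summary(idata_quad, round_to=2)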

Wrap up

This lab covered a number of topics so let's recap:

  • First, we looked at a simple example of a graphical model and showed how to go about efficiently generating samples from a distribution that factorises over the graph using ancestral sampling.

  • We demonstrated the power of sampling by using samples to estimate quantities relating to the distribution.

  • We then looked at how MCMC algorithms can generate samples from probability densities that are otherwise difficult to sample from, and implemented the Metropolis algorithm.

  • Finally, we looked at the PyMC python library and how its efficient implementations of MCMC algorithms can be utilised within a Bayesian Linear Regression model.

References

  • COMS30035 Machine Learning lecture notes.

  • Bishop, Pattern Recognition and Machine Learning: Section 3.3 for Bayesian linear regression, Section 8.1 for graphical models and Section 11.2 for MCMC.