Bayesian Estimation on Multiple Groups
Problem Type
The Bayesian estimation model is widely applicable across a number of scenarios. The classical scenario is an experimental design with a control vs. a treatment, where we want to know the difference between the two. Here, "estimation" refers to estimating the "true" value for the control and the "true" value for the treatment, and the "Bayesian" part refers to the computation of the uncertainty surrounding each parameter.
Bayesian estimation's advantages over the classical t-test were first described by John Kruschke (2013).
In this notebook, I provide a concise implementation suitable for two-sample and multi-sample inference, with data that don't necessarily fit Gaussian assumptions.
Data structure
To use this model, the data should be structured as follows:
Each row is one measurement.
The columns should indicate, at the minimum:
What treatment group the sample belonged to.
The measured value.
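For concreteness, here is a minimal sketch of that structure (the isolate labels and measurement values below are made up for illustration):

```python
import pandas as pd

# Hypothetical long-format ("tidy") data: one measurement per row,
# with a column identifying which group each sample belongs to.
data = pd.DataFrame(
    {
        "isolate": ["ATCC", "ATCC", "1", "1", "5", "5"],
        "normalized_measurement": [1.01, 0.98, 1.10, 1.15, 1.52, 1.48],
    }
)
```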
Extensions to the model
As of now, the model only samples the posterior distributions of the measured values. The model may then be extended to compute differences in means (treatment vs. control) or effect sizes, complete with the uncertainty around them. Use pm.Deterministic(...) to ensure that those statistics' posterior distributions, i.e. their uncertainty, are also computed, as sketched below.
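A toy illustration of the pattern (the variable names here are hypothetical, not from the notebook's actual model):

```python
import pymc3 as pm

# Any quantity wrapped in pm.Deterministic gets its full posterior
# recorded in the trace, so its uncertainty comes for free.
with pm.Model():
    mu_control = pm.Normal("mu_control", mu=0.0, sigma=1.0)
    mu_treatment = pm.Normal("mu_treatment", mu=0.0, sigma=1.0)
    pm.Deterministic("difference", mu_treatment - mu_control)
```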
Reporting summarized findings
Here are examples of how to summarize the findings.
Treatment group A was greater than control by x units (95% HPD: [lower, upper]).
Treatment group A was higher than control (effect size 95% HPD: [lower, upper]).
Model Specification
We know that the OD600 and measurements columns are all positive-valued, and so the normalized_measurement column will also be positive-valued. There are two ways to handle this situation:
We can directly model the likelihood using a bounded, positive-support-only distribution, or
We can model the log-transformation of the normalized_measurement column using an unbounded, infinite-support distribution (e.g. the T-distribution family, which includes the Gaussian and the Cauchy).
The former is ever so slightly more convenient to reason about, but the latter lets us use Gaussians, which have some nice properties when sampling.
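Here is a sketch of the second option in pymc3. It assumes a DataFrame `data` with a `normalized_measurement` column and an integer-encoded group column `isolate_enc` (hypothetical names), with group 0 taken to be the control (ATCC) strain; the prior choices are illustrative, not the notebook's exact ones:

```python
import numpy as np
import pymc3 as pm

# Work on the log scale so an infinite-support likelihood is appropriate.
log_y = np.log(data["normalized_measurement"].values)
group_idx = data["isolate_enc"].values
n_groups = len(np.unique(group_idx))

with pm.Model() as model:
    # One mean and spread per group, on the log scale.
    mu = pm.Normal("mu", mu=0.0, sigma=10.0, shape=n_groups)
    sigma = pm.HalfCauchy("sigma", beta=1.0, shape=n_groups)
    # Shared degrees of freedom; large nu approaches a Gaussian.
    nu = pm.Exponential("nu", lam=1.0 / 30.0)
    pm.StudentT(
        "likelihood",
        nu=nu,
        mu=mu[group_idx],
        sigma=sigma[group_idx],
        observed=log_y,
    )
    # Derived quantities from the "Extensions" section above.
    difference = pm.Deterministic("difference", mu[1:] - mu[0])
    # Effect size as in Kruschke (2013): difference over pooled spread.
    effect_size = pm.Deterministic(
        "effect_size",
        difference / pm.math.sqrt((sigma[1:] ** 2 + sigma[0] ** 2) / 2),
    )
```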
Inference Button!
Now, we hit the Inference Button(tm) and sample from the posterior distribution.
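A sketch of that step, assuming the model from the previous section:

```python
with model:
    trace = pm.sample(2000, tune=1000)
```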
Diagnostics
Our first diagnostic will be the trace plot. We expect the trace of the variables that we are most interested in to be a fuzzy caterpillar.
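One way to draw it, using arviz (assuming the trace from above):

```python
import arviz as az

az.plot_trace(trace, var_names=["mu", "sigma"])
```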
Looking at the traces, yes, everything looks more or less like a hairy caterpillar. This means that sampling went well and has converged, so we have a good MCMC approximation of the posterior distribution.
I need a mapping of each isolate to its integer encoding; this will come in handy below.
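A sketch of one way to build that mapping, assuming the isolate column was integer-encoded via a pandas Categorical (hypothetical names):

```python
# Map each integer code back to its isolate label.
isolate_cat = data["isolate"].astype("category")
code_to_isolate = dict(enumerate(isolate_cat.cat.categories))
```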
Let's now plot the posterior distributions. We'll use a ridge plot, as it's both aesthetically pleasing and informative.
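arviz provides one out of the box; a sketch, assuming the trace from above:

```python
import arviz as az

az.plot_forest(trace, var_names=["mu"], kind="ridgeplot", combined=True)
```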
On the basis of this, we would say that strain 5 was the most different from the other strains.
Let's now look at the differences directly.
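Since the differences were recorded with pm.Deterministic in the model sketch above, their posteriors are available directly in the trace:

```python
import arviz as az

# Posterior of each treatment-vs-control difference, with 0 marked
# as a reference value for a quick visual check.
az.plot_posterior(trace, var_names=["difference"], ref_val=0)
```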
If we were in a binary decision-making mode, we would say that isolate 5 was the most "significantly" different from the ATCC strain.
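Finally, we can check the model against the data with a posterior predictive check (PPC). A sketch, assuming the pymc3 model and trace from above:

```python
import arviz as az

# Draw simulated datasets from the posterior and compare them
# against the observed data.
with model:
    ppc = pm.sample_posterior_predictive(trace)
    idata = az.from_pymc3(trace=trace, posterior_predictive=ppc)

az.plot_ppc(idata)
```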
The PPC draws clearly have longer tails than the originals do. I chalk this up to the small number of samples. The central tendency is definitely modelled well, and I don't see wild deviations between the sampled posterior and the measured data.