Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/thinkbayes2
Path: blob/master/examples/dual_log.ipynb
1901 views
Kernel: Python 3 (ipykernel)

Collinearity and Bayesian Regression

A draft response to this question on Reddit.

This notebook is based on Think Bayes 2e, which is available from Bookshop.org and Amazon.

Click here to run this notebook on Colab.

try: import empiricaldist except ImportError: !pip install empiricaldist
# download utils.py from os.path import basename, exists def download(url): filename = basename(url) if not exists(filename): from urllib.request import urlretrieve local, _ = urlretrieve(url, filename) print("Downloaded " + local) download("https://github.com/AllenDowney/ThinkBayes2/raw/master/soln/utils.py")
import numpy as np import pandas as pd import matplotlib.pyplot as plt from utils import decorate plt.rcParams["figure.dpi"] = 75 plt.rcParams["figure.figsize"] = [6, 3.5]

The well-behaved dataset

data = { "t": [ 6.305, 13.511, 20.719, 27.828, 35.038, 42.247, 49.451, 56.658, 63.861, 71.07, 78.275, 85.38, 92.589, 99.806, 107.013, 114.218, 121.431, 128.639, 135.844, 142.952, ], "y": [ 2.73463519e-14, 3.85426599e-14, 3.80828479e-14, 3.16131129e-14, 3.45657229e-14, 4.07274529e-14, 4.06697749e-14, 4.87061129e-14, 5.76658129e-14, 5.52670029e-14, 5.26504439e-14, 4.84613229e-14, 6.77128129e-14, 5.81555409e-14, 5.51561319e-14, 6.93846929e-14, 6.32508629e-14, 5.86075490e-14, 5.60123029e-14, 6.57248479e-14, ], } df = pd.DataFrame(data) df["y"] *= 1e15 df.head()
<IPython.core.display.Javascript object>

The problematic dataset, scaled so y is in fA.

data = { "t": [5.98, 13.183, 20.385, 27.592, 34.801, 41.907, 49.112, 56.323, 63.526, 70.73, 77.944, 85.152, 92.258, 99.464, 106.672, 113.878, 121.091, 128.299, 135.508, 142.713], "y": [5.03036152e-14, 5.56276292e-14, 5.32424602e-14, 3.89444756e-14, 4.69391602e-14, 5.39758602e-14, 5.45844802e-14, 6.32446202e-14, 5.73647722e-14, 6.10179202e-14, 6.95350302e-14, 7.45465402e-14, 6.81008702e-14, 7.72761312e-14, 8.76658102e-14, 9.02147102e-14, 9.10902732e-14, 8.62184102e-14, 8.40578302e-14, 9.40117102e-14] } # Create the DataFrame df = pd.DataFrame(data) df['y'] *= 1e15 df.head()
<IPython.core.display.Javascript object>

Compute the explanatory variables.

df["t32"] = np.log(df["t"] + 32) df["t30"] = np.log(df["t"] + 30) df[["t30", "t32"]].corr()
<IPython.core.display.Javascript object>

The explanatory variables are highly correlated, which is going to be a problem.

import statsmodels.formula.api as smf model = smf.ols(formula="y ~ t32 + t30", data=df).fit() df["pred"] = model.predict(df) model.summary()
<IPython.core.display.Javascript object>

Sure enough, ols complains about the collinearity.

And here's what the fitted model looks like.

plt.scatter(df["t"], df["y"], label="Observed") plt.plot(df["t"], df["pred"], label="Predicted") decorate(xlabel="Time (s)") decorate(ylabel="Intensity (fA)")
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

The up-swoop on the left is non-physical.

Let's see how Bayesian regression handles collinearity.

import pymc as pm import arviz as az with pm.Model() as model: # Priors for coefficients a = pm.Normal("a", mu=0, sigma=100) b = pm.Normal("b", mu=0, sigma=100) c = pm.Normal("c", mu=0, sigma=100) # Model equation mu = a * df["t32"] + b * df["t30"] + c # Observational noise sigma = pm.HalfNormal("sigma", sigma=100) # Likelihood y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=df["y"]) # Sampling idata = pm.sample(500)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [a, b, c, sigma]
Sampling 4 chains for 1_000 tune and 500 draw iterations (4_000 + 2_000 draws total) took 90 seconds. There were 49 divergences after tuning. Increase `target_accept` or reparameterize.
<IPython.core.display.Javascript object>

The sampling process throws some divergences, which makes sense because the high-density part of the probability space is a long, narrow tunnel, so it's easy to spin out.

Nevertheless, the diagnostics look fine.

az.plot_trace(idata) plt.tight_layout()
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

The r_hat values are fine.

The only notable problem is that the standard deviations are super big, but that's because the model borders on being non-identifiable.

az.summary(idata)
<IPython.core.display.Javascript object>

To show what the issue is, let's look at the joint distribution of a and b.

a_samples = idata.posterior["a"].values.flatten() b_samples = idata.posterior["b"].values.flatten() c_samples = idata.posterior["c"].values.flatten() plt.hist2d(a_samples, b_samples, bins=50, cmap="Blues", density=True) plt.xlabel("a (Coefficient for ln(t+32))") plt.ylabel("b (Coefficient for ln(t+30))") plt.title("Joint Distribution of a and b") plt.tight_layout()
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

They are highly correlated, which indicates that we can't be confident about the actual values of a and b, but we can be confident about their sum -- and that's all we need.

Here's the posterior distribution of the sum, with a much more reasonable range of values.

diff_samples = a_samples + b_samples az.plot_posterior(diff_samples) plt.tight_layout()
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

Instead of looking at the parameters, let's look at the predictions. The following figure shows the model predictions for 50 of the samples.

ts = np.linspace(0, df["t"].max()) ln_t_plus_32 = np.log(ts + 32) ln_t_plus_30 = np.log(ts + 30) plt.scatter(df["t"], df["y"], label="Observed Data") n = len(a_samples) for i in range(0, n, n // 50): y_pred = a_samples[i] * ln_t_plus_32 + b_samples[i] * ln_t_plus_30 + c_samples[i] plt.plot(ts, y_pred, alpha=0.1)
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

None of them have the non-physical upswoop.

We can use the sampled parameters to compute the distribution of the intercept at t=0.

intercept_samples = a_samples * np.log(32) + b_samples * np.log(30) + c_samples az.plot_posterior(intercept_samples) plt.tight_layout()
Image in a Jupyter notebook
<IPython.core.display.Javascript object>

Based on the experimental design and the data, that might be the best estimate we can make.

As an aside, since this model only has three parameters, we could use a grid algorithm to approximate the posterior distribution quickly and deterministically.