GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/IC50_synthetic-bayesian.ipynb
Kernel: bayesian
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import pandas as pd
import theano.tensor as tt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Problem Type

IC50 determination is a common problem for experimental biochemists. The goal is to determine the concentration of an inhibitor at which a protein has half the activity it has without the inhibitor.

Classically, this is done by fitting a logistic function in which the IC50 appears as a parameter; we then estimate the IC50 from the data at hand.
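As a minimal sketch of that functional form (this mirrors the decreasing logistic used to simulate data below; the helper name logistic_activity is just for illustration):

import numpy as np

def logistic_activity(conc, beta, ic50):
    """Activity as a function of inhibitor concentration.

    At conc == ic50, the value is beta / 2, i.e. half of the
    uninhibited activity level, which is what defines the IC50.
    """
    return beta / (1 + np.exp(conc - ic50))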

The implementation here allows for multi-drug IC50 determination. It also shows that, even with noisy data and non-uniform coverage of the drug concentration space, we can still arrive at fairly precise IC50 estimates as long as the data are dense enough around the true IC50 values. The simulated data below assume that a first-pass experiment was done over a wide range of concentrations, followed by a second pass in which selected concentration ranges were filled in at higher resolution.

Data structure

To use this model, the data should be structured as follows (see the sketch after this list):

  • Each row is one measurement for one concentration of drug.

  • The columns should indicate, at the minimum:

    • What treatment group the sample belonged to.

    • The measured value.
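For illustration only (hypothetical values, not the simulated data used below), a few rows in this layout might look like the following; the column names mirror the ones constructed later in this notebook:

import pandas as pd

example = pd.DataFrame(
    {
        "drug": ["A", "A", "B", "B"],
        "concentrations": [0.0, 10.0, 0.0, 10.0],
        "measurements": [0.98, 0.51, 1.01, 0.24],
    }
)
example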

Extensions to the model

None

Reporting summarized findings

Here are examples of how to summarize the findings.

The IC50 of drug X was mean (95% HPD: [lower, upper]).
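A sketch of how those numbers could be pulled from the trace sampled below, assuming an older PyMC3 release in which pm.hpd with an alpha argument is available (newer versions expose HPD/HDI computation through ArviZ instead):

burned = trace[5000:]
ic50_samples = burned["IC50"]           # shape: (n_samples, n_drugs)
means = ic50_samples.mean(axis=0)
hpd = pm.hpd(ic50_samples, alpha=0.05)  # 95% HPD interval per drug
for drug, (mean, (lower, upper)) in enumerate(zip(means, hpd)):
    print(f"The IC50 of drug {drug} was {mean:.1f} (95% HPD: [{lower:.1f}, {upper:.1f}]).")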

Other notes

None

# Generate some fake data: a coarse first-pass concentration series plus
# finer second-pass series around each of the true IC50 values.
x_concs = np.concatenate(
    [
        np.arange(0, 100, 10),
        np.arange(5, 21, 1),
        np.arange(40, 51, 1),
        np.arange(85, 96, 1),
    ]
).reshape(-1, 1)
log_xconcs = np.log(x_concs)  # not used below; note that log(0) gives -inf

# One true IC50 per drug, shaped as a row vector so it broadcasts against x_concs.
ic50_true = np.array([42, 13, 88])
ic50_true = ic50_true.reshape(-1, ic50_true.shape[0])
beta_true = 1
slope_true = 1
intercept_true = 150

# y_true = slope_true * x_concs + intercept_true
y_true = beta_true / (1 + np.exp(x_concs - ic50_true))
y_noisy = y_true + np.random.normal(0, 0.15, size=y_true.shape)  # homoskedastic error
y_noisy[0:5]
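A quick shape check, added here as a sanity-check sketch: x_concs is a column vector and ic50_true is a row vector, so NumPy broadcasting yields one response column per drug.

# Expected output: (48, 1) (1, 3) (48, 3)
print(x_concs.shape, ic50_true.shape, y_true.shape)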
plt.scatter(x_concs, y_noisy[:, 0])
plt.scatter(x_concs, y_noisy[:, 1])
plt.scatter(x_concs, y_noisy[:, 2])

Note how the coverage is sparse in the regions away from the true IC50s.

concentrations = np.concatenate([x_concs] * ic50_true.shape[1])
concentrations[::10]  # print every 10th
concentrations.shape
y_noisy.flatten(order="F").shape
x_concs.shape[0]
# Build a tidy table: one row per (drug, concentration) measurement.
data = pd.DataFrame()
data["concentrations"] = concentrations.reshape(concentrations.shape[0],)
data["measurements"] = y_noisy.flatten(order="F")

drugs = []
for i in range(ic50_true.shape[1]):
    drugs.extend([i] * x_concs.shape[0])
data["drug"] = drugs
data = pm.floatX(data)

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Encode the drug labels as integer indices for use in the model.
le = LabelEncoder()
data["idxs"] = le.fit_transform(data["drug"]).astype("int32")

# Normalize data['measurements'] to 0-1
# mms = MinMaxScaler()
# data['measurements'] = mms.fit_transform(data['measurements'].values.reshape(-1, 1))
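As a quick sanity check (a small sketch, not part of the original notebook), each drug index should account for one full concentration series:

# Expect 48 rows per drug (one per concentration in x_concs).
data.groupby("drug").size()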
data.head().dtypes
with pm.Model() as model:
    datasize = len(set(data["idxs"]))  # number of drugs

    # Very wide priors, one parameter per drug.
    beta = pm.Normal("beta", mu=0, sd=100 ** 2, shape=datasize)
    noise = pm.HalfCauchy("noise", beta=100 ** 2, shape=datasize)
    ic50 = pm.Normal("IC50", sd=100 ** 2, shape=datasize)

    # Decreasing logistic curve, indexed per drug.
    measurements = pm.Deterministic(
        "measurements",
        beta[data["idxs"].values]
        / (1 + tt.exp(data["concentrations"].values - ic50[data["idxs"].values])),
    )

    y_like = pm.Normal(
        "y_like",
        mu=measurements,
        sd=noise[data["idxs"].values],
        observed=data["measurements"],
    )
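Optionally, if the graphviz package is installed and the PyMC3 version in use ships model_to_graphviz, the model structure can be rendered for a quick visual check:

pm.model_to_graphviz(model)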
with model:
    trace = pm.sample(draws=10000, step=pm.Metropolis(), start=pm.find_MAP())
    # trace = pm.sample(draws=2000)  # ADVI init is fast, but NUTS sampling is slow later on.
pm.traceplot(trace[5000:])

Note that the posterior distributions of the IC50s are fairly tight.

pm.summary(trace[5000:])
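Since the data are synthetic, a final check (a small sketch added here) is to compare the posterior IC50 means against the known true values:

# Burn in the first 5000 draws, then average the IC50 samples per drug.
posterior_means = trace[5000:]["IC50"].mean(axis=0)
print("True IC50s:      ", ic50_true.ravel())
print("Posterior means: ", posterior_means)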