GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/survival-analysis.ipynb
Kernel: bayesian

Introduction

In this notebook, I want to reproduce (and more importantly, annotate) what Austin Rochford did in his Bayesian Survival Analysis notebook on his own blog.

Setup

We will analyze a dataset of breast cancer patients who have undergone mastectomy.

import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
from statsmodels import datasets
import seaborn as sns
import theano.tensor as T

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

First off, we load in the dataset.

df = datasets.get_rdataset("mastectomy", "HSAUR", cache=True).data
df.sample(5)

Inspecting the data, we see that:

  • Each row is one patient.

  • For each patient, we record:

    • Time elapsed since mastectomy (time column)

    • Whether they have died (event column)

    • Whether a metastasis has occurred (metastized column)

In order to get the data into shape for computing, we will need to do some data preprocessing.

  • Convert event to 1/0, where 1 maps to True.

  • Convert metastasized to 1/0, where 1 maps to yes.

df["event"] = df["event"].astype(int)
df["metastized"] = df["metastized"].replace("yes", 1).replace("no", 0).astype(int)
df.head(5)

From Austin's blog post:

A suitable prior on $\lambda_0(t)$ is less obvious. We choose a semiparametric prior, where $\lambda_0(t)$ is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints $0 \le s_1 < s_2 < \cdots < s_N$. With this partition, $\lambda_0(t) = \lambda_j$ if $s_j \le t < s_{j+1}$. With $\lambda_0(t)$ constrained to have this form, all we need to do is choose priors for the $N - 1$ values $\lambda_j$. We use independent vague priors $\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2})$. For our mastectomy example, we make each interval three months long.
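To make the piecewise-constant idea concrete, here is a small standalone sketch (not part of the original notebook) that evaluates such a baseline hazard at an arbitrary time. The interval bounds and rate values are made up for illustration:

```python
import numpy as np

# Piecewise-constant baseline hazard: lambda0(t) = lambda_j for s_j <= t < s_{j+1}.
interval_length = 3
interval_bounds = np.arange(0, 13, interval_length)  # endpoints: 0, 3, 6, 9, 12
lambdas = np.array([0.10, 0.15, 0.12, 0.20])         # made-up rate per interval

def lambda0(t):
    """Return lambda_j for the interval [s_j, s_{j+1}) containing t."""
    j = np.searchsorted(interval_bounds, t, side="right") - 1
    return lambdas[j]

print(lambda0(0.0))   # first interval -> 0.10
print(lambda0(4.5))   # second interval -> 0.15
print(lambda0(11.9))  # last interval -> 0.20
```

The model below places an independent Gamma prior on each of these interval-level rates.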

interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
intervals
n_patients = len(df)
patients = np.arange(n_patients)
last_period = np.floor((df.time - 0.01) / interval_length)

death = np.zeros((n_patients, n_intervals))
death[patients.astype(int), last_period.values.astype(int)] = df.event.values

sns.heatmap(death)

Aha! This matrix tells us which patients have died and when. Each row is one patient, each column is a time period, and a 1 (white) indicates that a patient died in that time period.
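To see how this indicator matrix is built, here is a miniature version with three made-up patients (not the real data), following the same construction as above:

```python
import numpy as np

interval_length = 3
times = np.array([5.0, 9.0, 2.0])   # follow-up time for each patient (made up)
events = np.array([1, 0, 1])        # 1 = died, 0 = censored
n_intervals = 4

# Last interval each patient was observed in. The small offset keeps a time that
# lands exactly on an interval boundary inside the preceding interval.
last_period = np.floor((times - 0.01) / interval_length).astype(int)

death = np.zeros((len(times), n_intervals))
death[np.arange(len(times)), last_period] = events
print(death)
```

Patient 0 died in interval 1, patient 2 died in interval 0, and patient 1 was censored, so their row stays all zeros.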

exposure = (
    np.greater_equal.outer(df.time.values, interval_bounds[:-1]) * interval_length
)
exposure[patients, last_period.values.astype(int)] = (
    df.time - interval_bounds[last_period.values.astype(int)]
)

sns.heatmap(exposure)

This heatmap shows each patient's exposure: the amount of time in each interval during which the patient was at risk of dying. A patient contributes exposure until the time of death (for the non-censored patients) or until the end of their follow-up (for the censored patients).
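For a single made-up patient, the exposure calculation above reduces to the following (the same two steps as the vectorized version):

```python
import numpy as np

# One patient followed for 7 months, with 3-month intervals bounded at 0, 3, 6, 9.
interval_length = 3
interval_bounds = np.array([0, 3, 6, 9])
t = 7.0

# Full exposure in every interval the patient entered...
exposure = (np.greater_equal(t, interval_bounds[:-1]) * interval_length).astype(float)

# ...then truncate the last interval to the time actually observed.
last = int(np.floor((t - 0.01) / interval_length))
exposure[last] = t - interval_bounds[last]

print(exposure)  # [3. 3. 1.]
```

The patient spends the full 3 months in the first two intervals, but only 1 month of the third interval before follow-up ends.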

with pm.Model() as model:
    lambda0 = pm.Gamma("lambda0", 0.01, 0.01, shape=n_intervals)

    sigma = pm.Uniform("sigma", 0.0, 10.0)
    tau = pm.Deterministic("tau", sigma ** -2)
    mu_beta = pm.Normal("mu_beta", 0.0, 10 ** -2)
    beta = pm.Normal("beta", mu_beta, tau)

    lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic("mu", exposure * lambda_)

    obs = pm.Poisson("obs", mu, observed=death)
with model:
    trace = pm.sample(2000, njobs=1)
pm.traceplot(trace, varnames=["beta"])
base_hazard = trace["lambda0"]
met_hazard = trace["lambda0"] * np.exp(np.atleast_2d(trace["beta"]).T)
def cum_hazard(hazard):
    """Cumulative hazard: the cumulative sum of interval_length * hazard
    over the time intervals."""
    return (interval_length * hazard).cumsum(axis=-1)


def survival(hazard):
    """Survival: the exponential of the negative cumulative hazard."""
    return np.exp(-cum_hazard(hazard))
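As a quick numeric sanity check, here is what these two functions give for a made-up constant hazard of 0.05 per month over four 3-month intervals (the functions are restated so the snippet runs on its own):

```python
import numpy as np

interval_length = 3

def cum_hazard(hazard):
    """Cumulative sum of interval_length * hazard over the time intervals."""
    return (interval_length * hazard).cumsum(axis=-1)

def survival(hazard):
    """Exponential of the negative cumulative hazard."""
    return np.exp(-cum_hazard(hazard))

hazard = np.array([0.05, 0.05, 0.05, 0.05])  # made-up constant hazard
print(cum_hazard(hazard))  # 0.15, 0.30, 0.45, 0.60
print(survival(hazard))    # exp(-0.15), exp(-0.30), exp(-0.45), exp(-0.60)
```

With a constant hazard the cumulative hazard grows linearly, so the survival curve decays exponentially, as expected.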
sns.heatmap(cum_hazard(met_hazard).T)
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma("lambda0", 0.01, 0.01, shape=n_intervals)
    # The key is here: if we want a parameter to vary with time, then making it
    # follow a Gaussian random walk is one way of doing so.
    beta = pm.GaussianRandomWalk("beta", tau=1.0, shape=n_intervals)

    lambda_ = pm.Deterministic(
        "h", lambda0 * T.exp(T.outer(T.constant(df.metastized), beta))
    )
    mu = pm.Deterministic("mu", exposure * lambda_)

    obs = pm.Poisson("obs", mu, observed=death)
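To build intuition for why a Gaussian random walk gives a time-varying coefficient, here is a pure-numpy simulation of the prior (a sketch, not the PyMC3 model itself): each draw is one candidate path for beta across the intervals, where consecutive values differ by an independent Normal step.

```python
import numpy as np

# Gaussian random walk: beta_j = beta_{j-1} + eps_j, with eps_j ~ Normal(0, 1)
# (sigma = 1 corresponds to tau = 1 in the model above).
rng = np.random.default_rng(42)
n_intervals = 10

eps = rng.normal(0.0, 1.0, size=n_intervals)
beta = np.cumsum(eps)  # one sampled coefficient path over the intervals

# Consecutive differences recover the independent increments:
print(np.allclose(np.diff(beta), eps[1:]))  # True
```

Because each step is a small Normal perturbation of the previous value, the prior favors coefficient paths that change smoothly from one interval to the next rather than jumping arbitrarily.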
with time_varying_model:
    trace_varying = pm.sample(2000, njobs=1)

NUTS sampling of both of these models has been really slow. I'm not quite sure how to debug this.

pm.traceplot(trace_varying, varnames=["beta"])
trace_varying["beta"].shape
fig = plt.figure()
ax = fig.add_subplot(111)

beta_hpd = np.percentile(trace_varying["beta"], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]

ax.fill_between(interval_bounds[:-1], beta_low, beta_high, color="blue", alpha=0.25)
beta_hat = trace_varying["beta"].mean(axis=0)
ax.step(interval_bounds[:-1], beta_hat, color="blue")