GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/survival-analysis.ipynb
Kernel: bayesian

Introduction

In this notebook, I want to reproduce (and more importantly, annotate) what Austin Rochford did in his Bayesian Survival Analysis notebook on his own blog.

Setup

We will analyze a dataset of breast cancer patients who have undergone mastectomy.

import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
from statsmodels import datasets
import seaborn as sns
import theano.tensor as T

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

First off, we load in the dataset.

df = datasets.get_rdataset("mastectomy", "HSAUR", cache=True).data
df.sample(5)

Inspecting the data, we see that:

  • Each row is one patient.

  • For each patient, we record:

    • Time elapsed since mastectomy (time column)

    • Whether they have died (event column)

    • Whether a metastasis has occurred (metastized column)

In order to get the data into shape for computing, we will need to do some data preprocessing.

  • Convert event to 1/0, where 1 maps to True.

  • Convert metastasized to 1/0, where 1 maps to yes.

df["event"] = df["event"].astype(int)
df["metastized"] = df["metastized"].replace("yes", 1).replace("no", 0).astype(int)
df.head(5)

From Austin's blog post:

A suitable prior on $\lambda_0(t)$ is less obvious. We choose a semiparametric prior, where $\lambda_0(t)$ is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints $0 \le s_1 < s_2 < \cdots < s_N$. With this partition, $\lambda_0(t) = \lambda_j$ if $s_j \le t < s_{j+1}$. With $\lambda_0(t)$ constrained to have this form, all we need to do is choose priors for the $N - 1$ values $\lambda_j$. We use independent vague priors $\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2})$. For our mastectomy example, we make each interval three months long.
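To make the piecewise-constant idea concrete, here is a small standalone sketch (not part of the original notebook) that evaluates such a baseline hazard at an arbitrary time. The interval bounds and rate values are made up for illustration:

```python
import numpy as np

# Piecewise-constant baseline hazard: lambda0(t) = lambda_j for s_j <= t < s_{j+1}.
interval_length = 3
interval_bounds = np.arange(0, 13, interval_length)  # endpoints: 0, 3, 6, 9, 12
lambdas = np.array([0.10, 0.15, 0.12, 0.20])         # made-up rate per interval

def lambda0(t):
    """Return lambda_j for the interval [s_j, s_{j+1}) containing t."""
    j = np.searchsorted(interval_bounds, t, side="right") - 1
    return lambdas[j]

print(lambda0(0.0))   # first interval -> 0.10
print(lambda0(4.5))   # second interval -> 0.15
print(lambda0(11.9))  # last interval -> 0.20
```

The model below places an independent Gamma prior on each of these interval-level rates.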

interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
intervals
n_patients = len(df)
patients = np.arange(n_patients)
last_period = np.floor((df.time - 0.01) / interval_length)

death = np.zeros((n_patients, n_intervals))
death[patients.astype(int), last_period.values.astype(int)] = df.event.values

sns.heatmap(death)

Aha! This matrix tells us which patients have died and when. Each row is one patient, each column is a time period, and a 1 (white) indicates that a patient died in that time period.
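To see how this indicator matrix is built, here is a miniature version with three made-up patients (not the real data), following the same construction as above:

```python
import numpy as np

interval_length = 3
times = np.array([5.0, 9.0, 2.0])   # follow-up time for each patient (made up)
events = np.array([1, 0, 1])        # 1 = died, 0 = censored
n_intervals = 4

# Last interval each patient was observed in. The small offset keeps a time that
# lands exactly on an interval boundary inside the preceding interval.
last_period = np.floor((times - 0.01) / interval_length).astype(int)

death = np.zeros((len(times), n_intervals))
death[np.arange(len(times)), last_period] = events
print(death)
```

Patient 0 died in interval 1, patient 2 died in interval 0, and patient 1 was censored, so their row stays all zeros.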

exposure = (
    np.greater_equal.outer(df.time.values, interval_bounds[:-1]) * interval_length
)
exposure[patients, last_period.values.astype(int)] = (
    df.time - interval_bounds[last_period.values.astype(int)]
)

sns.heatmap(exposure)

This heatmap shows each patient's exposure: the amount of time in each interval during which the patient was at risk of dying. A patient contributes exposure until the time of death (for the non-censored patients) or until the end of their follow-up (for the censored patients).
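For a single made-up patient, the exposure calculation above reduces to the following (the same two steps as the vectorized version):

```python
import numpy as np

# One patient followed for 7 months, with 3-month intervals bounded at 0, 3, 6, 9.
interval_length = 3
interval_bounds = np.array([0, 3, 6, 9])
t = 7.0

# Full exposure in every interval the patient entered...
exposure = (np.greater_equal(t, interval_bounds[:-1]) * interval_length).astype(float)

# ...then truncate the last interval to the time actually observed.
last = int(np.floor((t - 0.01) / interval_length))
exposure[last] = t - interval_bounds[last]

print(exposure)  # [3. 3. 1.]
```

The patient spends the full 3 months in the first two intervals, but only 1 month of the third interval before follow-up ends.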

with pm.Model() as model:
    lambda0 = pm.Gamma("lambda0", 0.01, 0.01, shape=n_intervals)

    sigma = pm.Uniform("sigma", 0.0, 10.0)
    tau = pm.Deterministic("tau", sigma ** -2)
    mu_beta = pm.Normal("mu_beta", 0.0, 10 ** -2)
    beta = pm.Normal("beta", mu_beta, tau)

    lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic("mu", exposure * lambda_)

    obs = pm.Poisson("obs", mu, observed=death)
with model:
    trace = pm.sample(2000, njobs=1)
pm.traceplot(trace, varnames=["beta"])
base_hazard = trace["lambda0"]
met_hazard = trace["lambda0"] * np.exp(np.atleast_2d(trace["beta"]).T)
def cum_hazard(hazard):
    """Cumulative hazard: the cumulative sum of interval_length * hazard
    over the time intervals."""
    return (interval_length * hazard).cumsum(axis=-1)


def survival(hazard):
    """Survival: the exponential of the negative cumulative hazard."""
    return np.exp(-cum_hazard(hazard))
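As a quick numeric sanity check, here is what these two functions give for a made-up constant hazard of 0.05 per month over four 3-month intervals (the functions are restated so the snippet runs on its own):

```python
import numpy as np

interval_length = 3

def cum_hazard(hazard):
    """Cumulative sum of interval_length * hazard over the time intervals."""
    return (interval_length * hazard).cumsum(axis=-1)

def survival(hazard):
    """Exponential of the negative cumulative hazard."""
    return np.exp(-cum_hazard(hazard))

hazard = np.array([0.05, 0.05, 0.05, 0.05])  # made-up constant hazard
print(cum_hazard(hazard))  # 0.15, 0.30, 0.45, 0.60
print(survival(hazard))    # exp(-0.15), exp(-0.30), exp(-0.45), exp(-0.60)
```

With a constant hazard the cumulative hazard grows linearly, so the survival curve decays exponentially, as expected.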
sns.heatmap(cum_hazard(met_hazard).T)
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma("lambda0", 0.01, 0.01, shape=n_intervals)
    # The key is here: if we want a parameter to vary with time, then making it
    # follow a Gaussian random walk is one way of doing so.
    beta = pm.GaussianRandomWalk("beta", tau=1.0, shape=n_intervals)

    lambda_ = pm.Deterministic(
        "h", lambda0 * T.exp(T.outer(T.constant(df.metastized), beta))
    )
    mu = pm.Deterministic("mu", exposure * lambda_)

    obs = pm.Poisson("obs", mu, observed=death)
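To build intuition for why a Gaussian random walk gives a time-varying coefficient, here is a pure-numpy simulation of the prior (a sketch, not the PyMC3 model itself): each draw is one candidate path for beta across the intervals, where consecutive values differ by an independent Normal step.

```python
import numpy as np

# Gaussian random walk: beta_j = beta_{j-1} + eps_j, with eps_j ~ Normal(0, 1)
# (sigma = 1 corresponds to tau = 1 in the model above).
rng = np.random.default_rng(42)
n_intervals = 10

eps = rng.normal(0.0, 1.0, size=n_intervals)
beta = np.cumsum(eps)  # one sampled coefficient path over the intervals

# Consecutive differences recover the independent increments:
print(np.allclose(np.diff(beta), eps[1:]))  # True
```

Because each step is a small Normal perturbation of the previous value, the prior favors coefficient paths that change smoothly from one interval to the next rather than jumping arbitrarily.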
with time_varying_model:
    trace_varying = pm.sample(2000, njobs=1)

NUTS sampling of both of these models has been really slow. I'm not quite sure how to debug this.

pm.traceplot(trace_varying, varnames=["beta"])
trace_varying["beta"].shape
fig = plt.figure()
ax = fig.add_subplot(111)

beta_hpd = np.percentile(trace_varying["beta"], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]

ax.fill_between(interval_bounds[:-1], beta_low, beta_high, color="blue", alpha=0.25)
beta_hat = trace_varying["beta"].mean(axis=0)
ax.step(interval_bounds[:-1], beta_hat, color="blue")