Eight Schools

Licensed under the Apache License, Version 2.0 (the "License");

#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

The eight schools problem (Rubin 1981) considers the effectiveness of SAT coaching programs conducted in parallel at eight schools. It has become a classic problem (Bayesian Data Analysis, Stan) that illustrates the usefulness of hierarchical modeling for sharing information between exchangeable groups.

The implementation below is an adaptation of an Edward 1.0 tutorial.

Imports

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow.compat.v2 as tf
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import warnings

tf.enable_v2_behavior()
plt.style.use("ggplot")
warnings.filterwarnings('ignore')

The Data

From Bayesian Data Analysis, section 5.5 (Gelman et al. 2013):

A study was performed for the Educational Testing Service to analyze the effects of special coaching programs for SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be very successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.

For each of the eight schools ($J = 8$), we have an estimated treatment effect $y_j$ and a standard error of the effect estimate $\sigma_j$. The treatment effects in the study were obtained by a linear regression on the treatment group using PSAT-M and PSAT-V scores as control variables. As there was no prior belief that any of the schools were more or less similar or that any of the coaching programs would be more effective, we can consider the treatment effects as exchangeable.

num_schools = 8  # number of schools
treatment_effects = np.array(
    [28, 8, -3, 7, -1, 1, 18, 12], dtype=np.float32)  # treatment effects
treatment_stddevs = np.array(
    [15, 10, 16, 11, 9, 11, 10, 18], dtype=np.float32)  # treatment SE

fig, ax = plt.subplots()
plt.bar(range(num_schools), treatment_effects, yerr=treatment_stddevs)
plt.title("8 Schools treatment effects")
plt.xlabel("School")
plt.ylabel("Treatment effect")
fig.set_size_inches(10, 8)
plt.show()
[Figure: "8 Schools treatment effects" — bar chart of per-school effects with standard-error bars]

Model

To capture the data, we use a hierarchical normal model. It follows the generative process,

\begin{align*}
\mu &\sim \text{Normal}(\text{loc}{=}0,\ \text{scale}{=}10) \\
\log\tau &\sim \text{Normal}(\text{loc}{=}5,\ \text{scale}{=}1) \\
\text{for } & i=1\ldots 8:\\
& \theta_i \sim \text{Normal}\left(\text{loc}{=}\mu,\ \text{scale}{=}\tau \right) \\
& y_i \sim \text{Normal}\left(\text{loc}{=}\theta_i,\ \text{scale}{=}\sigma_i \right)
\end{align*}

where $\mu$ represents the prior average treatment effect and $\tau$ controls how much variance there is between schools. The $y_i$ and $\sigma_i$ are observed. As $\tau \rightarrow \infty$, the model approaches the no-pooling model, i.e., each school's treatment effect estimate is allowed to be more independent. As $\tau \rightarrow 0$, the model approaches the complete-pooling model, i.e., all of the school treatment effects are pulled toward the group average $\mu$. To restrict the standard deviation to be positive, we draw $\tau$ from a lognormal distribution (which is equivalent to drawing $\log(\tau)$ from a normal distribution).
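As a quick numerical sanity check of that equivalence (this snippet is our addition, not part of the original notebook), exponentiating draws of $\log(\tau)$ matches sampling $\tau$ directly from tfd.LogNormal with the same parameters:

# Illustrative check: tfd.LogNormal is parameterized by the loc/scale of the
# underlying normal, so both estimates should be close to the analytic mean
# exp(5 + 1/2) ~= 244.7.
tau_via_exp = tf.exp(tfd.Normal(loc=5., scale=1.).sample(100000))
tau_direct = tfd.LogNormal(loc=5., scale=1.).sample(100000)
print(np.mean(tau_via_exp), np.mean(tau_direct))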

Following Diagnosing Biased Inference with Divergences, we transform the model above into an equivalent non-centered model:

\begin{align*}
\mu &\sim \text{Normal}(\text{loc}{=}0,\ \text{scale}{=}10) \\
\log\tau &\sim \text{Normal}(\text{loc}{=}5,\ \text{scale}{=}1) \\
\text{for } & i=1\ldots 8:\\
& \theta_i' \sim \text{Normal}\left(\text{loc}{=}0,\ \text{scale}{=}1 \right) \\
& \theta_i = \mu + \tau \theta_i' \\
& y_i \sim \text{Normal}\left(\text{loc}{=}\theta_i,\ \text{scale}{=}\sigma_i \right)
\end{align*}
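To see concretely that the two parameterizations describe the same distribution, here is a minimal check (our addition; the values of $\mu$ and $\tau$ are arbitrary):

# Non-centered draws mu + tau * theta', theta' ~ Normal(0, 1), should match
# centered draws theta ~ Normal(mu, tau) in mean and stddev (here ~5 and ~3).
mu_demo, tau_demo = 5., 3.
theta_noncentered = (
    mu_demo + tau_demo * tfd.Normal(loc=0., scale=1.).sample(100000))
theta_centered = tfd.Normal(loc=mu_demo, scale=tau_demo).sample(100000)
print(np.mean(theta_noncentered), np.std(theta_noncentered))
print(np.mean(theta_centered), np.std(theta_centered))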

We reify this model as a JointDistributionSequential instance:

model = tfd.JointDistributionSequential([
  tfd.Normal(loc=0., scale=10., name="avg_effect"),  # `mu` above
  tfd.Normal(loc=5., scale=1., name="avg_stddev"),  # `log(tau)` above
  tfd.Independent(tfd.Normal(loc=tf.zeros(num_schools),
                             scale=tf.ones(num_schools),
                             name="school_effects_standard"),  # `theta_prime`
                  reinterpreted_batch_ndims=1),
  lambda school_effects_standard, avg_stddev, avg_effect: (
      tfd.Independent(tfd.Normal(loc=(avg_effect[..., tf.newaxis] +
                                      tf.exp(avg_stddev[..., tf.newaxis]) *
                                      school_effects_standard),  # `theta` above
                                 scale=treatment_stddevs),
                      name="treatment_effects",  # `y` above
                      reinterpreted_batch_ndims=1))
])

def target_log_prob_fn(avg_effect, avg_stddev, school_effects_standard):
  """Unnormalized target density as a function of states."""
  return model.log_prob((
      avg_effect, avg_stddev, school_effects_standard, treatment_effects))
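Before running MCMC, it can be worth confirming that the joint distribution and the target density have consistent shapes. A quick check along these lines (our addition):

# Draw once from the prior; `model.sample()` returns one tensor per component.
avg_effect_s, avg_stddev_s, school_effects_standard_s, _ = model.sample()
# Evaluating the target density on the prior draw should yield a finite scalar.
print(target_log_prob_fn(avg_effect_s, avg_stddev_s, school_effects_standard_s))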

Bayesian Inference

Given data, we perform Hamiltonian Monte Carlo (HMC) to calculate the posterior distribution over the model's parameters.

num_results = 5000
num_burnin_steps = 3000

# Improve performance by tracing the sampler using `tf.function`
# and compiling it using XLA.
@tf.function(autograph=False, jit_compile=True)
def do_sampling():
  return tfp.mcmc.sample_chain(
      num_results=num_results,
      num_burnin_steps=num_burnin_steps,
      current_state=[
          tf.zeros([], name='init_avg_effect'),
          tf.zeros([], name='init_avg_stddev'),
          tf.ones([num_schools], name='init_school_effects_standard'),
      ],
      kernel=tfp.mcmc.HamiltonianMonteCarlo(
          target_log_prob_fn=target_log_prob_fn,
          step_size=0.4,
          num_leapfrog_steps=3))

states, kernel_results = do_sampling()

avg_effect, avg_stddev, school_effects_standard = states

school_effects_samples = (
    avg_effect[:, np.newaxis] +
    np.exp(avg_stddev)[:, np.newaxis] * school_effects_standard)

num_accepted = np.sum(kernel_results.is_accepted)
print('Acceptance rate: {}'.format(num_accepted / num_results))
Acceptance rate: 0.5974
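An acceptance rate around 0.6 is serviceable for HMC, but the step size above was hand-tuned. As a possible refinement (a sketch on our part, not used in the results below), TFP can adapt the step size during burn-in:

# Wrap the same HMC kernel in SimpleStepSizeAdaptation, which tunes
# `step_size` toward a target acceptance probability during burn-in.
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        step_size=0.4,
        num_leapfrog_steps=3),
    num_adaptation_steps=int(0.8 * num_burnin_steps),
    target_accept_prob=0.75)
# Passing `kernel=adaptive_kernel` to `tfp.mcmc.sample_chain` would then
# replace the fixed-step kernel above.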
fig, axes = plt.subplots(8, 2, sharex='col', sharey='col')
fig.set_size_inches(12, 10)
for i in range(num_schools):
  axes[i][0].plot(school_effects_samples[:,i].numpy())
  axes[i][0].title.set_text("School {} treatment effect chain".format(i))
  sns.kdeplot(school_effects_samples[:,i].numpy(), ax=axes[i][1], shade=True)
  axes[i][1].title.set_text("School {} treatment effect distribution".format(i))
axes[num_schools - 1][0].set_xlabel("Iteration")
axes[num_schools - 1][1].set_xlabel("School effect")
fig.tight_layout()
plt.show()
[Figure: per-school trace plots ("School i treatment effect chain") and kernel density estimates ("School i treatment effect distribution")]
print("E[avg_effect] = {}".format(np.mean(avg_effect))) print("E[avg_stddev] = {}".format(np.mean(avg_stddev))) print("E[school_effects_standard] =") print(np.mean(school_effects_standard[:, ])) print("E[school_effects] =") print(np.mean(school_effects_samples[:, ], axis=0))
E[avg_effect] = 5.57183933258
E[avg_stddev] = 2.47738981247
E[school_effects_standard] =
0.08509017
E[school_effects] =
[15.0051 7.103311 2.4552586 6.2744603 1.3364682 3.1125953 12.762501 7.743602]
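Posterior means alone say nothing about how well the chain mixed; one standard diagnostic (our addition) is the effective sample size, which discounts the raw draw count for autocorrelation:

# Effective sample size per latent state; values far below `num_results`
# indicate strongly autocorrelated chains.
ess = tfp.mcmc.effective_sample_size(states)
for name, e in zip(
    ['avg_effect', 'avg_stddev', 'school_effects_standard'], ess):
  print(name, e.numpy())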
# Compute the 95% interval for school_effects
school_effects_low = np.array([
    np.percentile(school_effects_samples[:, i], 2.5)
    for i in range(num_schools)
])
school_effects_med = np.array([
    np.percentile(school_effects_samples[:, i], 50)
    for i in range(num_schools)
])
school_effects_hi = np.array([
    np.percentile(school_effects_samples[:, i], 97.5)
    for i in range(num_schools)
])
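Equivalently (our addition), np.percentile accepts a sequence of percentiles and an axis, which collapses the three comprehensions into one call:

# One vectorized call; the rows of the result are the 2.5th, 50th and 97.5th
# percentiles, each of shape [num_schools].
school_effects_low, school_effects_med, school_effects_hi = np.percentile(
    school_effects_samples, [2.5, 50, 97.5], axis=0)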
fig, ax = plt.subplots(nrows=1, ncols=1, sharex=True)
ax.scatter(np.array(range(num_schools)), school_effects_med, color='red', s=60)
ax.scatter(
    np.array(range(num_schools)) + 0.1, treatment_effects, color='blue', s=60)

plt.plot([-0.2, 7.4], [np.mean(avg_effect), np.mean(avg_effect)], 'k',
         linestyle='--')

ax.errorbar(
    np.array(range(8)),
    school_effects_med,
    yerr=[
        school_effects_med - school_effects_low,
        school_effects_hi - school_effects_med
    ],
    fmt='none')
ax.legend(('avg_effect', 'HMC', 'Observed effect'), fontsize=14)

plt.xlabel('School')
plt.ylabel('Treatment effect')
plt.title('HMC estimated school treatment effects vs. observed data')
fig.set_size_inches(10, 8)
plt.show()
[Figure: "HMC estimated school treatment effects vs. observed data" — posterior medians with 95% intervals alongside observed effects, with the group average as a dashed line]

In the plot above, we can observe the shrinkage of the per-school estimates toward the group avg_effect.
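One rough way to quantify that shrinkage (our addition; the definition is ad hoc) is the fraction of the distance from each observed effect to the group mean that the posterior median covers:

# 0 means no pooling (median equals the observed effect);
# 1 means complete pooling (median equals the group mean).
group_mean = np.mean(avg_effect)
shrinkage = 1. - ((school_effects_med - group_mean) /
                  (treatment_effects - group_mean))
print(np.round(shrinkage, 2))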

print("Inferred posterior mean: {0:.2f}".format( np.mean(school_effects_samples[:,]))) print("Inferred posterior mean se: {0:.2f}".format( np.std(school_effects_samples[:,])))
Inferred posterior mean: 6.97
Inferred posterior mean se: 10.41

Criticism

To get the posterior predictive distribution, i.e., a model of new data $y^*$ given the observed data $y$:

$$p(y^* \mid y) \propto \int_\theta p(y^* \mid \theta)\,p(\theta \mid y)\,d\theta$$

we override the values of the random variables in the model to set them to the mean of the posterior distribution, and sample from that model to generate new data $y^*$.

sample_shape = [5000]

_, _, _, predictive_treatment_effects = model.sample(
    value=(tf.broadcast_to(np.mean(avg_effect, 0), sample_shape),
           tf.broadcast_to(np.mean(avg_stddev, 0), sample_shape),
           tf.broadcast_to(np.mean(school_effects_standard, 0),
                           sample_shape + [num_schools]),
           None))
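Note that conditioning on posterior means is a plug-in approximation. A fuller posterior predictive (a sketch on our part, not used below) would feed each posterior draw through the model, so the predictive spread also reflects posterior uncertainty:

# Condition on the 5000 posterior draws themselves rather than their means;
# the `None` asks the model to sample `treatment_effects` for each draw.
_, _, _, full_predictive_effects = model.sample(
    value=(avg_effect, avg_stddev, school_effects_standard, None))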
fig, axes = plt.subplots(4, 2, sharex=True, sharey=True)
fig.set_size_inches(12, 10)
fig.tight_layout()
for i, ax in enumerate(axes):
  sns.kdeplot(predictive_treatment_effects[:, 2*i].numpy(),
              ax=ax[0], shade=True)
  ax[0].title.set_text(
      "School {} treatment effect posterior predictive".format(2*i))
  sns.kdeplot(predictive_treatment_effects[:, 2*i + 1].numpy(),
              ax=ax[1], shade=True)
  ax[1].title.set_text(
      "School {} treatment effect posterior predictive".format(2*i + 1))
plt.show()
[Figure: kernel density estimates, "School i treatment effect posterior predictive" for each of the eight schools]
# The mean predicted treatment effects for each of the eight schools.
prediction = np.mean(predictive_treatment_effects, axis=0)

We can look at the residuals between the observed treatment effects and the posterior predictive means. These correspond to the plot above, which shows the shrinkage of the estimated effects toward the population average.

treatment_effects - prediction
array([14.905351 , 1.2838383, -5.6966295, 0.8327627, -2.3356671, -2.0363257, 5.997898 , 4.3731265], dtype=float32)

Because we have a distribution of predictions for each school, we can consider the distribution of residuals as well.

residuals = treatment_effects - predictive_treatment_effects
fig, axes = plt.subplots(4, 2, sharex=True, sharey=True)
fig.set_size_inches(12, 10)
fig.tight_layout()
for i, ax in enumerate(axes):
  sns.kdeplot(residuals[:, 2*i].numpy(), ax=ax[0], shade=True)
  ax[0].title.set_text(
      "School {} treatment effect residuals".format(2*i))
  sns.kdeplot(residuals[:, 2*i + 1].numpy(), ax=ax[1], shade=True)
  ax[1].title.set_text(
      "School {} treatment effect residuals".format(2*i + 1))
plt.show()
[Figure: kernel density estimates, "School i treatment effect residuals" for each of the eight schools]

Acknowledgements

This tutorial was originally written in Edward 1.0 (source). We thank all the contributors who wrote and revised that version.

References

  1. Donald B. Rubin. Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377-401, 1981.

  2. Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC, 2013.