Copyright 2019 The TensorFlow Probability Authors.
Licensed under the Apache License, Version 2.0 (the "License");
Multilevel Modeling Primer in TensorFlow Probability
This example is ported from the PyMC3 example notebook A Primer on Bayesian Methods for Multilevel Modeling
Dependencies & Prerequisites
1 Introduction
In this colab we will fit hierarchical linear models (HLMs) of various degrees of model complexity using the popular Radon dataset. We will make use of TFP primitives and its Markov Chain Monte Carlo toolset.
To better fit the data, our goal is to make use of the natural hierarchical structure present in the dataset. We begin with conventional approaches: completely pooled and unpooled models. We continue with multilevel models: exploring partial pooling models, group-level predictors and contextual effects.
For a related notebook also fitting HLMs using TFP on the Radon dataset, check out Linear Mixed-Effect Regression in {TF Probability, R, Stan}.
If you have any questions about the material here, don't hesitate to contact (or join) the TensorFlow Probability mailing list. We're happy to help.
2 Multilevel Modeling Overview
Hierarchical or multilevel modeling is a generalization of regression modeling.
Multilevel models are regression models in which the constituent model parameters are given probability distributions. This implies that model parameters are allowed to vary by group. Observational units are often naturally clustered. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.
A hierarchical model is a particular multilevel model where parameters are nested within one another. Some multilevel structures are not hierarchical.
For example, "country" and "year" are not nested; they represent separate, but potentially overlapping, clusters of parameters. We will motivate this topic using an environmental epidemiology example.
Example: Radon contamination (Gelman and Hill 2006)
Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.
The EPA did a study of radon levels in 80,000 houses. Two important predictors are:
1. Measurement in the basement or on the first floor (radon is higher in basements)
2. County uranium level (positively correlated with radon levels)
We will focus on modeling radon levels in Minnesota. The hierarchy in this example is households within each county.
3 Data Munging
In this section we obtain the radon dataset and do some minimal preprocessing.
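The preprocessing amounts to mapping county names to integer codes and log-transforming the radon readings. A minimal NumPy sketch with made-up readings (the actual notebook reads the survey data files; the county names and activity values below are synthetic stand-ins):

```python
import numpy as np

# Synthetic stand-in for the Minnesota radon survey: county names and
# measured radon activity per house.
county_names = np.array(["AITKIN", "AITKIN", "ANOKA", "ANOKA", "ANOKA"])
activity = np.array([2.2, 2.9, 0.0, 1.1, 4.5])

# Map county names to contiguous integer codes for indexing model parameters.
unique_counties, county_idx = np.unique(county_names, return_inverse=True)

# Zero readings are clamped to 0.1 before the log transform so that
# log-radon is finite.
log_radon = np.log(np.where(activity == 0.0, 0.1, activity))

print(unique_counties)            # distinct county labels
print(county_idx)                 # per-observation county code
print(np.round(log_radon, 3))     # response variable used by all models
```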
Distribution of radon levels (log scale):
4 Conventional Approaches
The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:
Complete Pooling:
Treat all counties the same, and estimate a single radon level:
$$y_i = \alpha + \beta x_i + \epsilon_i$$
No Pooling:
Model radon in each county independently:
$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$
where $j = 1, \ldots, 85$ indexes the county of observation $i$.
The errors may represent measurement error, temporal within-house variation, or variation among houses.
Below, we fit the complete pooling model using Hamiltonian Monte Carlo.
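The notebook fits this model with TFP's Hamiltonian Monte Carlo toolset (code cell omitted here). As a rough sanity check on what the pooled point estimates should look like, the same intercept and slope can be recovered by ordinary least squares; this sketch uses synthetic data, and the "true" values 1.3 and -0.6 are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the pooled model y = alpha + beta * x + eps, where
# x is the floor indicator (0 = basement, 1 = first floor).
n = 500
x = rng.integers(0, 2, size=n)
y = 1.3 - 0.6 * x + rng.normal(0.0, 0.7, size=n)

# Least-squares point estimates of the single shared intercept and slope.
X = np.column_stack([np.ones(n), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(alpha_hat, 2), round(beta_hat, 2))
```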
Plot the point estimates of the slope and intercept for the complete pooling model.
Next, we estimate the radon levels for each county in the unpooled model.
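Because the unpooled model gives each county its own intercept with no shared structure, its point estimate for county $j$ is effectively the mean of that county's observations. A small NumPy sketch (synthetic codes and readings):

```python
import numpy as np

# Synthetic observations tagged with county codes.
county_idx = np.array([0, 0, 0, 1, 1, 2])
log_radon = np.array([1.0, 1.4, 1.2, 0.5, 0.7, 2.0])

# Unpooled intercept estimate for county j is the mean of its observations;
# counties with only one or two houses get very noisy estimates.
n_counties = county_idx.max() + 1
counts = np.bincount(county_idx, minlength=n_counties)
sums = np.bincount(county_idx, weights=log_radon, minlength=n_counties)
unpooled_means = sums / counts

print(np.round(unpooled_means, 2))  # per-county intercept estimates
```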
Here are the unpooled county expected values for the intercept, along with 95% credible intervals for each chain. We also report the R-hat value for each county's estimate.
We can plot the ordered estimates to identify counties with high radon levels:
Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.
Neither of these models is satisfactory:
If we are trying to identify high-radon counties, pooling is not useful.
We do not trust extreme unpooled estimates produced by models using few observations.
5 Multilevel and Hierarchical models
When we pool our data, we lose the information that different data points came from different counties. This means that each radon-level observation is sampled from the same probability distribution. Such a model fails to learn any variation in the sampling unit that is inherent within a group (e.g. a county). It only accounts for sampling variance.
When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them:
In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different nor exactly the same. This is known as partial pooling.
5.1 Partial Pooling
The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at either the group or individual level. An example of an individual-level predictor is whether the data point is from the basement or the first floor. A group-level predictor can be the county-wide mean uranium level.
A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.
Let $\hat{\alpha}_j$ be the estimated log-radon level in county $j$. It's just an intercept; we ignore slopes for now. $n_j$ is the number of observations from county $j$. $\sigma_{\alpha}^2$ and $\sigma_y^2$ are the variance among the county parameters and the sampling variance, respectively. Then a partial pooling model could posit:

$$\hat{\alpha}_j \approx \frac{(n_j/\sigma_y^2)\,\bar{y}_j + (1/\sigma_{\alpha}^2)\,\bar{y}}{(n_j/\sigma_y^2) + (1/\sigma_{\alpha}^2)}$$
We expect the following when using partial pooling:
Estimates for counties with smaller sample sizes will shrink towards the state-wide average.
Estimates for counties with larger sample sizes will be closer to the unpooled county estimates.
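The precision-weighted compromise described above can be computed directly. In this sketch the variance components, county sample sizes, and means are all assumed values chosen for illustration (in practice the variances are estimated from the data):

```python
import numpy as np

# Assumed variance components (in practice these are estimated).
sigma_y2 = 0.6   # within-county sampling variance
sigma_a2 = 0.1   # between-county variance of the intercepts

# Hypothetical per-county sample sizes and means, plus the state-wide mean.
n_j = np.array([1, 5, 50])
ybar_j = np.array([2.5, 1.8, 1.4])
ybar = 1.3

# Precision-weighted average of the county mean and the state-wide mean:
# small counties are pulled strongly toward ybar, large counties barely move.
w = n_j / sigma_y2
alpha_hat = (w * ybar_j + (1.0 / sigma_a2) * ybar) / (w + 1.0 / sigma_a2)

print(np.round(alpha_hat, 3))
```

Note how the county with a single observation shrinks from 2.5 most of the way toward 1.3, while the county with 50 observations stays close to its own mean.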
Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes. The former are both more extreme and more imprecise.
5.2 Varying Intercepts
We now consider a more complex model that allows intercepts to vary across counties, according to a random effect.
The slope $\beta$, which lets the observation vary according to the location of measurement (basement or first floor), is still a fixed effect shared between different counties.
As with the unpooled model, we set a separate intercept for each county, but rather than fitting separate least-squares regression models for each county, multilevel modeling shares strength among counties, allowing for more reasonable inference in counties with little data.
The estimate for the floor coefficient is approximately $-0.69$, which can be interpreted as houses without basements having about half ($\exp(-0.69) \approx 0.5$) the radon levels of those with basements, after accounting for county.
5.3 Varying Slopes
Alternatively, we can posit a model that allows the counties to vary according to how the location of measurement (basement or first floor) influences the radon reading. In this case the intercept is shared between counties.
5.4 Varying Intercepts and Slopes
The most general model allows both the intercept and slope to vary by county:
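A generative sketch of this model may help fix the indexing: each observation picks up the intercept and slope of its own county. All the population-level values below are assumptions for illustration, not the notebook's fitted values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generative sketch of the varying-intercepts-and-slopes model:
#   y_i = alpha_{j[i]} + beta_{j[i]} * x_i + eps_i
n_counties = 3
alpha = rng.normal(1.5, 0.3, size=n_counties)    # per-county intercepts
beta = rng.normal(-0.6, 0.2, size=n_counties)    # per-county floor effects

county_idx = np.array([0, 0, 1, 1, 2, 2])        # county of each house
x = np.array([0, 1, 0, 1, 0, 1])                 # floor indicator

# Fancy indexing with county_idx selects each observation's own parameters.
y = alpha[county_idx] + beta[county_idx] * x + rng.normal(0.0, 0.1, size=6)
print(np.round(y, 2))
```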
6 Adding Group-level Predictors
A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above:

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels:

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$
$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$

Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).
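To see the group-level regression in isolation, this sketch simulates county intercepts from the relation above and recovers the coefficients by least squares. The values of $\gamma_0 = 1.5$ and $\gamma_1 = 0.7$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate county intercepts from the group-level model
#   alpha_j = gamma_0 + gamma_1 * u_j + zeta_j,  zeta_j ~ N(0, sigma_a^2)
n_counties = 85
u = rng.normal(0.0, 1.0, size=n_counties)                  # uranium readings
alpha = 1.5 + 0.7 * u + rng.normal(0.0, 0.15, size=n_counties)

# Recover the group-level coefficients by regressing the county
# intercepts on the county uranium covariate.
U = np.column_stack([np.ones(n_counties), u])
(g0_hat, g1_hat), *_ = np.linalg.lstsq(U, alpha, rcond=None)

print(round(g0_hat, 2), round(g1_hat, 2))
```

In the full multilevel model the intercepts are not observed; they are estimated jointly with the group-level regression, which is what allows the uranium covariate to tighten the county estimates.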
Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.
Group-level predictors also serve to reduce group-level variation $\sigma_{\alpha}^2$. An important implication of this is that the group-level estimate induces stronger pooling.
The standard errors on the intercepts are narrower than for the partial-pooling model without a county-level covariate.
6.2 Correlations Among Levels
In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept:

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \gamma_2 \bar{x}_j + \zeta_j$$

These are broadly referred to as contextual effects.
So, we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. Perhaps this is related to the soil type, which in turn might influence what type of structures are built.
6.3 Prediction
Gelman (2006) used cross-validation tests to check the prediction error of the unpooled, pooled, and partially-pooled models.
Root mean squared cross-validation prediction errors:
unpooled = 0.86
pooled = 0.84
multilevel = 0.79
There are two types of prediction that can be made in a multilevel model:
A new individual within an existing group
A new individual within a new group
For example, if we wanted to make a prediction for a new house with no basement in St. Louis county, we just need to sample from the radon model with the appropriate intercept.
That is, $\tilde{y} \sim N(\alpha_j + \beta x, \sigma_y^2)$ with $x = 1$ (no basement) and $j$ the index of St. Louis county.
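A sketch of that predictive draw with NumPy. The intercept, slope, and noise scale below are hypothetical point estimates, not the notebook's posterior values; a full posterior predictive would average over posterior draws of the parameters rather than fix them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical point estimates: the St. Louis county intercept, the shared
# floor coefficient, and the observation noise scale.
alpha_j = 1.2
beta = -0.69       # floor coefficient; x = 1 means no basement
sigma_y = 0.75

# Predictive draws for a new first-floor house in that county:
#   y_new ~ Normal(alpha_j + beta * 1, sigma_y^2)
draws = rng.normal(alpha_j + beta, sigma_y, size=10_000)

print(round(draws.mean(), 2))   # centers near alpha_j + beta = 0.51
```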
7 Conclusions
Benefits of Multilevel Models:
Accounting for natural hierarchical structure of observational data.
Estimation of coefficients for (under-represented) groups.
Incorporating individual- and group-level information when estimating group-level coefficients.
Allowing for variation among individual-level coefficients across groups.
References
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.
Gelman, A. (2006). Multilevel (Hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 432–435.