GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/pt-br/probability/examples/Multilevel_Modeling_Primer.ipynb
Kernel: Python 3

Licensed under the Apache License, Version 2.0 (the "License");

#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" } # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # https://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License.

Multilevel Modeling Primer in TensorFlow Probability

This example is ported from the PyMC3 example notebook A Primer on Bayesian Methods for Multilevel Modeling.

Dependencies and Prerequisites

#@title Import { display-mode: "form" } import collections import os from six.moves import urllib import daft as daft import matplotlib as mpl import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns import warnings import tensorflow.compat.v2 as tf tf.enable_v2_behavior() import tensorflow_datasets as tfds import tensorflow_probability as tfp tfk = tf.keras tfkl = tf.keras.layers tfpl = tfp.layers tfd = tfp.distributions tfb = tfp.bijectors warnings.simplefilter('ignore')

1 – Introduction

In this colab we will fit hierarchical linear models (HLMs) of various degrees of model complexity using the popular Radon dataset. We will make use of TFP primitives and its Markov Chain Monte Carlo toolset.

To better fit the data, our goal is to make use of the natural hierarchical structure present in the dataset. We begin with conventional approaches: complete pooling and unpooled models. We continue with multilevel models: exploring partial pooling models, group-level predictors and contextual effects.

For a related notebook that also fits HLMs on the Radon dataset with TFP, check out Linear Mixed-Effect Regression in {TF Probability, R, Stan}.

If you have any questions about the material here, don't hesitate to contact (or join) the TensorFlow Probability mailing list. We're happy to help.

2 – Multilevel Modeling Overview

A Primer on Bayesian Methods for Multilevel Modeling

Hierarchical or multilevel modeling is a generalization of regression modeling.

Multilevel models are regression models in which the constituent model parameters are given probability distributions. This implies that model parameters are allowed to vary by group. Observational units are often naturally clustered. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.

A hierarchical model is a particular multilevel model where parameters are nested within one another. Some multilevel structures are not hierarchical.

For example, "country" and "year" are not nested, but may represent separate, although overlapping, clusters of parameters; a minimal sketch of both structures is given below. We will motivate this topic using an environmental epidemiology example.
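As a minimal sketch (the notation here is introduced only for illustration and is not from the original example), a nested, hierarchical structure versus a non-nested, crossed structure could be written as:

$$\text{hierarchical:}\quad y_i \sim \mathcal{N}(\alpha_{j[i]}, \sigma_y^2), \qquad \alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$$

$$\text{non-nested:}\quad y_i \sim \mathcal{N}(\mu + \alpha_{\mathrm{country}[i]} + \beta_{\mathrm{year}[i]}, \sigma_y^2)$$

In the second form the country effects and the year effects are two overlapping groupings of parameters, neither nested inside the other.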

Example: Radon contamination (Gelman and Hill 2006)

Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen and the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.

The EPA (the US Environmental Protection Agency) did a study of radon levels in 80,000 houses. Two important predictors are: 1. measurement in the basement or on the first floor (radon is higher in basements); 2. county uranium level (possible correlation with radon levels).

We will focus on modeling radon levels in Minnesota. The hierarchy in this example is households, within each county.

3 – Data Transformation

In this section we obtain the radon dataset and do some minimal preprocessing.

def load_and_preprocess_radon_dataset(state='MN'): """Preprocess Radon dataset as done in "Bayesian Data Analysis" book. We filter to Minnesota data (919 examples) and preprocess to obtain the following features: - `log_uranium_ppm`: Log of soil uranium measurements. - `county`: Name of county in which the measurement was taken. - `floor`: Floor of house (0 for basement, 1 for first floor) on which the measurement was taken. The target variable is `log_radon`, the log of the Radon measurement in the house. """ ds = tfds.load('radon', split='train') radon_data = tfds.as_dataframe(ds) radon_data.rename(lambda s: s[9:] if s.startswith('feat') else s, axis=1, inplace=True) df = radon_data[radon_data.state==state.encode()].copy() # For any missing or invalid activity readings, we'll use a value of `0.1`. df['radon'] = df.activity.apply(lambda x: x if x > 0. else 0.1) # Make county names look nice. df['county'] = df.county.apply(lambda s: s.decode()).str.strip().str.title() # Remap categories to start from 0 and end at max(category). county_name = sorted(df.county.unique()) df['county'] = df.county.astype( pd.api.types.CategoricalDtype(categories=county_name)).cat.codes county_name = list(map(str.strip, county_name)) df['log_radon'] = df['radon'].apply(np.log) df['log_uranium_ppm'] = df['Uppm'].apply(np.log) df = df[['idnum', 'log_radon', 'floor', 'county', 'log_uranium_ppm']] return df, county_name
radon, county_name = load_and_preprocess_radon_dataset()
num_counties = len(county_name)
num_observations = len(radon)
# Create copies of variables as Tensors.
county = tf.convert_to_tensor(radon['county'], dtype=tf.int32)
floor = tf.convert_to_tensor(radon['floor'], dtype=tf.float32)
log_radon = tf.convert_to_tensor(radon['log_radon'], dtype=tf.float32)
log_uranium = tf.convert_to_tensor(radon['log_uranium_ppm'], dtype=tf.float32)
radon.head()

Distribution of radon levels (log scale):

plt.hist(log_radon.numpy(), bins=25, edgecolor='white')
plt.xlabel("Histogram of Radon levels (Log Scale)")
plt.show()
Image in a Jupyter notebook

4 – Conventional Approaches

The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff:

Complete pooling:

Treat all counties the same, and estimate a single radon level.

$$y_i = \alpha + \beta x_i + \epsilon_i$$

No pooling:

Model radon in each county independently.

$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$

The errors $\epsilon_i$ may represent measurement error, temporal within-house variation, or variation among houses.
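Before fitting these two extremes with TFP below, here is a quick non-Bayesian sketch of the same idea using ordinary least squares on the `radon` DataFrame loaded above (an illustration added for intuition, not part of the original analysis; it assumes `radon`, `county_name` and `num_observations` from the data transformation section):

import numpy as np

# Complete pooling: one shared intercept and slope for every county.
X = np.stack([np.ones(num_observations), radon.floor.values.astype(float)], axis=1)
pooled_coef, *_ = np.linalg.lstsq(X, radon.log_radon.values, rcond=None)
print('pooled (intercept, slope):', pooled_coef)

# No pooling: each county is summarized on its own; counties with few
# observations yield noisy estimates.
for name in ['Lac Qui Parle', 'St Louis']:  # a small and a large county
  code = county_name.index(name)
  y_c = radon.log_radon[radon.county == code]
  print(name, ': n =', len(y_c), ', mean log radon =', float(y_c.mean()))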

#@title 4.1 Complete Pooling Model pgm = daft.PGM([7, 3.5], node_unit=1.2) pgm.add_node( daft.Node( "alpha_prior", r"$\mathcal{N}(0, 10^5)$", 1, 3, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "beta_prior", r"$\mathcal{N}(0, 10^5)$", 2.5, 3, fixed=True, offset=(10, 5))) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 4.5, 3, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("alpha", r"$\alpha$", 1, 2)) pgm.add_node(daft.Node("beta", r"$\beta$", 2.5, 2)) pgm.add_node(daft.Node("sigma", r"$\sigma$", 4.5, 2)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 3, 1, scale=1.25, observed=True)) pgm.add_edge("alpha_prior", "alpha") pgm.add_edge("beta_prior", "beta") pgm.add_edge("sigma_prior", "sigma") pgm.add_edge("sigma", "y_i") pgm.add_edge("alpha", "y_i") pgm.add_edge("beta", "y_i") pgm.add_plate(daft.Plate([2.3, 0.1, 1.4, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook

Below, we fit the complete pooling model using Hamiltonian Monte Carlo.

@tf.function
def affine(x, kernel_diag, bias=tf.zeros([])):
  """`kernel_diag * x + bias` with broadcasting."""
  kernel_diag = tf.ones_like(x) * kernel_diag
  bias = tf.ones_like(x) * bias
  return x * kernel_diag + bias
def pooled_model(floor): """Creates a joint distribution representing our generative process.""" return tfd.JointDistributionSequential([ tfd.Normal(loc=0., scale=1e5), # alpha tfd.Normal(loc=0., scale=1e5), # beta tfd.HalfCauchy(loc=0., scale=5), # sigma lambda s, b1, b0: tfd.MultivariateNormalDiag( # y loc=affine(floor, b1[..., tf.newaxis], b0[..., tf.newaxis]), scale_identity_multiplier=s) ]) @tf.function def pooled_log_prob(alpha, beta, sigma): """Computes `joint_log_prob` pinned at `log_radon`.""" return pooled_model(floor).log_prob([alpha, beta, sigma, log_radon])
@tf.function def sample_pooled(num_chains, num_results, num_burnin_steps, num_observations): """Samples from the pooled model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=pooled_log_prob, num_leapfrog_steps=10, step_size=0.005) initial_state = [ tf.zeros([num_chains], name='init_alpha'), tf.zeros([num_chains], name='init_beta'), tf.ones([num_chains], name='init_sigma') ] # Constrain `sigma` to the positive real axis. Other variables are # unconstrained. unconstraining_bijectors = [ tfb.Identity(), # alpha tfb.Identity(), # beta tfb.Exp() # sigma ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
PooledModel = collections.namedtuple('PooledModel', ['alpha', 'beta', 'sigma']) samples, acceptance_probs = sample_pooled( num_chains=4, num_results=1000, num_burnin_steps=1000, num_observations=num_observations) print('Acceptance Probabilities for each chain: ', acceptance_probs.numpy()) pooled_samples = PooledModel._make(samples)
Acceptance Probabilities for each chain: [0.999 0.996 0.995 0.995]
for var, var_samples in pooled_samples._asdict().items():
  print('R-hat for ', var, ':\t',
        tfp.mcmc.potential_scale_reduction(var_samples).numpy())
R-hat for alpha: 1.0019042
R-hat for beta: 1.0135655
R-hat for sigma: 0.99958754
def reduce_samples(var_samples, reduce_fn): """Reduces across leading two dims using reduce_fn.""" # Collapse the first two dimensions, typically (num_chains, num_samples), and # compute np.mean or np.std along the remaining axis. if isinstance(var_samples, tf.Tensor): var_samples = var_samples.numpy() # convert to numpy array var_samples = np.reshape(var_samples, (-1,) + var_samples.shape[2:]) return np.apply_along_axis(reduce_fn, axis=0, arr=var_samples) sample_mean = lambda samples : reduce_samples(samples, np.mean)

Plot the point estimates of the slope and intercept for the complete pooling model.

LinearEstimates = collections.namedtuple('LinearEstimates', ['intercept', 'slope']) pooled_estimate = LinearEstimates( intercept=sample_mean(pooled_samples.alpha), slope=sample_mean(pooled_samples.beta) ) plt.scatter(radon.floor, radon.log_radon) xvals = np.linspace(-0.2, 1.2) plt.ylabel('Radon level (Log Scale)') plt.xticks([0, 1], ['Basement', 'First Floor']) plt.plot(xvals, pooled_estimate.intercept + pooled_estimate.slope * xvals, 'r--') plt.show()
Image in a Jupyter notebook
#@title Utility function to plot traces of sampled variables. def plot_traces(var_name, samples, num_chains): if isinstance(samples, tf.Tensor): samples = samples.numpy() # convert to numpy array fig, axes = plt.subplots(1, 2, figsize=(14, 1.5), sharex='col', sharey='col') for chain in range(num_chains): axes[0].plot(samples[:, chain], alpha=0.7) axes[0].title.set_text("'{}' trace".format(var_name)) sns.kdeplot(samples[:, chain], ax=axes[1], shade=False) axes[1].title.set_text("'{}' distribution".format(var_name)) axes[0].set_xlabel('Iteration') axes[1].set_xlabel(var_name) plt.show()
for var, var_samples in pooled_samples._asdict().items():
  plot_traces(var, samples=var_samples, num_chains=4)
Image in a Jupyter notebookImage in a Jupyter notebookImage in a Jupyter notebook

Next, we estimate the radon levels for each county in the unpooled model.

#@title 4.2 Unpooled Model { display-mode: "form" } pgm = daft.PGM([7, 3.5], node_unit=1.2) pgm.add_node( daft.Node( "alpha_prior", r"$\mathcal{N}(0, 10^5)$", 1, 3, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "beta_prior", r"$\mathcal{N}(0, 10^5)$", 2.5, 3, fixed=True, offset=(10, 5))) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 4.5, 3, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("alpha", r"$\alpha$", 1, 2)) pgm.add_node(daft.Node("beta", r"$\beta$", 2.5, 2)) pgm.add_node(daft.Node("sigma", r"$\sigma$", 4.5, 2)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 3, 1, scale=1.25, observed=True)) pgm.add_edge("alpha_prior", "alpha") pgm.add_edge("beta_prior", "beta") pgm.add_edge("sigma_prior", "sigma") pgm.add_edge("sigma", "y_i") pgm.add_edge("alpha", "y_i") pgm.add_edge("beta", "y_i") pgm.add_plate(daft.Plate([0.3, 1.1, 1.4, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([2.3, 0.1, 1.4, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def unpooled_model(floor, county): """Creates a joint distribution for the unpooled model.""" return tfd.JointDistributionSequential([ tfd.MultivariateNormalDiag( # alpha loc=tf.zeros([num_counties]), scale_identity_multiplier=1e5), tfd.Normal(loc=0., scale=1e5), # beta tfd.HalfCauchy(loc=0., scale=5), # sigma lambda s, b1, b0: tfd.MultivariateNormalDiag( # y loc=affine( floor, b1[..., tf.newaxis], tf.gather(b0, county, axis=-1)), scale_identity_multiplier=s) ]) @tf.function def unpooled_log_prob(beta0, beta1, sigma): """Computes `joint_log_prob` pinned at `log_radon`.""" return ( unpooled_model(floor, county).log_prob([beta0, beta1, sigma, log_radon]))
@tf.function def sample_unpooled(num_chains, num_results, num_burnin_steps): """Samples from the unpooled model.""" # Initialize the HMC transition kernel. hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=unpooled_log_prob, num_leapfrog_steps=10, step_size=0.025) initial_state = [ tf.zeros([num_chains, num_counties], name='init_beta0'), tf.zeros([num_chains], name='init_beta1'), tf.ones([num_chains], name='init_sigma') ] # Contrain `sigma` to the positive real axis. Other variables are # unconstrained. unconstraining_bijectors = [ tfb.Identity(), # alpha tfb.Identity(), # beta tfb.Exp() # sigma ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
UnpooledModel = collections.namedtuple('UnpooledModel', ['alpha', 'beta', 'sigma']) samples, acceptance_probs = sample_unpooled( num_chains=4, num_results=1000, num_burnin_steps=1000) print('Acceptance Probabilities: ', acceptance_probs.numpy()) unpooled_samples = UnpooledModel._make(samples) print('R-hat for beta:', tfp.mcmc.potential_scale_reduction(unpooled_samples.beta).numpy()) print('R-hat for sigma:', tfp.mcmc.potential_scale_reduction(unpooled_samples.sigma).numpy())
Acceptance Probabilities: [0.895 0.897 0.893 0.901]
R-hat for beta: 1.0052257
R-hat for sigma: 1.0035229
plot_traces(var_name='beta', samples=unpooled_samples.beta, num_chains=4) plot_traces(var_name='sigma', samples=unpooled_samples.sigma, num_chains=4)
Image in a Jupyter notebookImage in a Jupyter notebook

Here are the unpooled county expected values for the intercept, along with 95% credible intervals for each chain. We also report the R-hat value for each county's estimate.

#@title Utility function for Forest plots. def forest_plot(num_chains, num_vars, var_name, var_labels, samples): fig, axes = plt.subplots( 1, 2, figsize=(12, 15), sharey=True, gridspec_kw={'width_ratios': [3, 1]}) for var_idx in range(num_vars): values = samples[..., var_idx] rhat = tfp.mcmc.diagnostic.potential_scale_reduction(values).numpy() meds = np.median(values, axis=-2) los = np.percentile(values, 5, axis=-2) his = np.percentile(values, 95, axis=-2) for i in range(num_chains): height = 0.1 + 0.3 * var_idx + 0.05 * i axes[0].plot([los[i], his[i]], [height, height], 'C0-', lw=2, alpha=0.5) axes[0].plot([meds[i]], [height], 'C0o', ms=1.5) axes[1].plot([rhat], [height], 'C0o', ms=4) axes[0].set_yticks(np.linspace(0.2, 0.3, num_vars)) axes[0].set_ylim(0, 26) axes[0].grid(which='both') axes[0].invert_yaxis() axes[0].set_yticklabels(var_labels) axes[0].xaxis.set_label_position('top') axes[0].set(xlabel='95% Credible Intervals for {}'.format(var_name)) axes[1].set_xticks([1, 2]) axes[1].set_xlim(0.95, 2.05) axes[1].grid(which='both') axes[1].set(xlabel='R-hat') axes[1].xaxis.set_label_position('top') plt.show()
forest_plot( num_chains=4, num_vars=num_counties, var_name='alpha', var_labels=county_name, samples=unpooled_samples.alpha.numpy())
Image in a Jupyter notebook

We can plot the ordered estimates to identify counties with high radon levels:

unpooled_intercepts = reduce_samples(unpooled_samples.alpha, np.mean) unpooled_intercepts_se = reduce_samples(unpooled_samples.alpha, np.std) def plot_ordered_estimates(): means = pd.Series(unpooled_intercepts, index=county_name) std_errors = pd.Series(unpooled_intercepts_se, index=county_name) order = means.sort_values().index plt.plot(range(num_counties), means[order], '.') for i, m, se in zip(range(num_counties), means[order], std_errors[order]): plt.plot([i, i], [m - se, m + se], 'C0-') plt.xlabel('Ordered county') plt.ylabel('Radon estimate') plt.show() plot_ordered_estimates()
Image in a Jupyter notebook
#@title Utility function to plot estimates for a sample set of counties. def plot_estimates(linear_estimates, labels, sample_counties): fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True) axes = axes.ravel() intercepts_indexed = [] slopes_indexed = [] for intercepts, slopes in linear_estimates: intercepts_indexed.append(pd.Series(intercepts, index=county_name)) slopes_indexed.append(pd.Series(slopes, index=county_name)) markers = ['-', 'r--', 'k:'] sample_county_codes = [county_name.index(c) for c in sample_counties] for i, c in enumerate(sample_county_codes): y = radon.log_radon[radon.county == c] x = radon.floor[radon.county == c] axes[i].scatter( x + np.random.randn(len(x)) * 0.01, y, alpha=0.4, label='Log Radon') # Plot both models and data xvals = np.linspace(-0.2, 1.2) for k in range(len(intercepts_indexed)): axes[i].plot( xvals, intercepts_indexed[k][c] + slopes_indexed[k][c] * xvals, markers[k], label=labels[k]) axes[i].set_xticks([0, 1]) axes[i].set_xticklabels(['Basement', 'First Floor']) axes[i].set_ylim(-1, 3) axes[i].set_title(sample_counties[i]) if not i % 2: axes[i].set_ylabel('Log Radon level') axes[3].legend(bbox_to_anchor=(1.05, 0.9), borderaxespad=0.) plt.show()

Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.

unpooled_estimates = LinearEstimates( sample_mean(unpooled_samples.alpha), sample_mean(unpooled_samples.beta) ) sample_counties = ('Lac Qui Parle', 'Aitkin', 'Koochiching', 'Douglas', 'Clay', 'Stearns', 'Ramsey', 'St Louis') plot_estimates( linear_estimates=[unpooled_estimates, pooled_estimate], labels=['Unpooled Estimates', 'Pooled Estimates'], sample_counties=sample_counties)
Image in a Jupyter notebook

Neither of these models is satisfactory:

  • If we are trying to identify high-radon counties, pooling is not useful.

  • We do not trust extreme unpooled estimates produced by models using few observations.

5 – Multilevel and Hierarchical Models

When we pool our data, we lose the information that different data points come from different counties. This means that each radon-level observation is sampled from the same probability distribution. Such a model fails to learn any variation in the sampling unit that is inherent within a group (such as a county); it only accounts for sampling variance.

#@title { display-mode: "form" } mpl.rc("font", size=18) pgm = daft.PGM([13.6, 2.2], origin=[1.15, 1.0], node_ec="none") pgm.add_node(daft.Node("parameter", r"parameter", 2.0, 3)) pgm.add_node(daft.Node("observations", r"observations", 2.0, 2)) pgm.add_node(daft.Node("theta", r"$\theta$", 5.5, 3)) pgm.add_node(daft.Node("y_0", r"$y_0$", 4, 2)) pgm.add_node(daft.Node("y_1", r"$y_1$", 5, 2)) pgm.add_node(daft.Node("dots", r"$\cdots$", 6, 2)) pgm.add_node(daft.Node("y_k", r"$y_k$", 7, 2)) pgm.add_edge("theta", "y_0") pgm.add_edge("theta", "y_1") pgm.add_edge("theta", "y_k") pgm.render() plt.show()
Image in a Jupyter notebook

When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them:

#@title { display-mode: "form" } mpl.rc("font", size=18) pgm = daft.PGM([13.6, 2.2], origin=[1.15, 1.0], node_ec="none") pgm.add_node(daft.Node("parameter", r"parameter", 2.0, 3)) pgm.add_node(daft.Node("observations", r"observations", 2.0, 2)) pgm.add_node(daft.Node("theta_0", r"$\theta_0$", 4, 3)) pgm.add_node(daft.Node("theta_1", r"$\theta_1$", 5, 3)) pgm.add_node(daft.Node("theta_dots", r"$\cdots$", 6, 3)) pgm.add_node(daft.Node("theta_k", r"$\theta_k$", 7, 3)) pgm.add_node(daft.Node("y_0", r"$y_0$", 4, 2)) pgm.add_node(daft.Node("y_1", r"$y_1$", 5, 2)) pgm.add_node(daft.Node("y_dots", r"$\cdots$", 6, 2)) pgm.add_node(daft.Node("y_k", r"$y_k$", 7, 2)) pgm.add_edge("theta_0", "y_0") pgm.add_edge("theta_1", "y_1") pgm.add_edge("theta_k", "y_k") pgm.render() plt.show()
Image in a Jupyter notebook

In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different nor exactly the same. This is known as partial pooling.

#@title { display-mode: "form" } mpl.rc("font", size=18) pgm = daft.PGM([13.6, 3.4], origin=[1.15, 1.0], node_ec="none") pgm.add_node(daft.Node("model", r"model", 2.0, 4)) pgm.add_node(daft.Node("parameter", r"parameter", 2.0, 3)) pgm.add_node(daft.Node("observations", r"observations", 2.0, 2)) pgm.add_node(daft.Node("mu_sigma", r"$\mu,\sigma^2$", 5.5, 4)) pgm.add_node(daft.Node("theta_0", r"$\theta_0$", 4, 3)) pgm.add_node(daft.Node("theta_1", r"$\theta_1$", 5, 3)) pgm.add_node(daft.Node("theta_dots", r"$\cdots$", 6, 3)) pgm.add_node(daft.Node("theta_k", r"$\theta_k$", 7, 3)) pgm.add_node(daft.Node("y_0", r"$y_0$", 4, 2)) pgm.add_node(daft.Node("y_1", r"$y_1$", 5, 2)) pgm.add_node(daft.Node("y_dots", r"$\cdots$", 6, 2)) pgm.add_node(daft.Node("y_k", r"$y_k$", 7, 2)) pgm.add_edge("mu_sigma", "theta_0") pgm.add_edge("mu_sigma", "theta_1") pgm.add_edge("mu_sigma", "theta_k") pgm.add_edge("theta_0", "y_0") pgm.add_edge("theta_1", "y_1") pgm.add_edge("theta_k", "y_k") pgm.render() plt.show()
Image in a Jupyter notebook

5.1 – Partial Pooling

The simplest partial pooling model for the household radon dataset is one that simply estimates radon levels, without any predictors at either the group or the individual level. An example of an individual-level predictor is whether the data point comes from a basement or a first floor. A group-level predictor might be the county-wide mean uranium level.
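Written out with the priors actually used in the code below, this simplest partially pooled model is:

$$\mu_a \sim \mathcal{N}(0, 10^5), \qquad \sigma_a \sim \mathrm{HalfCauchy}(0, 5), \qquad \sigma_y \sim \mathrm{HalfCauchy}(0, 5)$$

$$\alpha_j \sim \mathcal{N}(\mu_a, \sigma_a^2), \qquad y_i \sim \mathcal{N}(\alpha_{j[i]}, \sigma_y^2)$$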

A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates.

Let $\hat{\alpha}_j$ be the estimated log-radon level in county $j$. It is just an intercept; we ignore slopes for now. $n_j$ is the number of observations from county $j$. $\sigma_{\alpha}$ and $\sigma_y$ are the variance of the parameter and the sampling variance, respectively. Then a partial pooling model might posit:

$$\hat{\alpha}_j \approx \frac{(n_j/\sigma_y^2)\bar{y}_j + (1/\sigma_{\alpha}^2)\bar{y}}{(n_j/\sigma_y^2) + (1/\sigma_{\alpha}^2)}$$

We expect the following when using partial pooling (a short numerical sketch of this formula follows the list below):

  • Estimates for counties with smaller sample sizes will shrink towards the state-wide average.

  • Estimates for counties with larger sample sizes will be closer to the unpooled county estimates.
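As a purely numerical illustration of the weighted-average formula above (the numbers are made up for this sketch: a state-wide mean of 1.3, a county mean of 2.0, $\sigma_y = 0.8$, $\sigma_{\alpha} = 0.3$):

def partial_pooling_estimate(n_j, ybar_j, ybar, sigma_y, sigma_alpha):
  """Precision-weighted average of the county mean and the state-wide mean."""
  w_county = n_j / sigma_y**2     # precision of the county sample mean
  w_state = 1. / sigma_alpha**2   # precision contributed by the population distribution
  return (w_county * ybar_j + w_state * ybar) / (w_county + w_state)

# Small counties shrink towards the state-wide mean of 1.3; large counties
# stay close to their own sample mean of 2.0.
for n_j in [2, 10, 100]:
  est = partial_pooling_estimate(n_j, ybar_j=2.0, ybar=1.3, sigma_y=0.8, sigma_alpha=0.3)
  print('n_j =', n_j, ' partially pooled estimate =', round(est, 2))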

#@title { display-mode: "form" } mpl.rc("font", size=12) pgm = daft.PGM([7, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "mu_a_prior", r"$\mathcal{N}(0, 10^5)$", 1, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "sigma_a_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 3, 4, fixed=True, offset=(10, 5))) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 4, 3, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("mu_a", r"$\mu_a$", 1, 3)) pgm.add_node(daft.Node("sigma_a", r"$\sigma_a$", 3, 3)) pgm.add_node(daft.Node("a", r"$a \sim \mathcal{N}$", 2, 2, scale=1.25)) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 4, 2)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 3.25, 1, scale=1.25, observed=True)) pgm.add_edge("mu_a_prior", "mu_a") pgm.add_edge("sigma_a_prior", "sigma_a") pgm.add_edge("mu_a", "a") pgm.add_edge("sigma_a", "a") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_plate(daft.Plate([1.4, 1.2, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([2.65, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def partial_pooling_model(county): """Creates a joint distribution for the partial pooling model.""" return tfd.JointDistributionSequential([ tfd.Normal(loc=0., scale=1e5), # mu_a tfd.HalfCauchy(loc=0., scale=5), # sigma_a lambda sigma_a, mu_a: tfd.MultivariateNormalDiag( # a loc=mu_a[..., tf.newaxis] * tf.ones([num_counties])[tf.newaxis, ...], scale_identity_multiplier=sigma_a), tfd.HalfCauchy(loc=0., scale=5), # sigma_y lambda sigma_y, a: tfd.MultivariateNormalDiag( # y loc=tf.gather(a, county, axis=-1), scale_identity_multiplier=sigma_y) ]) @tf.function def partial_pooling_log_prob(mu_a, sigma_a, a, sigma_y): """Computes joint log prob pinned at `log_radon`.""" return partial_pooling_model(county).log_prob( [mu_a, sigma_a, a, sigma_y, log_radon])
@tf.function def sample_partial_pooling(num_chains, num_results, num_burnin_steps): """Samples from the partial pooling model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=partial_pooling_log_prob, num_leapfrog_steps=10, step_size=0.01) initial_state = [ tf.zeros([num_chains], name='init_mu_a'), tf.ones([num_chains], name='init_sigma_a'), tf.zeros([num_chains, num_counties], name='init_a'), tf.ones([num_chains], name='init_sigma_y') ] unconstraining_bijectors = [ tfb.Identity(), # mu_a tfb.Exp(), # sigma_a tfb.Identity(), # a tfb.Exp() # sigma_y ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
PartialPoolingModel = collections.namedtuple( 'PartialPoolingModel', ['mu_a', 'sigma_a', 'a', 'sigma_y']) samples, acceptance_probs = sample_partial_pooling( num_chains=4, num_results=1000, num_burnin_steps=1000) print('Acceptance Probabilities: ', acceptance_probs.numpy()) partial_pooling_samples = PartialPoolingModel._make(samples)
Acceptance Probabilities: [0.989 0.978 0.987 0.987]
for var in ['mu_a', 'sigma_a', 'sigma_y']: print( 'R-hat for ', var, '\t:', tfp.mcmc.potential_scale_reduction(getattr(partial_pooling_samples, var)).numpy())
R-hat for mu_a : 1.0276643
R-hat for sigma_a : 1.0204039
R-hat for sigma_y : 1.0008202
partial_pooling_intercepts = reduce_samples( partial_pooling_samples.a.numpy(), np.mean) partial_pooling_intercepts_se = reduce_samples( partial_pooling_samples.a.numpy(), np.std) def plot_unpooled_vs_partial_pooling_estimates(): fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharex=True, sharey=True) # Order counties by number of observations (and add some jitter). num_obs_per_county = ( radon.groupby('county')['idnum'].count().values.astype(np.float32)) num_obs_per_county += np.random.normal(scale=0.5, size=num_counties) intercepts_list = [unpooled_intercepts, partial_pooling_intercepts] intercepts_se_list = [unpooled_intercepts_se, partial_pooling_intercepts_se] for ax, means, std_errors in zip(axes, intercepts_list, intercepts_se_list): ax.plot(num_obs_per_county, means, 'C0.') for n, m, se in zip(num_obs_per_county, means, std_errors): ax.plot([n, n], [m - se, m + se], 'C1-', alpha=.5) for ax in axes: ax.set_xscale('log') ax.set_xlabel('No. of Observations Per County') ax.set_xlim(1, 100) ax.set_ylabel('Log Radon Estimate (with Standard Error)') ax.set_ylim(0, 3) ax.hlines(partial_pooling_intercepts.mean(), .9, 125, 'k', '--', alpha=.5) axes[0].set_title('Unpooled Estimates') axes[1].set_title('Partially Pooled Estimates') plot_unpooled_vs_partial_pooling_estimates()
Image in a Jupyter notebook

Note the difference between the unpooled and partially pooled estimates, particularly at smaller sample sizes. The unpooled estimates are both more extreme and more imprecise.

5.2 – Varying Intercepts

We now consider a more complex model, one that allows intercepts to vary across counties according to a random effect.

$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

The slope $\beta$, which lets the observation vary according to the measurement location (basement or first floor), is still a fixed effect shared between different counties.

As with the unpooled model, we set a separate intercept for each county, but rather than fitting separate least-squares regression models for each county, multilevel modeling shares strength among counties, allowing for more reasonable inference in counties with little data.

#@title { display-mode: "form" } mpl.rc("font", size=12) pgm = daft.PGM([7, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "mu_a_prior", r"$\mathcal{N}(0, 10^5)$", 1, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "sigma_a_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 3, 4, fixed=True, offset=(10, 5))) pgm.add_node( daft.Node( "b_prior", r"$\mathcal{N}(0, 10^5)$", 4, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("b", r"$b$", 4, 2.5)) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 6, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("mu_a", r"$\mu_a$", 1, 3)) pgm.add_node(daft.Node("sigma_a", r"$\sigma_a$", 3, 3)) pgm.add_node(daft.Node("a", r"$a \sim \mathcal{N}$", 2, 2, scale=1.25)) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 6, 2.5)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 4, 1, scale=1.25, observed=True)) pgm.add_edge("mu_a_prior", "mu_a") pgm.add_edge("sigma_a_prior", "sigma_a") pgm.add_edge("mu_a", "a") pgm.add_edge("b_prior", "b") pgm.add_edge("sigma_a", "a") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_edge("b", "y_i") pgm.add_plate(daft.Plate([1.4, 1.2, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([3.4, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def varying_intercept_model(floor, county): """Creates a joint distribution for the varying intercept model.""" return tfd.JointDistributionSequential([ tfd.Normal(loc=0., scale=1e5), # mu_a tfd.HalfCauchy(loc=0., scale=5), # sigma_a lambda sigma_a, mu_a: tfd.MultivariateNormalDiag( # a loc=affine(tf.ones([num_counties]), mu_a[..., tf.newaxis]), scale_identity_multiplier=sigma_a), tfd.Normal(loc=0., scale=1e5), # b tfd.HalfCauchy(loc=0., scale=5), # sigma_y lambda sigma_y, b, a: tfd.MultivariateNormalDiag( # y loc=affine(floor, b[..., tf.newaxis], tf.gather(a, county, axis=-1)), scale_identity_multiplier=sigma_y) ]) def varying_intercept_log_prob(mu_a, sigma_a, a, b, sigma_y): """Computes joint log prob pinned at `log_radon`.""" return varying_intercept_model(floor, county).log_prob( [mu_a, sigma_a, a, b, sigma_y, log_radon])
@tf.function def sample_varying_intercepts(num_chains, num_results, num_burnin_steps): """Samples from the varying intercepts model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=varying_intercept_log_prob, num_leapfrog_steps=10, step_size=0.01) initial_state = [ tf.zeros([num_chains], name='init_mu_a'), tf.ones([num_chains], name='init_sigma_a'), tf.zeros([num_chains, num_counties], name='init_a'), tf.zeros([num_chains], name='init_b'), tf.ones([num_chains], name='init_sigma_y') ] unconstraining_bijectors = [ tfb.Identity(), # mu_a tfb.Exp(), # sigma_a tfb.Identity(), # a tfb.Identity(), # b tfb.Exp() # sigma_y ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
VaryingInterceptsModel = collections.namedtuple( 'VaryingInterceptsModel', ['mu_a', 'sigma_a', 'a', 'b', 'sigma_y']) samples, acceptance_probs = sample_varying_intercepts( num_chains=4, num_results=1000, num_burnin_steps=1000) print('Acceptance Probabilities: ', acceptance_probs.numpy()) varying_intercepts_samples = VaryingInterceptsModel._make(samples)
Acceptance Probabilities: [0.989 0.98 0.988 0.983]
for var in ['mu_a', 'sigma_a', 'b', 'sigma_y']: print( 'R-hat for ', var, ': ', tfp.mcmc.potential_scale_reduction( getattr(varying_intercepts_samples, var)).numpy())
R-hat for mu_a : 1.0196627
R-hat for sigma_a : 1.0671698
R-hat for b : 1.0017126
R-hat for sigma_y : 0.99950683
varying_intercepts_estimates = LinearEstimates( sample_mean(varying_intercepts_samples.a), sample_mean(varying_intercepts_samples.b)) sample_counties = ('Lac Qui Parle', 'Aitkin', 'Koochiching', 'Douglas', 'Clay', 'Stearns', 'Ramsey', 'St Louis') plot_estimates( linear_estimates=[ unpooled_estimates, pooled_estimate, varying_intercepts_estimates ], labels=['Unpooled', 'Pooled', 'Varying Intercepts'], sample_counties=sample_counties)
Image in a Jupyter notebook
def plot_posterior(var_name, var_samples): if isinstance(var_samples, tf.Tensor): var_samples = var_samples.numpy() # convert to numpy array fig = plt.figure(figsize=(10, 3)) ax = fig.add_subplot(111) ax.hist(var_samples.flatten(), bins=40, edgecolor='white') sample_mean = var_samples.mean() ax.text( sample_mean, 100, 'mean={:.3f}'.format(sample_mean), color='white', fontsize=12) ax.set_xlabel('posterior of ' + var_name) plt.show() plot_posterior('b', varying_intercepts_samples.b) plot_posterior('sigma_a', varying_intercepts_samples.sigma_a)
Image in a Jupyter notebookImage in a Jupyter notebook

The estimate for the floor coefficient is approximately -0.69, which can be interpreted as houses without basements having about half ($\exp(-0.69) \approx 0.50$) the radon levels of those with basements, after accounting for county.

for var in ['b']:
  var_samples = getattr(varying_intercepts_samples, var)
  mean = var_samples.numpy().mean()
  std = var_samples.numpy().std()
  r_hat = tfp.mcmc.potential_scale_reduction(var_samples).numpy()
  n_eff = tfp.mcmc.effective_sample_size(var_samples).numpy().sum()
  print('var: ', var, ' mean: ', mean, ' std: ', std, ' n_eff: ', n_eff, ' r_hat: ', r_hat)
var: b mean: -0.6920927 std: 0.07004689 n_eff: 430.58865 r_hat: 1.0017126
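As a quick check of the multiplicative interpretation above (a small snippet added here for illustration; it reuses the `varying_intercepts_samples` tuple):

b_mean = varying_intercepts_samples.b.numpy().mean()
# An additive shift of b on the log scale is a multiplicative factor of exp(b)
# on the original radon scale.
print('posterior mean of the floor coefficient:', b_mean)
print('multiplicative effect of first floor vs. basement:', np.exp(b_mean))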
def plot_intercepts_and_slopes(linear_estimates, title): xvals = np.arange(2) intercepts = np.ones([num_counties]) * linear_estimates.intercept slopes = np.ones([num_counties]) * linear_estimates.slope fig, ax = plt.subplots() for c in range(num_counties): ax.plot(xvals, intercepts[c] + slopes[c] * xvals, 'bo-', alpha=0.4) plt.xlim(-0.2, 1.2) ax.set_xticks([0, 1]) ax.set_xticklabels(['Basement', 'First Floor']) ax.set_ylabel('Log Radon level') plt.title(title) plt.show()
plot_intercepts_and_slopes(varying_intercepts_estimates, 'Log Radon Estimates (Varying Intercepts)')
Image in a Jupyter notebook

5.3 – Varying Slopes

Alternatively, we can posit a model that allows the counties to vary according to how the measurement location (basement or first floor) influences the radon reading. In this case the intercept $\alpha$ is shared between counties.

$$y_i = \alpha + \beta_{j[i]} x_{i} + \epsilon_i$$
#@title { display-mode: "form" } mpl.rc("font", size=12) pgm = daft.PGM([10, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "mu_b_prior", r"$\mathcal{N}(0, 10^5)$", 3.2, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "a_prior", r"$\mathcal{N}(0, 10^5)$", 2, 3, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("a", r"$a$", 2, 2)) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 4, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 4, 2.5)) pgm.add_node( daft.Node( "mu_b_prior", r"$\mathcal{N}(0, 10^5)$", 5, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "sigma_b_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 7, 4, fixed=True, offset=(10, 5))) pgm.add_node(daft.Node("mu_b", r"$\mu_b$", 5, 3)) pgm.add_node(daft.Node("sigma_b", r"$\sigma_b$", 7, 3)) pgm.add_node(daft.Node("b", r"$b \sim \mathcal{N}$", 6, 2, scale=1.25)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 4, 1, scale=1.25, observed=True)) pgm.add_edge("a_prior", "a") pgm.add_edge("mu_b_prior", "mu_b") pgm.add_edge("sigma_b_prior", "sigma_b") pgm.add_edge("mu_b", "b") pgm.add_edge("sigma_b", "b") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_edge("b", "y_i") pgm.add_plate(daft.Plate([5.4, 1.2, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([3.4, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def varying_slopes_model(floor, county): """Creates a joint distribution for the varying slopes model.""" return tfd.JointDistributionSequential([ tfd.Normal(loc=0., scale=1e5), # mu_b tfd.HalfCauchy(loc=0., scale=5), # sigma_b tfd.Normal(loc=0., scale=1e5), # a lambda _, sigma_b, mu_b: tfd.MultivariateNormalDiag( # b loc=affine(tf.ones([num_counties]), mu_b[..., tf.newaxis]), scale_identity_multiplier=sigma_b), tfd.HalfCauchy(loc=0., scale=5), # sigma_y lambda sigma_y, b, a: tfd.MultivariateNormalDiag( # y loc=affine(floor, tf.gather(b, county, axis=-1), a[..., tf.newaxis]), scale_identity_multiplier=sigma_y) ]) def varying_slopes_log_prob(mu_b, sigma_b, a, b, sigma_y): return varying_slopes_model(floor, county).log_prob( [mu_b, sigma_b, a, b, sigma_y, log_radon])
@tf.function def sample_varying_slopes(num_chains, num_results, num_burnin_steps): """Samples from the varying slopes model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=varying_slopes_log_prob, num_leapfrog_steps=25, step_size=0.01) initial_state = [ tf.zeros([num_chains], name='init_mu_b'), tf.ones([num_chains], name='init_sigma_b'), tf.zeros([num_chains], name='init_a'), tf.zeros([num_chains, num_counties], name='init_b'), tf.ones([num_chains], name='init_sigma_y') ] unconstraining_bijectors = [ tfb.Identity(), # mu_b tfb.Exp(), # sigma_b tfb.Identity(), # a tfb.Identity(), # b tfb.Exp() # sigma_y ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
VaryingSlopesModel = collections.namedtuple( 'VaryingSlopesModel', ['mu_b', 'sigma_b', 'a', 'b', 'sigma_y']) samples, acceptance_probs = sample_varying_slopes( num_chains=4, num_results=1000, num_burnin_steps=1000) print('Acceptance Probabilities: ', acceptance_probs.numpy()) varying_slopes_samples = VaryingSlopesModel._make(samples)
Acceptance Probabilities: [0.98 0.982 0.986 0.988]
for var in ['mu_b', 'sigma_b', 'a', 'sigma_y']: print( 'R-hat for ', var, '\t: ', tfp.mcmc.potential_scale_reduction(getattr(varying_slopes_samples, var)).numpy())
R-hat for mu_b : 1.0972525
R-hat for sigma_b : 1.1294962
R-hat for a : 1.0047072
R-hat for sigma_y : 1.0015919
varying_slopes_estimates = LinearEstimates( sample_mean(varying_slopes_samples.a), sample_mean(varying_slopes_samples.b)) plot_intercepts_and_slopes(varying_slopes_estimates, 'Log Radon Estimates (Varying Slopes)')
Image in a Jupyter notebook

5.4 – Varying Intercepts and Slopes

The most general model allows both the intercept and the slope to vary by county:

$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$
#@title mpl.rc("font", size=12) pgm = daft.PGM([10, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "mu_a_prior", r"$\mathcal{N}(0, 10^5)$", 1, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "sigma_a_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 3, 4, fixed=True, offset=(10, 5))) pgm.add_node(daft.Node("mu_a", r"$\mu_a$", 1, 3)) pgm.add_node(daft.Node("sigma_a", r"$\sigma_a$", 3, 3)) pgm.add_node(daft.Node("a", r"$a \sim \mathcal{N}$", 2, 2, scale=1.25)) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 4, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 4, 2.5)) pgm.add_node( daft.Node( "mu_b_prior", r"$\mathcal{N}(0, 10^5)$", 5, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "sigma_b_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 7, 4, fixed=True, offset=(10, 5))) pgm.add_node(daft.Node("mu_b", r"$\mu_b$", 5, 3)) pgm.add_node(daft.Node("sigma_b", r"$\sigma_b$", 7, 3)) pgm.add_node(daft.Node("b", r"$b \sim \mathcal{N}$", 6, 2, scale=1.25)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 4, 1, scale=1.25, observed=True)) pgm.add_edge("mu_a_prior", "mu_a") pgm.add_edge("sigma_a_prior", "sigma_a") pgm.add_edge("mu_a", "a") pgm.add_edge("sigma_a", "a") pgm.add_edge("mu_b_prior", "mu_b") pgm.add_edge("sigma_b_prior", "sigma_b") pgm.add_edge("mu_b", "b") pgm.add_edge("sigma_b", "b") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_edge("b", "y_i") pgm.add_plate(daft.Plate([1.4, 1.2, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([5.4, 1.2, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([3.4, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def varying_intercepts_and_slopes_model(floor, county): """Creates a joint distribution for the varying slope model.""" return tfd.JointDistributionSequential([ tfd.Normal(loc=0., scale=1e5), # mu_a tfd.HalfCauchy(loc=0., scale=5), # sigma_a tfd.Normal(loc=0., scale=1e5), # mu_b tfd.HalfCauchy(loc=0., scale=5), # sigma_b lambda sigma_b, mu_b, sigma_a, mu_a: tfd.MultivariateNormalDiag( # a loc=affine(tf.ones([num_counties]), mu_a[..., tf.newaxis]), scale_identity_multiplier=sigma_a), lambda _, sigma_b, mu_b: tfd.MultivariateNormalDiag( # b loc=affine(tf.ones([num_counties]), mu_b[..., tf.newaxis]), scale_identity_multiplier=sigma_b), tfd.HalfCauchy(loc=0., scale=5), # sigma_y lambda sigma_y, b, a: tfd.MultivariateNormalDiag( # y loc=affine(floor, tf.gather(b, county, axis=-1), tf.gather(a, county, axis=-1)), scale_identity_multiplier=sigma_y) ]) @tf.function def varying_intercepts_and_slopes_log_prob(mu_a, sigma_a, mu_b, sigma_b, a, b, sigma_y): """Computes joint log prob pinned at `log_radon`.""" return varying_intercepts_and_slopes_model(floor, county).log_prob( [mu_a, sigma_a, mu_b, sigma_b, a, b, sigma_y, log_radon])
@tf.function def sample_varying_intercepts_and_slopes(num_chains, num_results, num_burnin_steps): """Samples from the varying intercepts and slopes model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=varying_intercepts_and_slopes_log_prob, num_leapfrog_steps=50, step_size=0.01) initial_state = [ tf.zeros([num_chains], name='init_mu_a'), tf.ones([num_chains], name='init_sigma_a'), tf.zeros([num_chains], name='init_mu_b'), tf.ones([num_chains], name='init_sigma_b'), tf.zeros([num_chains, num_counties], name='init_a'), tf.zeros([num_chains, num_counties], name='init_b'), tf.ones([num_chains], name='init_sigma_y') ] unconstraining_bijectors = [ tfb.Identity(), # mu_a tfb.Exp(), # sigma_a tfb.Identity(), # mu_b tfb.Exp(), # sigma_b tfb.Identity(), # a tfb.Identity(), # b tfb.Exp() # sigma_y ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
VaryingInterceptsAndSlopesModel = collections.namedtuple( 'VaryingInterceptsAndSlopesModel', ['mu_a', 'sigma_a', 'mu_b', 'sigma_b', 'a', 'b', 'sigma_y']) samples, acceptance_probs = sample_varying_intercepts_and_slopes( num_chains=4, num_results=1000, num_burnin_steps=500) print('Acceptance Probabilities: ', acceptance_probs.numpy()) varying_intercepts_and_slopes_samples = VaryingInterceptsAndSlopesModel._make( samples)
Acceptance Probabilities: [0.989 0.958 0.984 0.985]
for var in ['mu_a', 'sigma_a', 'mu_b', 'sigma_b']: print( 'R-hat for ', var, '\t: ', tfp.mcmc.potential_scale_reduction( getattr(varying_intercepts_and_slopes_samples, var)).numpy())
R-hat for mu_a : 1.0002819
R-hat for sigma_a : 1.0014255
R-hat for mu_b : 1.0111941
R-hat for sigma_b : 1.0994663
varying_intercepts_and_slopes_estimates = LinearEstimates( sample_mean(varying_intercepts_and_slopes_samples.a), sample_mean(varying_intercepts_and_slopes_samples.b)) plot_intercepts_and_slopes( varying_intercepts_and_slopes_estimates, 'Log Radon Estimates (Varying Intercepts and Slopes)')
Image in a Jupyter notebook
forest_plot( num_chains=4, num_vars=num_counties, var_name='a', var_labels=county_name, samples=varying_intercepts_and_slopes_samples.a.numpy()) forest_plot( num_chains=4, num_vars=num_counties, var_name='b', var_labels=county_name, samples=varying_intercepts_and_slopes_samples.b.numpy())
Image in a Jupyter notebookImage in a Jupyter notebook

6 – Adding Group-Level Predictors

A primary strength of multilevel models is the ability to handle predictors at multiple levels simultaneously. If we consider the varying-intercepts model above:

$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$

where $\zeta_j \sim \mathcal{N}(0, \sigma_{\alpha}^2)$. Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium).
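Substituting the county-level regression into the observation-level equation makes the two levels explicit (this is just a rewriting of the two formulas above, not an additional assumption):

$$y_i = \gamma_0 + \gamma_1 u_{j[i]} + \zeta_{j[i]} + \beta x_i + \epsilon_i$$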

Note that the model has both indicator variables for each county plus a county-level covariate. In classical regression this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this.

Group-level predictors also serve to reduce group-level variation, $\sigma_{\alpha}$. An important implication of this is that the group-level estimate induces stronger pooling.

#@title 6.1 Hierarchical Intercepts model { display-mode: "form" } mpl.rc("font", size=12) pgm = daft.PGM([10, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "gamma_0_prior", r"$\mathcal{N}(0, 10^5)$", 0.5, 4, fixed=True, offset=(0, 5))) pgm.add_node( daft.Node( "gamma_1_prior", r"$\mathcal{N}(0, 10^5)$", 1.5, 4, fixed=True, offset=(10, 5))) pgm.add_node(daft.Node("gamma_0", r"$\gamma_0$", 0.5, 3)) pgm.add_node(daft.Node("gamma_1", r"$\gamma_1$", 1.5, 3)) pgm.add_node( daft.Node( "sigma_a_prior", r"$\mathcal{N}(0, 10^5)$", 3, 4, fixed=True, offset=(0, 5))) pgm.add_node(daft.Node("sigma_a", r"$\sigma_a$", 3, 3.5)) pgm.add_node(daft.Node("eps_a", r"$eps_a$", 3, 2.5, scale=1.25)) pgm.add_node(daft.Node("a", r"$a \sim \mathcal{Det}$", 1.5, 1.2, scale=1.5)) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{U}(0, 100)$", 4, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 4, 2.5)) pgm.add_node(daft.Node("b_prior", r"$\mathcal{N}(0, 10^5)$", 5, 3, fixed=True)) pgm.add_node(daft.Node("b", r"$b$", 5, 2)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 4, 1, scale=1.25, observed=True)) pgm.add_edge("gamma_0_prior", "gamma_0") pgm.add_edge("gamma_1_prior", "gamma_1") pgm.add_edge("sigma_a_prior", "sigma_a") pgm.add_edge("sigma_a", "eps_a") pgm.add_edge("gamma_0", "a") pgm.add_edge("gamma_1", "a") pgm.add_edge("eps_a", "a") pgm.add_edge("b_prior", "b") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_edge("b", "y_i") pgm.add_plate(daft.Plate([2.4, 1.7, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([0.9, 0.4, 1.2, 1.4], "$i = 1:919$")) pgm.add_plate(daft.Plate([3.4, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
def hierarchical_intercepts_model(floor, county, log_uranium): """Creates a joint distribution for the varying slope model.""" return tfd.JointDistributionSequential([ tfd.HalfCauchy(loc=0., scale=5), # sigma_a lambda sigma_a: tfd.MultivariateNormalDiag( # eps_a loc=tf.zeros([num_counties]), scale_identity_multiplier=sigma_a), tfd.Normal(loc=0., scale=1e5), # gamma_0 tfd.Normal(loc=0., scale=1e5), # gamma_1 tfd.Normal(loc=0., scale=1e5), # b tfd.Uniform(low=0., high=100), # sigma_y lambda sigma_y, b, gamma_1, gamma_0, eps_a: tfd. MultivariateNormalDiag( # y loc=affine( floor, b[..., tf.newaxis], affine(log_uranium, gamma_1[..., tf.newaxis], gamma_0[..., tf.newaxis]) + tf.gather(eps_a, county, axis=-1)), scale_identity_multiplier=sigma_y) ]) def hierarchical_intercepts_log_prob(sigma_a, eps_a, gamma_0, gamma_1, b, sigma_y): """Computes joint log prob pinned at `log_radon`.""" return hierarchical_intercepts_model(floor, county, log_uranium).log_prob( [sigma_a, eps_a, gamma_0, gamma_1, b, sigma_y, log_radon])
@tf.function def sample_hierarchical_intercepts(num_chains, num_results, num_burnin_steps): """Samples from the hierarchical intercepts model.""" hmc = tfp.mcmc.HamiltonianMonteCarlo( target_log_prob_fn=hierarchical_intercepts_log_prob, num_leapfrog_steps=10, step_size=0.01) initial_state = [ tf.ones([num_chains], name='init_sigma_a'), tf.zeros([num_chains, num_counties], name='eps_a'), tf.zeros([num_chains], name='init_gamma_0'), tf.zeros([num_chains], name='init_gamma_1'), tf.zeros([num_chains], name='init_b'), tf.ones([num_chains], name='init_sigma_y') ] unconstraining_bijectors = [ tfb.Exp(), # sigma_a tfb.Identity(), # eps_a tfb.Identity(), # gamma_0 tfb.Identity(), # gamma_0 tfb.Identity(), # b # Maps reals to [0, 100]. tfb.Chain([tfb.Shift(shift=50.), tfb.Scale(scale=50.), tfb.Tanh()]) # sigma_y ] kernel = tfp.mcmc.TransformedTransitionKernel( inner_kernel=hmc, bijector=unconstraining_bijectors) samples, kernel_results = tfp.mcmc.sample_chain( num_results=num_results, num_burnin_steps=num_burnin_steps, current_state=initial_state, kernel=kernel) acceptance_probs = tf.reduce_mean( tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0) return samples, acceptance_probs
HierarchicalInterceptsModel = collections.namedtuple( 'HierarchicalInterceptsModel', ['sigma_a', 'eps_a', 'gamma_0', 'gamma_1', 'b', 'sigma_y']) samples, acceptance_probs = sample_hierarchical_intercepts( num_chains=4, num_results=2000, num_burnin_steps=500) print('Acceptance Probabilities: ', acceptance_probs.numpy()) hierarchical_intercepts_samples = HierarchicalInterceptsModel._make(samples)
Acceptance Probabilities: [0.956 0.959 0.9675 0.958 ]
for var in ['sigma_a', 'gamma_0', 'gamma_1', 'b', 'sigma_y']: print( 'R-hat for', var, ':', tfp.mcmc.potential_scale_reduction( getattr(hierarchical_intercepts_samples, var)).numpy())
R-hat for sigma_a : 1.0204408
R-hat for gamma_0 : 1.0075455
R-hat for gamma_1 : 1.0054599
R-hat for b : 1.0011046
R-hat for sigma_y : 1.0004083
def plot_hierarchical_intercepts(): mean_and_var = lambda x : [reduce_samples(x, fn) for fn in [np.mean, np.var]] gamma_0_mean, gamma_0_var = mean_and_var( hierarchical_intercepts_samples.gamma_0) gamma_1_mean, gamma_1_var = mean_and_var( hierarchical_intercepts_samples.gamma_1) eps_a_means, eps_a_vars = mean_and_var(hierarchical_intercepts_samples.eps_a) mu_a_means = gamma_0_mean + gamma_1_mean * log_uranium mu_a_vars = gamma_0_var + np.square(log_uranium) * gamma_1_var a_means = mu_a_means + eps_a_means[county] a_stds = np.sqrt(mu_a_vars + eps_a_vars[county]) plt.figure() plt.scatter(log_uranium, a_means, marker='.', c='C0') xvals = np.linspace(-1, 0.8) plt.plot(xvals,gamma_0_mean + gamma_1_mean * xvals, 'k--') plt.xlim(-1, 0.8) for ui, m, se in zip(log_uranium, a_means, a_stds): plt.plot([ui, ui], [m - se, m + se], 'C1-', alpha=0.1) plt.xlabel('County-level uranium') plt.ylabel('Intercept estimate') plot_hierarchical_intercepts()
Image in a Jupyter notebook

The standard errors on the intercepts are narrower than for the partial-pooling model without a county-level covariate.

6.2 – Correlations Among Levels

In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.

$$\alpha_j = \gamma_0 + \gamma_1 u_j + \gamma_2 \bar{x} + \zeta_j$$
#@title { display-mode: "form" } mpl.rc("font", size=12) pgm = daft.PGM([10, 4.5], node_unit=1.2) pgm.add_node( daft.Node( "gamma_prior", r"$\mathcal{N}(0, 10^5)$", 1.5, 4, fixed=True, offset=(10, 5))) pgm.add_node(daft.Node("gamma", r"$\gamma$", 1.5, 3.5)) pgm.add_node(daft.Node("mu_a", r"$\mu_a$", 1.5, 2.2)) pgm.add_node( daft.Node( "sigma_a_prior", r"$\mathrm{HalfCauchy}(0, 5)$", 3, 4, fixed=True, offset=(0, 5))) pgm.add_node(daft.Node("sigma_a", r"$\sigma_a$", 3, 3.5)) pgm.add_node(daft.Node("eps_a", r"$eps_a$", 3, 2.5, scale=1.25)) pgm.add_node(daft.Node("a", r"$a \sim \mathcal{Det}$", 1.5, 1.2, scale=1.5)) pgm.add_node( daft.Node( "sigma_prior", r"$\mathrm{U}(0, 100)$", 4, 3.5, fixed=True, offset=(20, 5))) pgm.add_node(daft.Node("sigma_y", r"$\sigma_y$", 4, 2.5)) pgm.add_node(daft.Node("b_prior", r"$\mathcal{N}(0, 10^5)$", 5, 3, fixed=True)) pgm.add_node(daft.Node("b", r"$b$", 5, 2)) pgm.add_node( daft.Node( "y_i", r"$y_i \sim \mathcal{N}$", 4, 1, scale=1.25, observed=True)) pgm.add_edge("gamma_prior", "gamma") pgm.add_edge("sigma_a_prior", "sigma_a") pgm.add_edge("sigma_a", "eps_a") pgm.add_edge("gamma", "mu_a") pgm.add_edge("mu_a", "a") pgm.add_edge("eps_a", "a") pgm.add_edge("b_prior", "b") pgm.add_edge("sigma_prior", "sigma_y") pgm.add_edge("sigma_y", "y_i") pgm.add_edge("a", "y_i") pgm.add_edge("b", "y_i") pgm.add_plate(daft.Plate([0.9, 2.9, 1.2, 1.0], "$i = 1:3$")) pgm.add_plate(daft.Plate([2.4, 1.7, 1.2, 1.4], "$i = 1:85$")) pgm.add_plate(daft.Plate([0.9, 0.4, 1.2, 2.2], "$i = 1:919$")) pgm.add_plate(daft.Plate([3.4, 0.2, 1.2, 1.4], "$i = 1:919$")) pgm.render() plt.show()
Image in a Jupyter notebook
# Create a new variable for mean of floor across counties
xbar = tf.convert_to_tensor(radon.groupby('county')['floor'].mean(), tf.float32)
xbar = tf.gather(xbar, county, axis=-1)
def contextual_effects_model(floor, county, log_uranium, xbar): """Creates a joint distribution for the varying slope model.""" return tfd.JointDistributionSequential([ tfd.HalfCauchy(loc=0., scale=5), # sigma_a lambda sigma_a: tfd.MultivariateNormalDiag( # eps_a loc=tf.zeros([num_counties]), scale_diag=sigma_a[..., tf.newaxis] * tf.ones([num_counties])), tfd.Normal(loc=0., scale=1e5), # gamma_0 tfd.Normal(loc=0., scale=1e5), # gamma_1 tfd.Normal(loc=0., scale=1e5), # gamma_2 tfd.Normal(loc=0., scale=1e5), # b tfd.Uniform(low=0., high=100), # sigma_y lambda sigma_y, b, gamma_2, gamma_1, gamma_0, eps_a: tfd. MultivariateNormalDiag( # y loc=affine( floor, b[..., tf.newaxis], affine(log_uranium, gamma_1[..., tf.newaxis], gamma_0[ ..., tf.newaxis]) + affine(xbar, gamma_2[..., tf.newaxis]) + tf.gather(eps_a, county, axis=-1)), scale_diag=sigma_y[..., tf.newaxis] * tf.ones_like(xbar)) ]) def contextual_effects_log_prob(sigma_a, eps_a, gamma_0, gamma_1, gamma_2, b, sigma_y): """Computes joint log prob pinned at `log_radon`.""" return contextual_effects_model(floor, county, log_uranium, xbar).log_prob( [sigma_a, eps_a, gamma_0, gamma_1, gamma_2, b, sigma_y, log_radon])
@tf.function
def sample_contextual_effects(num_chains, num_results, num_burnin_steps):
  """Samples from the contextual effects model."""
  hmc = tfp.mcmc.HamiltonianMonteCarlo(
      target_log_prob_fn=contextual_effects_log_prob,
      num_leapfrog_steps=10,
      step_size=0.01)

  initial_state = [
      tf.ones([num_chains], name='init_sigma_a'),
      tf.zeros([num_chains, num_counties], name='eps_a'),
      tf.zeros([num_chains], name='init_gamma_0'),
      tf.zeros([num_chains], name='init_gamma_1'),
      tf.zeros([num_chains], name='init_gamma_2'),
      tf.zeros([num_chains], name='init_b'),
      tf.ones([num_chains], name='init_sigma_y')
  ]

  unconstraining_bijectors = [
      tfb.Exp(),       # sigma_a
      tfb.Identity(),  # eps_a
      tfb.Identity(),  # gamma_0
      tfb.Identity(),  # gamma_1
      tfb.Identity(),  # gamma_2
      tfb.Identity(),  # b
      tfb.Chain([tfb.Shift(shift=50.),
                 tfb.Scale(scale=50.),
                 tfb.Tanh()])  # sigma_y
  ]

  kernel = tfp.mcmc.TransformedTransitionKernel(
      inner_kernel=hmc, bijector=unconstraining_bijectors)
  samples, kernel_results = tfp.mcmc.sample_chain(
      num_results=num_results,
      num_burnin_steps=num_burnin_steps,
      current_state=initial_state,
      kernel=kernel)

  acceptance_probs = tf.reduce_mean(
      tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0)

  return samples, acceptance_probs
ContextualEffectsModel = collections.namedtuple(
    'ContextualEffectsModel',
    ['sigma_a', 'eps_a', 'gamma_0', 'gamma_1', 'gamma_2', 'b', 'sigma_y'])

samples, acceptance_probs = sample_contextual_effects(
    num_chains=4, num_results=2000, num_burnin_steps=500)
print('Acceptance Probabilities: ', acceptance_probs.numpy())
contextual_effects_samples = ContextualEffectsModel._make(samples)
Acceptance Probabilities:  [0.948 0.952 0.956 0.953]
for var in ['sigma_a', 'gamma_0', 'gamma_1', 'gamma_2', 'b', 'sigma_y']:
  print(
      'R-hat for ', var, ': ',
      tfp.mcmc.potential_scale_reduction(
          getattr(contextual_effects_samples, var)).numpy())
R-hat for  sigma_a :  1.1393573
R-hat for  gamma_0 :  1.0081229
R-hat for  gamma_1 :  1.0007668
R-hat for  gamma_2 :  1.012864
R-hat for  b :  1.0019505
R-hat for  sigma_y :  1.0056173
for var in ['gamma_0', 'gamma_1', 'gamma_2']:
  var_samples = getattr(contextual_effects_samples, var)
  mean = var_samples.numpy().mean()
  std = var_samples.numpy().std()
  r_hat = tfp.mcmc.potential_scale_reduction(var_samples).numpy()
  n_eff = tfp.mcmc.effective_sample_size(var_samples).numpy().sum()

  print(var, ' mean: ', mean, ' std: ', std, ' n_eff: ', n_eff, ' r_hat: ',
        r_hat)
gamma_0  mean:  1.3939122  std:  0.051875897  n_eff:  572.4374  r_hat:  1.0081229
gamma_1  mean:  0.7207277  std:  0.090660274  n_eff:  727.2628  r_hat:  1.0007668
gamma_2  mean:  0.40686083  std:  0.20155264  n_eff:  381.74048  r_hat:  1.012864

From this, we might infer that counties with a higher proportion of houses without basements tend to have higher baseline radon levels. This may be related to soil type, which in turn can influence what kinds of structures are built.

6.3 – Prediction

Gelman (2006) used cross-validation tests to check the prediction error of the no-pooling, complete-pooling, and partial-pooling models.

Root-mean-squared cross-validation prediction errors (a minimal sketch of the error metric itself follows the list):

  • No pooling = 0.86

  • Complete pooling = 0.84

  • Multilevel = 0.79
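For reference, the metric behind these numbers is simply the root-mean-squared prediction error on held-out observations. A minimal sketch is below; the `y_true` and `y_pred` arrays are hypothetical placeholders, since this notebook does not re-run Gelman's cross-validation study.

# Root-mean-squared prediction error, the metric quoted above.
# `y_true` and `y_pred` stand for hypothetical held-out log-radon
# observations and the corresponding model predictions.
def rmse(y_true, y_pred):
  return np.sqrt(np.mean(np.square(np.asarray(y_true) - np.asarray(y_pred))))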

Two types of prediction can be made in a multilevel model:

  1. A new individual within an existing group.

  2. A new individual within a new group (see the sketch after the St. Louis prediction below).

For example, if we wanted to make a prediction for a new house with no basement in St. Louis county, we just need to sample from the radon model with the appropriate intercept.

county_name.index('St Louis')
69

That is:

$$\tilde{y}_i \sim N(\alpha_{69} + \beta (x_i = 1), \sigma_y^2)$$
st_louis_log_uranium = tf.convert_to_tensor(
    radon.where(radon['county'] == 69)['log_uranium_ppm'].mean(), tf.float32)
st_louis_xbar = tf.convert_to_tensor(
    radon.where(radon['county'] == 69)['floor'].mean(), tf.float32)
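Before folding this prediction into the joint distribution (the approach taken in the next cells), note that the formula above can also be evaluated directly from the posterior draws of the contextual-effects model fit earlier. A minimal sketch, assuming `contextual_effects_samples` from section 6.2 is still in scope; this direct-draw shortcut is illustrative only and is not the method used below:

# Illustrative only: draw predictive samples for a new house with no basement
# (floor = 1) in county 69 directly from the contextual-effects posterior.
s = contextual_effects_samples
a_69 = (s.gamma_0 + s.gamma_1 * st_louis_log_uranium +
        s.gamma_2 * st_louis_xbar + s.eps_a[..., 69])
stl_direct_pred = tfd.Normal(loc=a_69 + s.b, scale=s.sigma_y).sample()
print('Direct posterior predictive mean:',
      tf.reduce_mean(stl_direct_pred).numpy())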
@tf.function
def intercept_a(gamma_0, gamma_1, gamma_2, eps_a, log_uranium, xbar, county):
  return (affine(log_uranium, gamma_1, gamma_0) + affine(xbar, gamma_2) +
          tf.gather(eps_a, county, axis=-1))

def contextual_effects_predictive_model(floor, county, log_uranium, xbar,
                                        st_louis_log_uranium, st_louis_xbar):
  """Creates a joint distribution for the contextual effects model."""
  return tfd.JointDistributionSequential([
      tfd.HalfCauchy(loc=0., scale=5),  # sigma_a
      lambda sigma_a: tfd.MultivariateNormalDiag(  # eps_a
          loc=tf.zeros([num_counties]),
          scale_diag=sigma_a[..., tf.newaxis] * tf.ones([num_counties])),
      tfd.Normal(loc=0., scale=1e5),  # gamma_0
      tfd.Normal(loc=0., scale=1e5),  # gamma_1
      tfd.Normal(loc=0., scale=1e5),  # gamma_2
      tfd.Normal(loc=0., scale=1e5),  # b
      tfd.Uniform(low=0., high=100),  # sigma_y
      # y
      lambda sigma_y, b, gamma_2, gamma_1, gamma_0, eps_a: (
          tfd.MultivariateNormalDiag(
              loc=affine(
                  floor, b[..., tf.newaxis],
                  intercept_a(gamma_0[..., tf.newaxis],
                              gamma_1[..., tf.newaxis],
                              gamma_2[..., tf.newaxis], eps_a, log_uranium,
                              xbar, county)),
              scale_diag=sigma_y[..., tf.newaxis] * tf.ones_like(xbar))),
      # stl_pred
      lambda _, sigma_y, b, gamma_2, gamma_1, gamma_0, eps_a: tfd.Normal(
          loc=intercept_a(gamma_0, gamma_1, gamma_2, eps_a,
                          st_louis_log_uranium, st_louis_xbar, 69) + b,
          scale=sigma_y)
  ])

@tf.function
def contextual_effects_predictive_log_prob(sigma_a, eps_a, gamma_0, gamma_1,
                                           gamma_2, b, sigma_y, stl_pred):
  """Computes joint log prob pinned at `log_radon`."""
  return contextual_effects_predictive_model(
      floor, county, log_uranium, xbar, st_louis_log_uranium,
      st_louis_xbar).log_prob([
          sigma_a, eps_a, gamma_0, gamma_1, gamma_2, b, sigma_y, log_radon,
          stl_pred
      ])
@tf.function
def sample_contextual_effects_predictive(num_chains, num_results,
                                         num_burnin_steps):
  """Samples from the contextual effects predictive model."""
  hmc = tfp.mcmc.HamiltonianMonteCarlo(
      target_log_prob_fn=contextual_effects_predictive_log_prob,
      num_leapfrog_steps=50,
      step_size=0.01)

  initial_state = [
      tf.ones([num_chains], name='init_sigma_a'),
      tf.zeros([num_chains, num_counties], name='eps_a'),
      tf.zeros([num_chains], name='init_gamma_0'),
      tf.zeros([num_chains], name='init_gamma_1'),
      tf.zeros([num_chains], name='init_gamma_2'),
      tf.zeros([num_chains], name='init_b'),
      tf.ones([num_chains], name='init_sigma_y'),
      tf.zeros([num_chains], name='init_stl_pred')
  ]

  unconstraining_bijectors = [
      tfb.Exp(),       # sigma_a
      tfb.Identity(),  # eps_a
      tfb.Identity(),  # gamma_0
      tfb.Identity(),  # gamma_1
      tfb.Identity(),  # gamma_2
      tfb.Identity(),  # b
      tfb.Chain([tfb.Shift(shift=50.),
                 tfb.Scale(scale=50.),
                 tfb.Tanh()]),  # sigma_y
      tfb.Identity(),  # stl_pred
  ]

  kernel = tfp.mcmc.TransformedTransitionKernel(
      inner_kernel=hmc, bijector=unconstraining_bijectors)
  samples, kernel_results = tfp.mcmc.sample_chain(
      num_results=num_results,
      num_burnin_steps=num_burnin_steps,
      current_state=initial_state,
      kernel=kernel)

  acceptance_probs = tf.reduce_mean(
      tf.cast(kernel_results.inner_results.is_accepted, tf.float32), axis=0)

  return samples, acceptance_probs
ContextualEffectsPredictiveModel = collections.namedtuple(
    'ContextualEffectsPredictiveModel', [
        'sigma_a', 'eps_a', 'gamma_0', 'gamma_1', 'gamma_2', 'b', 'sigma_y',
        'stl_pred'
    ])

samples, acceptance_probs = sample_contextual_effects_predictive(
    num_chains=4, num_results=2000, num_burnin_steps=500)
print('Acceptance Probabilities: ', acceptance_probs.numpy())
contextual_effects_pred_samples = ContextualEffectsPredictiveModel._make(
    samples)
Acceptance Probabilities:  [0.981 0.9795 0.972 0.9705]
for var in [
    'sigma_a', 'gamma_0', 'gamma_1', 'gamma_2', 'b', 'sigma_y', 'stl_pred'
]:
  print(
      'R-hat for ', var, ': ',
      tfp.mcmc.potential_scale_reduction(
          getattr(contextual_effects_pred_samples, var)).numpy())
R-hat for  sigma_a :  1.0053602
R-hat for  gamma_0 :  1.0008001
R-hat for  gamma_1 :  1.0015156
R-hat for  gamma_2 :  0.99972683
R-hat for  b :  1.0045198
R-hat for  sigma_y :  1.0114483
R-hat for  stl_pred :  1.0045049
plot_traces('stl_pred', contextual_effects_pred_samples.stl_pred, num_chains=4)
Image in a Jupyter notebook
plot_posterior('stl_pred', contextual_effects_pred_samples.stl_pred)
Image in a Jupyter notebook
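The St. Louis example above is prediction type 1 (a new house in an existing county). For type 2, a new county has no estimated offset, so its offset is drawn from the population distribution N(0, sigma_a) using the posterior draws of sigma_a. Below is a minimal sketch, reusing `contextual_effects_samples` from above; the county-level predictor values `new_county_log_uranium` and `new_county_xbar` are hypothetical and not taken from the dataset.

# Illustrative only: posterior predictive draws for a house with no basement
# (floor = 1) in a county that was not in the training data.
s = contextual_effects_samples
new_county_log_uranium = 0.5  # hypothetical county-level log-uranium
new_county_xbar = 0.2         # hypothetical county mean of `floor`
eps_new = tfd.Normal(loc=0., scale=s.sigma_a).sample()  # new county offset
a_new = (s.gamma_0 + s.gamma_1 * new_county_log_uranium +
         s.gamma_2 * new_county_xbar + eps_new)
y_new = tfd.Normal(loc=a_new + s.b, scale=s.sigma_y).sample()
print('Predictive mean for the new county:', tf.reduce_mean(y_new).numpy())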

7 – Conclusions

Benefits of multilevel models:

  • They account for the natural hierarchical structure of observational data.

  • They estimate coefficients for under-represented groups.

  • They incorporate individual- and group-level information when estimating group-level coefficients.

  • They allow individual-level coefficients to vary across groups.

References

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.

Gelman, A. (2006). Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 48(3), 432–435.