Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
AllenDowney
GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/hierarchical-modelling.ipynb
411 views
Kernel: bayesian
import pymc3 as pm import numpy as np import matplotlib.pyplot as plt import pandas as pd %load_ext autoreload %autoreload 2 %matplotlib inline %config InlineBackend.figure_format = 'retina'
import theano theano.config.compute_test_value = 'off' theano.config.compute_test_value

Introduction

The purpose of this notebook is to show Bayesian hierarchical models.

Hierarchical models are used in modelling things like baseball batting averages or housing prices. It's useful when we have some "population" parameter that we think influences an individual sample's parameter. Let's dive in to see how this works.

# def ecdf_scatter(data): # x, y = np.sort(data), np.arange(1, len(data)+1) / len(data) # return x, y # x1, y1 = ecdf_scatter(np.random.beta(1, 1, size=1000)) # x2, y2 = ecdf_scatter(np.random.beta(2, 10, size=1000)) # x3, y3 = ecdf_scatter(np.random.beta(46, 200, size=1000)) # fig = plt.figure() # ax = fig.add_subplot(1,1,1) # ax.scatter(x1, y1, label="(1, 1)") # ax.scatter(x2, y2, label="(2, 10)") # ax.scatter(x3, y3, label="(46, 200)") # ax.legend()

We will use the baseball dataset from Efron & Morris. It's pretty famous, and can be downloaded from the PyMC3 website.

data = pd.read_csv('datasets/baseball.tsv', sep='\t') # data = pm.floatX(data) data.dtypes datacols = list(data.columns) datacols.remove('FirstName') datacols.remove('LastName') data[datacols] = pm.floatX(data[datacols]) data.dtypes

Given this population data, let's now say that we have a rookie player with just six at bats and all of them were hit. Is this a genius player to keep for the season? Bayesian statistics may help us get to an answer.

# Firstly, we estimate the population batting average. with pm.Model() as baseball_model: atbats = pm.floatX(data['SeasonAt-Bats'].values) hits = pm.floatX(data['SeasonHits'].values) a = pm.Exponential('alpha_prior', lam=5) # testval=a_testval) b = pm.Exponential('beta_prior', lam=5) #testval=b_testval) p_population = pm.Beta('p_population', alpha=a, beta=b) # testval=p_testval) likelihood = pm.Binomial('likelihood', n=atbats, p=p_population, observed=hits) trace = pm.sample(2000)
with baseball_model: pm.traceplot(trace, varnames=['p_population'])
with pm.Model() as new_player_model: p_prior =