GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/hierarchical-modelling.ipynb
⁴¹¹ views

Kernel: bayesian

In [ ]:

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [ ]:

import theano
theano.config.compute_test_value = 'off'
theano.config.compute_test_value

Introduction

The purpose of this notebook is to show Bayesian hierarchical models.

Hierarchical models are used in modelling things like baseball batting averages or housing prices. It's useful when we have some "population" parameter that we think influences an individual sample's parameter. Let's dive in to see how this works.

In [ ]:

# def ecdf_scatter(data):
#     x, y = np.sort(data), np.arange(1, len(data)+1) / len(data)
#     return x, y

# x1, y1 = ecdf_scatter(np.random.beta(1, 1, size=1000))
# x2, y2 = ecdf_scatter(np.random.beta(2, 10, size=1000))
# x3, y3 = ecdf_scatter(np.random.beta(46, 200, size=1000))

# fig = plt.figure()
# ax = fig.add_subplot(1,1,1)
# ax.scatter(x1, y1, label="(1, 1)")
# ax.scatter(x2, y2, label="(2, 10)")
# ax.scatter(x3, y3, label="(46, 200)")
# ax.legend()

We will use the baseball dataset from Efron & Morris. It's pretty famous, and can be downloaded from the PyMC3 website.

In [ ]:

data = pd.read_csv('datasets/baseball.tsv', sep='\t')
# data = pm.floatX(data)
data.dtypes
datacols = list(data.columns)
datacols.remove('FirstName')
datacols.remove('LastName')
data[datacols] = pm.floatX(data[datacols])
data.dtypes

Given this population data, let's now say that we have a rookie player with just six at bats and all of them were hit. Is this a genius player to keep for the season? Bayesian statistics may help us get to an answer.

In [ ]:

# Firstly, we estimate the population batting average.
with pm.Model() as baseball_model:
    atbats = pm.floatX(data['SeasonAt-Bats'].values)
    hits = pm.floatX(data['SeasonHits'].values)
    
    a = pm.Exponential('alpha_prior', 
                       lam=5) # testval=a_testval)
    b = pm.Exponential('beta_prior', 
                       lam=5) #testval=b_testval)
    p_population = pm.Beta('p_population', 
                           alpha=a, 
                           beta=b) # testval=p_testval)
    likelihood = pm.Binomial('likelihood', 
                             n=atbats, 
                             p=p_population, 
                             observed=hits)
    trace = pm.sample(2000)

In [ ]:

with baseball_model:
    pm.traceplot(trace, varnames=['p_population'])

In [ ]:

with pm.Model() as new_player_model:
    p_prior =

Introduction

Product

Resources

Company