GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/docs/notebooks/markov-models.ipynb
⁴¹⁹ views

Kernel: Python 3

In [1]:

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [2]:

import pymc3 as pm

Markov Models From The Bottom Up, with Python

Markov models are a useful class of models for sequential-type of data. Before recurrent neural networks (which can be thought of as an upgraded Markov model) came along, Markov Models and their variants were the in thing for processing time series and biological data.

Just recently, I was involved in a project with a colleague, Zach Barry, where we thought the use of autoregressive hidden Markov models (AR-HMMs) might be a useful thing. Apart from our hack session one afternoon, it set off a series of self-study that culminated in this essay. By writing this down for my own memory, my hope is that it gives you a resource to refer back to as well.

You'll notice that I don't talk about inference (i.e. inferring parameters from data) until the end: this is intentional. As I've learned over the years doing statistical modelling and machine learning, nothing makes sense without first becoming deeply familiar with the "generative" story of each model, i.e. the algorithmic steps that let us generate data. It's a very Bayesian-influenced way of thinking that I hope you will become familiar with too.

Markov Models (HMMs): What they are, with in mostly plain English and some math

The simplest Markov Models assume that we have a system that contains a finite set of states, and that the system transitions between these states with some probability at each time step $t$ , thus generating a sequence of states over time. Let's call these states [ABD: a little ambiguity here if S is the set of states of the sequence over time] $S$ , where

S = \{s_1, s_2, ..., s_n\}

To keep things simple, let's stick with unique prime numbers and go with a three-state model:

S = \{s_1, s_2, s_3\}

They thus generate a sequence of states, with one possible realization being:

\{s_1, s_1, s_1, s_3, s_3, s_3, s_2, s_2, s_3, s_3, s_3, s_3, s_1, ...\}

Initializing a Markov chain

Every Markov chain needs to be initialized. To do so, we need an initial state probability matrix, which tells us what the distribution of initial states will be. Let's call the matrix $p_S$ , where the subscript $S$ indicates that it is for the "states".

p_S = \begin{pmatrix} p_1 & p_2 & p_3 \end{pmatrix}

Semantically, they allocate the probabilities of starting the sequence at a given state. For example, we might assume a discrete uniform distribution, which in Python would look like:

import numpy as np
p_init = np.array([1/3., 1/3., 1/3.])

Alternatively, we might assume a fixed starting point, which can be expressed as the $p_S$ array:

p_init = np.array([0, 1, 0])

Alternatively, we might assign non-zero probabilities to each in a non-uniform fashion:

# State 0: 0.1 probability
# State 1: 0.8 probability
# State 2: 0.1 probability
p_init = np.array([0.1, 0.8, 0.1])

Finally, we might assume that the system was long-running before we started observing the sequence of states, and as such the initial state was drawn as one realization of some equilibrated distribution of states. Keep this idea in your head, as we'll need it later.

For now, just to keep things concrete, let's specify an initial distribution as a non-uniform probability vector.

In [4]:

import numpy as np
p_init = np.array([0.1, 0.8, 0.1])

Modelling transitions between states

To know how a system transitions between states, we now need a transition matrix. The transition matrix describes the probability of transitioning from one state to another. (The probability of staying in the same state is semantically equivalent to transitioning to the same state.)

By convention, transition matrix rows correspond to the state at time $t$ , while columns correspond to state at time $t+1$ . Hence, row probabilities sum to one, because the probability of transitioning to the next state depends on only the current state, and all possible states are known and enumerated.

Let's call the transition matrix $p_T$ . The symbol etymology, which usually gets swept under the rug in mathematically-oriented papers, are as follows:

$T$ doesn't refer to time but simply indicates that it is for transitioning states,
$p$ is used because it is a probility matrix.

p_T = \begin{pmatrix} p_{11} & p_{12} & p_{13}\\ p_{21} & p_{22} & p_{23}\\ p_{31} & p_{32} & p_{33}\\ \end{pmatrix}

Using the transition matrix, we can express that the system likes to stay in the state that it enters into, by assigning larger probability mass to the diagonals. Alternatively, we can express that the system likes to transition out of states that it enters into, by assigning larger probability mass to the off-diagonal.

Alrighty, enough with that now, let's initialize a transition matrix below.

In [5]:

p_transition = np.array(
    [[0.90, 0.05, 0.05], 
     [0.01, 0.90, 0.09], 
     [0.07, 0.03, 0.9]]
)
p_transition

Out[5]:

array([[0.9 , 0.05, 0.05],
       [0.01, 0.9 , 0.09],
       [0.07, 0.03, 0.9 ]])

And just to confirm with you that each row sums to one:

In [6]:

assert p_transition[0, :].sum() == 1
assert p_transition[1, :].sum() == 1
assert p_transition[2, :].sum() == 1

Equilibrium or Stationary Distribution

Now, do you remember how above we talked about the Markov chain being in some "equilibrated" state? Well, the stationary or equilibrium distribution of a Markov chain is the distribution of observed states at infinite time. An interesting property is that regardless of what the initial state is, the equilibrium distribution will always be the same. ABD I think this can't always be true. For example, if the transition graph is not connected. But I don't know the general requirements.

The math, as it turns out, is also nothing more than a sequence of dot products between the state probability vector and the transition matrix. ABD There's a bit of a leap here. Maybe show how to generate a sequence of states, then show how to update a probability vector, then see how the updates behave in the long run?

ABD It looks like you are going all NumPy; do you have any interest in bringing Pandas into the mix? In that case, you could make a DataFrame with one row per timestep, one column per state, and then plotting would be more natural.

In [7]:

p_init_example = np.array([0.1, 0.8, 0.1])

In [8]:

p_state_t = [p_init_example]
p_transition_example = np.array(
    [[0.6,  0.2, 0.2], 
     [0.05, 0.9, 0.05], 
     [0.1,  0.2, 0.7]]
)

for i in range(200):  # 200 time steps sorta, kinda, approximates infinite time :)
    p_state_t.append(p_state_t[-1] @ p_transition_example)

In [9]:

import matplotlib.pyplot as plt
plt.plot(np.vstack(p_state_t))

Out[9]:

[<matplotlib.lines.Line2D at 0x7f493ed7ddc0>,
 <matplotlib.lines.Line2D at 0x7f493ed7de20>,
 <matplotlib.lines.Line2D at 0x7f493ed7dfa0>]

If you're viewing this notebook on Binder or locally, go ahead and modify the initial state to convince yourself that it doesn't matter what the initial state will be, the final state will always be the same as long as the transition matrix stays the same.

In [10]:

print(p_state_t[-1])

Out[10]:

[0.13333333 0.66666667 0.2       ]

As it turns out, there's also a way to solve for the equilibrium distribution analytically from the transition matrix. This involves solving a linear algebra problem, which we can do using Python. (Credit goes to this blog post from which I modified the code to fit the variable naming here.)

ABD Maybe break up some of those long lines of code?

In [11]:

def equilibrium_distribution(p_transition):
    n_states = p_transition.shape[0]
    A = np.append(p_transition.T - np.eye(n_states), np.ones(n_states).reshape(1, -1), axis=0)
    b = np.transpose(np.array([0] * n_states + [1]))
    p_eq = np.linalg.solve(np.transpose(A).dot(A), np.transpose(A).dot(b))
    return p_eq

print(equilibrium_distribution(p_transition_example))

Out[11]:

[0.13333333 0.66666667 0.2       ]

Generating a Markov Sequence

Generating a Markov sequence means we "forward" simulate the chain by:

(1) Optionally drawing an initial state from $p_S$ (let's call that $s_{t}$ ). This is done by drawing from a multinomial distribution:

s_t \sim Multinomial(1, p_S)

If we assume (and keep in mind that we don't have to) that the system was equilibrated before we started observing its state sequence, then the initial state distribution is equivalent to the equilibrium distribution. All this means that we don't necessarily have to specify the initial distribution explicitly.

(2) Drawing the next state by indexing into the transition matrix $p_T$ , and drawing a new state based on the Multinomial distribution:

s_{t+1} \sim Multinomial(1, p_{T_i})

where $i$ is the index of the state.

In Python code:

ABD It might be simpler to use np.random.choice with weights.

In [12]:

from scipy.stats import multinomial
from typing import List

def markov_sequence(p_init: np.array, p_transition: np.array, sequence_length: int) -> List[int]:
    """
    Generate a Markov sequence based on p_init and p_transition.
    """
    if p_init is None:
        p_init = equilibrium_distribution(p_transition)
    initial_state = list(multinomial.rvs(1, p_init)).index(1)
    
    states = [initial_state]
    for i in range(sequence_length - 1):
        p_tr = p_transition[states[-1]]
        new_state = list(multinomial.rvs(1, p_tr)).index(1)
        states.append(new_state)
    return states

With this function in hand, let's generate a sequence of length 1000.

In [13]:

import seaborn as sns

states = markov_sequence(p_init, p_transition, sequence_length=1000)
plt.plot(states)
plt.xlabel("time step")
plt.ylabel("state")
sns.despine()

Out[13]:

As is pretty evident from the transition probabilities, once the Markov chain enters a state, it tends not to move out of it.

If you've opened up this notebook in Binder or locally, feel free to modify the transition probabilities and initial state probabilities above to see how the Markov sequence changes.

Emissions: When Markov chains not only produce "states", but also observable data

So as you've seen above, a Markov chain can produce "states". If we are given direct access to the "states", then a problem that we may have is inferring the transition probabilities given the states.

A more common scenario, however, is that the states are latent, i.e. we cannot directly observe them. Instead, the latent states generate data that are given by some distribution conditioned on the state. We call these Hidden Markov Models.

That all sounds abstract, so let's try to make it more concrete.

Gaussian Emissions: When Markov chains emit Gaussian-distributed data.

With a three state model, we might say that the emissions are Gaussian distributed, but the location ( $\mu$ ) and scale ( $\sigma$ ) vary based on which state we are in. In the simplest case:

State 1 gives us data $y_1 \sim N(\mu=1, \sigma=0.2)$
State 2 gives us data $y_2 \sim N(\mu=0, \sigma=0.5)$
State 3 gives us data $y_3 \sim N(\mu=-1, \sigma=0.1)$

Turns out, we can model this in Python code too! ABD Well, Python is Turing complete, so we shouldn't be too surprised 😃

ABD: Maybe use np.random.normal

In [14]:

from scipy.stats import norm
def gaussian_emissions(states: List[int], mus: List[float], sigmas: List[float]) -> List[float]:
    emissions = []
    for state in states:
        loc = mus[state]
        scale = sigmas[state]
        e = norm.rvs(loc=loc, scale=scale)
        emissions.append(e)
    return emissions

Let's see what the emissions look like.

In [15]:

import seaborn as sns

gaussian_ems = gaussian_emissions(states, mus=[1, 0, -1], sigmas=[0.2, 0.5, 0.1])

def plot_emissions(states, emissions):
    fig, axes = plt.subplots(figsize=(16, 8), nrows=2, ncols=1, sharex=True)

    axes[0].plot(states)
    axes[0].set_title("States")
    axes[1].plot(emissions)
    axes[1].set_title("Emissions")
    sns.despine();
    
plot_emissions(states, gaussian_ems)

Out[15]:

Emission Distributions can be any valid distribution!

ABD: It might be enough to say this; I'm not sure we need another demo.

Nobody said we have to use Gaussian distributions for emissions; we can, in fact, have a ton of fun and start simulating data using other distributions!

Let's try Poisson emissions. Here, then, the poisson rate $\lambda$ is given one per state. In our example below:

State 1 gives us data $y_1 \sim Pois(\lambda=1)$
State 2 gives us data $y_2 \sim Pois(\lambda=10)$
State 3 gives us data $y_3 \sim Pois(\lambda=50)$

In [16]:

from scipy.stats import poisson
def poisson_emissions(states: List[int], lam: List[float]) -> List[int]:
    emissions = []
    for state in states:
        rate = lam[state]
        e = poisson.rvs(rate)
        emissions.append(e)
    return emissions

Once again, let's observe the emissions:

In [17]:

poisson_ems = poisson_emissions(states, lam=[1, 10, 50])
plot_emissions(states, poisson_ems)

Out[17]:

Hope the point is made: Take your favourite distribution and use it as the emission distribution!

Autoregressive Emissions

ABD: "Interesting" is kind of a weak transition; as a reader, I am losing the thread of where we are headed. I know you said you wanted to do a thorough tour of generation before inference, but you might need to do more to motivate generation.

Autoregressive emissions make things even more interesting and flexible! The "autoregressive" component tells us that the emission value does not only depend on the current state, but also on previous state(s).

How, though, can we enforce this dependency structure? Well, as implied by the term "structure", it means we have some set of equations that relate the parameters of the emission distribution to the value of the previous emission.

ABD A little confusing here: previous paragraph says it depends on previous states; this paragraph says it depends on previous emissions.

Heteroskedastic Autoregressive Emissions

Here's a "simple complex" example, where the location $\mu_t$ of the emission distribution at time $t$ depends on $y_{t-1}$ , and only the scale $\sigma$ depends only on the state.

ABD: Can you give an example of an application?

y_t \sim N(\mu=k y_{t-1}, \sigma=\sigma_{s_t})

Here, $k$ is an autoregressive coefficient that describes the strength of dependence on the previous state. We might also assume that the initial location $\mu=0$ . Because the scale $\sigma$ varies with state, the emissions are called heteroskedastic, which means "of non-constant variance". In the example below:

State 1 gives us $\sigma=0.5$ (kind of small variance).
State 2 gives us $\sigma=0.1$ (smaller variance).
State 3 gives us $\sigma=0.01$ (very small varaince).

In Python code, we would model it this way:

In [28]:

def ar_gaussian_heteroskedastic_emissions(states: List[int], k: float, sigmas: List[float]) -> List[float]:
    emissions = []
    prev_loc = 0
    for i, state in enumerate(states):
        e = norm.rvs(loc=k * prev_loc, scale=sigmas[state])
        emissions.append(e)
        prev_loc = e
    return emissions

In [29]:

ar_het_ems = ar_gaussian_heteroskedastic_emissions(states, k=1, sigmas=[0.5, 0.1, 0.01])
plot_emissions(states, ar_het_ems)

Out[29]:

Contrast that against vanilla Gaussian emissions that are non-autoregressive:

In [43]:

plot_emissions(states, gaussian_ems)

Out[43]:

Keep in mind, here, that regardless of what the emissions are, it is the variance around the heteroskedastic autoregressive emissions that gives us information about the state, not the location_. ABD Extra underline character

How does the autoregressive coefficient $k$ affect the Markov chain emissions?

As should be visible, the structure of autoregressiveness can really change how things look! What happens as $k$ changes?

In [44]:

ar_het_ems = ar_gaussian_heteroskedastic_emissions(states, k=1, sigmas=[0.5, 0.1, 0.01])
plot_emissions(states, ar_het_ems)

Out[44]:

In [45]:

ar_het_ems = ar_gaussian_heteroskedastic_emissions(states, k=0, sigmas=[0.5, 0.1, 0.01])
plot_emissions(states, ar_het_ems)

Out[45]:

Interesting stuff! As $k \rightarrow 0$ , we approach a heteroskedastic Gaussian random walk centered exactly on zero (which is exactly what the mean of the Gaussian emissions would place it ABD I don't understand this parenthetical remark), where only the variance of the observations, rather than the location of the observations, give us information about the state.

Homoskedastic Autoregressive Emissions

What if we wanted instead the variance to remain the same, but desired instead that the emission location $\mu$ gives us information about the state while still being autoregressive? Well, we can bake that into the equation structure!

y_t \sim N(\mu=k y_{t-1} + \mu_t, \sigma=1)

In Python code:

In [46]:

def ar_gaussian_homoskedastic_emissions(states: List[int], k: float, mus: List[float]) -> List[float]:
    emissions = []
    prev_loc = 0
    for i, state in enumerate(states):
        e = norm.rvs(loc=k * prev_loc + mus[state], scale=1)
        emissions.append(e)
        prev_loc = e
    return emissions

In [47]:

ar_hom_ems = ar_gaussian_homoskedastic_emissions(states, k=1, mus=[-10, 0, 10])
plot_emissions(states, ar_hom_ems)

Out[47]:

The variance is too small relative to the scale of the data, so it looks like smooth lines.

If we change $k$ , however, we get interesting effects.

In [48]:

ar_hom_ems = ar_gaussian_homoskedastic_emissions(states, k=0.8, mus=[-10, 0, 10])
plot_emissions(states, ar_hom_ems)

Out[48]:

Notice how we get "smoother" transitions into each state. It's less jumpy. This is extremely useful for modelling motion activity, for example, where people move into and out of states without having jumpy-switching. (We don't go from sitting to standing to walking by jumping frames, we ease into each.) ABD I like the connection to an application here.

Non-Autoregressive Homoskedastic Emissions

With non-autoregressive homoskedastic emissions, the mean gives us information, but the scale doesn't, and at the same time, the mean depends only on the state, and not on the previous state.

In [49]:

def gaussian_homoskedastic_emissions(states: List[int], mus: List[float]) -> List[float]:
    emissions = []
    
    prev_loc = 0
    for i, state in enumerate(states):
        e = norm.rvs(loc=mus[state], scale=1)
        emissions.append(e)
        prev_loc = e
    return emissions

In [50]:

hom_ems = gaussian_homoskedastic_emissions(states, mus=[-10, 0, 10])
plot_emissions(states, hom_ems)

Out[50]:

As you might intuit from looking at the equations, this is nothing more than a special case of the Heteroskedastic Gaussian Emissions example shown much earlier above.

Summary of MMs all the way to AR-HMMs

There's the plain old Markov Model, in which we might generate a sequence of states $S$ , which are generated from some initial distribution and transition matrix.

p_S = \begin{pmatrix} p_1 & p_2 & p_3 \end{pmatrix}

p_T = \begin{pmatrix} p_{11} & p_{12} & p_{13}\\ p_{21} & p_{22} & p_{23}\\ p_{31} & p_{32} & p_{33}\\ \end{pmatrix}

S = \{s_t, s_{t+1}, ... s_{t+n}\}

Then there's the "Hidden" Markov Model, in which we don't observe the states but rather the emissions generated from the states (according to some assumed distribution). Now, there's not only the initial distribution and transition matrix to worry about, but also the distribution of the emissions conditioned on the state. The general case is when we have some arbitrary distribution ABD To me, arbitrary distribution includes nonparametric, but then your examples are all parametric.(i.e. the Gaussian or the Poisson or the Chi-Squared - whichever fits the likelihood of your data best).

y_t|s_t \sim Dist(\theta_{t})

Where $\theta_t$ refers to the parameters for the generic distribution $Dist$ that are indexed by the state $s_t$ . Your distributions probably generally come from the same family (e.g. "Gaussians"), or you can go super complicated and generate them from different distributions.

In special cases, the parameters of the emission distribution can be held constant (i.e. simple random walks), or they can depend on the state (i.e. basic HMMs). If you make the variance of the likelihood distribution vary based on state, you get heteroskedastic HMMs; conversely, if you keep the variance constant, then you have homoskedastic HMMs.

Then, there's the "Autoregressive" Hidden Markov Models, in which the emissions generated from the states have a dependence on the previous states. Here, we have the ultimate amount of flexibility to model our processes.

y_t|s_t \sim Dist(f(y_{t-1}, \theta_t))

To keep things simple in this essay, we've only considered the case of lag of 1 (which is where the $t-1$ comes from). However, arbitrary numbers of time lags are possible too!

And, as usual, you can make them homoskedastic or heteroskedastic by simply controlling the variance parameter of the $Dist$ distribution.

Bonus point: your inputs ABD: Not immediately clear what inputs you mean. don't necessarily have to be single dimensional; they can be multidimensional too! As long as you write the $f(y_{t-1}, \theta_t)$ in a fashion that handles $y$ that are multidimensional, you're golden! Moreover, you can also write the function $f$ to be any function you like; it doesn't have to be a linear function (like we did); it can instead be a neural network if you so choose to do so, thus giving you a natural progression from Markov models to Recurrent Neural Networks. That, however, is out of scope for this essay.

Bayesian Inference on Markov Models

Now that we've gone through the "data generating process" for Markov sequences with emissions, we can re-examine the entire class of models in a Bayesian light.

If you've been observing the models that we've been "forward-simulating" all this while to generate data, you'll notice that there are a few key parameters that seemed like, "well, if we changed them, then the data would change, right?" If that's what you've been thinking, then bingo! You're on the right track.

Moreover, you'll notice that I've couched everything in the language of probability distributions. The transition probabilities $P(s_t | s_{t-1})$ are given by a Multinomial distribution. The emission probabilities are given by an arbitrary continuous (or discrete) distribution, depending on what the likelihood of the data are ABD this sentence is a little awkward. Given that we're working with probability distributions and data, you probably have been thinking about it already: we need a way to calculate the log-likelihoods of the data that we observe!

Markov Chain Log-Likelihood Calculation

ABD Why log-likelihoods, as opposed to just the likelihood of the data?

Let's examine how we would calculate the log likelihood of state data given the parameters. This will lead us to the Markov chain log-likelihood.

Since $P(s_t|s_{t-1})$ is a multinomial distribution, then if we are given the log-likelihood of $\{s_1, s_2, s_3, ..., s_n\}$ , we can calculate the log-likelihood over $\{s_2,... s_n\}$ , which is given by the sum of the log probabilities. This follows from the factorization of a Markov chain, which is out of scope for this essay, so if this trips you up, don't worry - take a hiatus from the essay and draw it out. Otherwise, take my word for it for now:

ABD I think you might have to explain what factorization is.

In [51]:

def state_logp(states, p_transition):
    logp = 0
    
    # states are 0, 1, 2, but we model them as [1, 0, 0], [0, 1, 0], [0, 0, 1]
    states_oh = np.eye(len(p_transition))
    for curr_state, next_state in zip(states[:-1], states[1:]):
        p_tr = p_transition[curr_state]        
        logp += multinomial(n=1, p=p_tr).logpmf(states_oh[next_state])
    return logp

state_logp(states, p_transition)

Out[51]:

-399.27943892038746

ABD: There were some big jumps; I think you are going to lose a lot of people here

ABD: On the other hand, I see that we are getting some work out of stats.Multinomial after all, so I withdraw my earlier suggestion

We will also write a vectorized version of state_logp.

In [52]:

def state_logp_vect(states, p_transition):
    states_oh = np.eye(len(p_transition))
    p_tr = p_transition[states[:-1]]
    obs = states_oh[states[1:]]
    return np.sum(multinomial(n=1, p=p_tr).logpmf(obs))

state_logp_vect(states, p_transition)

Out[52]:

-399.27943892039116

Now, there is a problem here: we also need the log likelihood of the first state.

Remember that if we don't know what the initial distribution is supposed to be, one possible assumption we can make is that the Markov sequence began by drawing from the equilibrium distribution. Here is where equilibrium distribution calculation from before comes in handy!

In [53]:

def initial_logp(states, p_transition):
    initial_state = states[0]
    states_oh = np.eye(len(p_transition))
    eq_p = equilibrium_distribution(p_transition)
    return (
        multinomial(n=1, p=eq_p)
        .logpmf(states_oh[initial_state].squeeze())
    )

initial_logp(states, p_transition)

Out[53]:

array(-1.27665118)

Taken together, we get the following Markov chain log-likelihood:

In [54]:

def markov_state_logp(states, p_transition):
    return (
        state_logp_vect(states, p_transition)
        + initial_logp(states, p_transition)
    )

markov_state_logp(states, p_transition)

Out[54]:

-400.55609010406124

Markov Chain with Gaussian Emissions Log-Likelihood Calculation

Now that we know how to calculate the log-likelihood for the Markov chain sequence of states, we can now move on to the log-likelihood calculation for the emissions.

Let's first assume that we have emissions that are non-autoregressive, and have a Gaussian likelihood.

In [55]:

def gaussian_logp(states, mus, sigmas, emissions):
    logp = 0
    for (emission, state) in zip(emissions, states):
        logp += norm(mus[state], sigmas[state]).logpdf(emission)
    return logp

gaussian_logp(states, mus=[1, 0, -1], sigmas=[0.2, 0.5, 0.1], emissions=gaussian_ems)

Out[55]:

174.79359697031498

And we'll also make a vectorized version of it:

ABD Maybe just do the vectorized version? I'm not sure the for-loop version really helps me.

In [56]:

def gaussian_logp_vect(states, mus, sigmas, emissions):
    mu = mus[states]
    sigma = sigmas[states]
    return np.sum(norm(mu, sigma).logpdf(emissions))

gaussian_logp_vect(states, mus=np.array([1, 0, -1]), sigmas=np.array([0.2, 0.5, 0.1]), emissions=gaussian_ems)

Out[56]:

174.7935969703148

The joint log likelihood of the emissions and states are then given by their summation.

In [57]:

def gaussian_emission_hmm_logp(states, p_transition, mus, sigmas, emissions):
    return markov_state_logp(states, p_transition) + gaussian_logp_vect(states, mus, sigmas, emissions)

gaussian_emission_hmm_logp(states, p_transition, mus=np.array([1, 0, -1]), sigmas=np.array([0.2, 0.5, 0.1]), emissions=gaussian_ems)

Out[57]:

-225.76249313374643

If you're in a Binder or local Jupyter session, go ahead and tweak the values of mus and sigmas, and verify for yourself that with the current values passed in, they are the "maximum likelihood" values. After all, our Gaussian emission data were generated according to this exact set of parameters!

Markov Chain with Autoregressive Gaussian Emissions Log-Likelihood Calculation

ABD Again, I think I would rather see a simple example done end-to-end before adding complexity.

I hope the pattern is starting to be clear here: since we have Gaussian emissions, we only have to calculate the parameters of the Gaussian to know what the logpdf would be.

As an example, I will be using the Gaussian with:

State-varying scale
Mean that is dependent on the previously emitted value

This is the AR-HMM with data generated from the ar_gaussian_heteroskedastic_emissions function.

In [58]:

def ar_gaussian_heteroskedastic_emissions_logp(states, k, sigmas, emissions):
    logp = 0
    initial_state = states[0]
    initial_emission_logp = norm(0, sigmas[initial_state]).logpdf(emissions[0])
    for previous_emission, current_emission, state in zip(emissions[:-1], emissions[1:], states[1:]):
        loc = k * previous_emission
        scale = sigmas[state]
        logp += norm(loc, scale).logpdf(current_emission)
    return logp

ar_gaussian_heteroskedastic_emissions_logp(states, k=1.0, sigmas=[0.5, 0.1, 0.01], emissions=ar_het_ems)

Out[58]:

-8927.259205506201

Now, we can write the full log likelihood of the entire AR-HMM:

In [59]:

def ar_gausian_heteroskedastic_hmm_logp(states, p_transition, k, sigmas, emissions):
    return (
        markov_state_logp(states, p_transition)
        + ar_gaussian_heteroskedastic_emissions_logp(states, k, sigmas, emissions)
    )


ar_gausian_heteroskedastic_hmm_logp(states, p_transition, k=1.0, sigmas=[0.5, 0.1, 0.01], emissions=ar_het_ems)

Out[59]:

-9327.815295610262

For those of you who are familiar with Bayesian inference, as soon as we have a log likelihood that we can calculate, once we tack on priors, using the simple Bayes' rule equation, we can obtain posterior distributions easily by chaining on an MCMC sampler.

ABD: With the tacking on and the chaining, this sentence is weighed down with metaphor.

If this looks all foreign to you, then be check out my other essay for a first look (or a refresher)!

HMM Distributions in PyMC3

While PyMC4 is in development, PyMC3 remains one of the leading probabilistic programming languages that can be used for Bayesian inference. PyMC3 doesn't have the HMM distribution defined in the library, but thanks to GitHub user @hstrey posting a Jupyter notebook with HMMs defined in there, many PyMC3 users have had a great baseline distribution to study pedagogically and use in their applications, myself included.

Side note: I used @hstrey's implementation before setting out to write this essay. Thanks!

HMM States Distribution

Let's first look at the HMM States distribution, which will give us a way to calculate the log probability of the states.

ABD This is a pretty big block of code to process.

In [61]:

import pymc3 as pm
import theano.tensor as tt
import theano.tensor.slinalg as sla  # theano-wrapped scipy linear algebra
import theano

theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"

class HMMStates(pm.Categorical):
    def __init__(self, p_transition, p_equilibrium, n_states, *args, **kwargs):
        super(pm.Categorical, self).__init__(*args, **kwargs)
        self.p_transition = p_transition
        self.p_equilibrium = p_equilibrium
        # This is needed
        self.k = n_states
        # This is only needed because discrete distributions must define a mode.
        self.mode = tt.cast(0,dtype='int64')

    def logp(self, x):
        p_eq = self.p_equilibrium
        # Broadcast out the transition probabilities.
        p_tr = self.p_transition[x[:-1]]
        
        # the logp of the initial state evaluated against the equilibrium probabilities
        initial_state_logp = pm.Categorical.dist(p_eq).logp(x[0])

        # the logp of the rest of the states.
        x_i = x[1:]
        ou_like = pm.Categorical.dist(p_tr).logp(x_i)
        transition_logp = tt.sum(ou_like)
        return initial_state_logp + transition_logp

Above, the categorical distribution is used for convenience - it can handle integers, while multinomial requires the one-hot transformation.

ABD: Maybe go back and use Categorical throughout?

Now, we stated earlier on that the transition matrix can be treated as a parameter to tweak, or else a random variable for which we want to infer its parameters. This means there is a natural fit for placing priors on them! Dirichlet distributions are great priors for probability vectors, as they are the generalization of Beta distributions.

In [62]:

def solve_equilibrium(n_states, p_transition):
    A = tt.dmatrix('A')
    A = tt.eye(n_states) - p_transition + tt.ones(shape=(n_states, n_states))
    p_equilibrium = pm.Deterministic("p_equilibrium", sla.solve(A.T, tt.ones(shape=(n_states))))
    return p_equilibrium

In [63]:

import numpy as np

n_states = 3
with pm.Model() as model:
    p_transition = pm.Dirichlet("p_transition", a=tt.ones((n_states, n_states)), shape=(n_states, n_states))
    
    # Solve for the equilibrium state
    p_equilibrium = solve_equilibrium(n_states, p_transition)

    obs_states = HMMStates(
        "states", 
        p_transition=p_transition, 
        p_equilibrium=p_equilibrium, 
        n_states=n_states,
        observed=np.array(states).astype("float")
    )

Out[63]:

/home/downey/anaconda3/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
/home/downey/anaconda3/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
/home/downey/anaconda3/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])

Now let's fit the model!

In [41]:

with model:
    trace = pm.sample(2000, cores=1, chains=4)

Out[41]:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
Sequential sampling (4 chains in 1 job)
NUTS: [p_transition]
Sampling chain 0, 0 divergences:   0%|          | 0/2500 [00:00<?, ?it/s]/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
Sampling chain 0, 0 divergences:   0%|          | 1/2500 [00:00<20:01,  2.08it/s]/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/slinalg.py:255: LinAlgWarning: Ill-conditioned matrix (rcond=5.89311e-08): result may not be accurate.
  rval = scipy.linalg.solve(A, b)
Sampling chain 0, 0 divergences: 100%|██████████| 2500/2500 [00:06<00:00, 385.54it/s]
Sampling chain 1, 0 divergences: 100%|██████████| 2500/2500 [00:05<00:00, 446.94it/s]
Sampling chain 2, 0 divergences: 100%|██████████| 2500/2500 [00:05<00:00, 438.24it/s]
Sampling chain 3, 0 divergences: 100%|██████████| 2500/2500 [00:05<00:00, 451.05it/s]
/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])

In [42]:

import arviz as az

az.plot_forest(trace, var_names=["p_transition"])

Out[42]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fae13aace50>],
      dtype=object)

It looks like we were able to recover the original transitions!

HMM with Gaussian Emissions

Let's try out now an HMM model with Gaussian emissions.

In [43]:

class HMMGaussianEmissions(pm.Continuous):
    def __init__(self, states, mu, sigma, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.states = states
        # self.rate = rate
        self.mu = mu
        self.sigma = sigma
    
    def logp(self, x):
        """
        x: observations
        """
        states = self.states
        # rate = self.rate[states]  # broadcast the rate across the states.
        mu = self.mu[states]
        sigma = self.sigma[states]
        return tt.sum(pm.Normal.dist(mu=mu, sigma=sigma).logp(x))

In [44]:

n_states = 3
with pm.Model() as model:
    # Priors for transition matrix
    p_transition = pm.Dirichlet("p_transition", a=tt.ones((n_states, n_states)), shape=(n_states, n_states))
    
    # Solve for the equilibrium state
    p_equilibrium = solve_equilibrium(n_states, p_transition)

    # HMM state
    hmm_states = HMMStates(
        "hmm_states", 
        p_transition=p_transition, 
        p_equilibrium=p_equilibrium, 
        n_states=n_states, 
        shape=(len(poisson_ems),)
    )
    
    # Prior for mu and sigma
    mu = pm.Normal("mu", mu=0, sigma=1, shape=(n_states,))
    sigma = pm.Exponential("sigma", lam=2, shape=(n_states,))
    
    # Observed emission likelihood
    obs = HMMGaussianEmissions(
        "emission", 
        states=hmm_states, 
        mu=mu, 
        sigma=sigma, 
        observed=gaussian_ems
    )

In [46]:

with model:
    trace = pm.sample(2000, cores=1, chains=4)

Out[46]:

/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
Sequential sampling (4 chains in 1 job)
CompoundStep
>NUTS: [sigma, mu, p_transition]
>CategoricalGibbsMetropolis: [hmm_states]
Sampling chain 0, 0 divergences: 100%|██████████| 2500/2500 [10:14<00:00,  4.07it/s]
Sampling chain 1, 40 divergences: 100%|██████████| 2500/2500 [10:06<00:00,  4.12it/s]
Sampling chain 2, 31 divergences: 100%|██████████| 2500/2500 [10:07<00:00,  4.11it/s]
Sampling chain 3, 8 divergences: 100%|██████████| 2500/2500 [10:03<00:00,  4.14it/s]
There were 149 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.501064229143012, but should be close to 0.8. Try to increase the number of tuning steps.
There were 226 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.4047283378922605, but should be close to 0.8. Try to increase the number of tuning steps.
There were 236 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.3759467589119244, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

In [47]:

az.plot_trace(trace, var_names=["mu"])

Out[47]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a5101f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a4965b0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a4a6940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a4c5e80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae13ca5ee0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a4841f0>]],
      dtype=object)

In [48]:

az.plot_trace(trace, var_names=["sigma"])

Out[48]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a22f100>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a2ca1c0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a3359d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a237460>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a2677f0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a1d0ee0>]],
      dtype=object)

In [49]:

az.plot_forest(trace, var_names=["sigma"])

Out[49]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a3cca60>],
      dtype=object)

We are able to recover the parameters, but there is significant intra-chain homogeneity. That is fine, though one way to get around this is to explicitly instantiate prior distributions for each of the parameters instead.

Autoregressive HMMs with Gaussian Emissions

Let's now add in the autoregressive component to it. The data we will use is the ar_het_ems data, which were generated by using a heteroskedastic assumption, with Gaussian emissions whose mean depends on the previous value, while variance depends on state.

As a reminder of what the data look like:

In [99]:

ar_het_ems = ar_gaussian_heteroskedastic_emissions(states, k=0.6, sigmas=[0.5, 0.1, 0.01])
plot_emissions(states, ar_het_ems)

Out[99]:

Let's now define the AR-HMM.

In [100]:

class ARHMMGaussianEmissions(pm.Continuous):
    def __init__(self, states, k, sigma, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.states = states
        self.sigma = sigma  # variance
        self.k = k  # autoregressive coefficient.

    def logp(self, x):
        """
        x: observations
        """
        states = self.states
        sigma = self.sigma[states]
        k = self.k
        
        ar_mean = k * x[:-1]
        ar_like = tt.sum(pm.Normal.dist(mu=ar_mean, sigma=sigma[1:]).logp(x[1:]))
        
        boundary_like = pm.Normal.dist(mu=0, sigma=sigma[0]).logp(x[0])
        return ar_like + boundary_like

In [101]:

n_states = 3
with pm.Model() as model:
    # Priors for transition matrix
    p_transition = pm.Dirichlet("p_transition", a=tt.ones((n_states, n_states)), shape=(n_states, n_states))
    
    # Solve for the equilibrium state
    p_equilibrium = solve_equilibrium(n_states, p_transition)

    # HMM state
    hmm_states = HMMStates(
        "hmm_states", 
        p_transition=p_transition, 
        p_equilibrium=p_equilibrium, 
        n_states=n_states, 
        shape=(len(poisson_ems),)
    )
    
    # Prior for sigma and k
    sigma = pm.Exponential("sigma", lam=2, shape=(n_states,))
    k = pm.Beta("k", alpha=2, beta=2) # a not-so-weak prior for k
    
    # Observed emission likelihood
    obs = ARHMMGaussianEmissions(
        "emission", 
        states=hmm_states, 
        sigma=sigma,
        k=k,
        observed=ar_het_ems
    )

Out[101]:

/home/ericmjl/anaconda/envs/bayesian-analysis-recipes/lib/python3.8/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])

In [102]:

with model:
    trace = pm.sample(2000, chains=4)

Out[102]:

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>NUTS: [k, sigma, p_transition]
>CategoricalGibbsMetropolis: [hmm_states]
Sampling 4 chains, 0 divergences: 100%|██████████| 10000/10000 [11:28<00:00, 14.53draws/s]
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

Let's now take a look at the key parameters we might be interested in estimating:

$k$ : the autoregressive coefficient, or how much previous emissions influence current emissions.
$\sigma$ : the variance that belongs to each state.

In [103]:

az.plot_forest(trace, var_names=["k"])

Out[103]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fadf3eef160>],
      dtype=object)

In [105]:

az.plot_trace(trace, var_names=["k"])

Out[105]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae09431d00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0992ddf0>]],
      dtype=object)

It looks like we were able to obtain the value of $k$ correctly!

In [106]:

az.plot_trace(trace, var_names=["sigma"])

Out[106]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a97c730>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae194e0430>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae18215910>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae095eaf70>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fae08cb12e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fae0a7af2e0>]],
      dtype=object)

In [107]:

az.plot_forest(trace, var_names=["sigma"])

Out[107]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fae0888af40>],
      dtype=object)

It also looks like we were able to obtain the correct sigma values too, except that the chains are mixed up. We would do well to take care when calculating means for each parameter on the basis of chains.

How about the chain states? Did we get them right?

In [120]:

fig, ax = plt.subplots(figsize=(12, 4))
plt.plot(np.round(trace["hmm_states"].mean(axis=0)))
plt.plot(2 - np.array(states))

Out[120]:

[<matplotlib.lines.Line2D at 0x7fae13fa5880>]

I had to flip the states because they were backwards relative to the original.

Qualitatively, not bad! If we wanted to be a bit more rigorous, we would quantify the accuracy of state identification.

If the transition probabilities were a bit more extreme, we might have an easier time with the identifiability of the states. As it stands, because the variance is the only thing that changes, and because the variance of two of the three states are quite similar (one is 0.1 and the other is 0.5), distinguishing between these two states may be more difficult because of the autoregressive component suppressing variability of the emissions.

Concluding Notes

Nothing in statistics makes sense...

...unless in light of a "data generating model".

I initially struggled with the math behind HMMs and its variants, because I had never taken the time to think through the "data generating process" carefully. Once we have the data generating process, and in particular, its structure, it becomes trivial to map the structure of the model to the equations that are needed to model it. (I think this is why physicists are such good Bayesians: they are well-trained at thinking about mechanistic, data generating models.)

For example, with autoregressive HMMs, until I sat down and thought through the data generating process step-by-step, nothing made sense. Once I wrote out how the mean of the previous observation influenced the mean of the current observation, then things made a ton of sense.

In fact, now that I look back on my learning journey in Bayesian statistics, if we can define a likelihood function for our data, we can trivially work backwards and design a data generating process.

Model structure is important

While writing out the PyMC3 implementations and conditioning them on data, I remember times when I mismatched the model to the data, thus generating posterior samples that exhibited pathologies: divergences and more. This is a reminder that getting the structure of the model is very important.

Keep learning

I hope this essay was useful for your learning journey as well. If you enjoyed it, please take a moment to star the repository!

Markov Models From The Bottom Up, with Python

Markov Models (HMMs): What they are, with in mostly plain English and some math

Initializing a Markov chain

Modelling transitions between states

Equilibrium or Stationary Distribution

Generating a Markov Sequence

Emissions: When Markov chains not only produce "states", but also observable data

Gaussian Emissions: When Markov chains emit Gaussian-distributed data.

Emission Distributions can be any valid distribution!

Autoregressive Emissions

Heteroskedastic Autoregressive Emissions

How does the autoregressive coefficient $k$ affect the Markov chain emissions?

Homoskedastic Autoregressive Emissions

Non-Autoregressive Homoskedastic Emissions

Summary of MMs all the way to AR-HMMs

Bayesian Inference on Markov Models

Markov Chain Log-Likelihood Calculation

Markov Chain with Gaussian Emissions Log-Likelihood Calculation

Markov Chain with Autoregressive Gaussian Emissions Log-Likelihood Calculation

HMM Distributions in PyMC3

HMM States Distribution

HMM with Gaussian Emissions

Autoregressive HMMs with Gaussian Emissions

Concluding Notes

Nothing in statistics makes sense...

Model structure is important

Keep learning

Product

Resources

Company

Markov Models From The Bottom Up, with Python

Markov Models (HMMs): What they are, with in mostly plain English and some math

Initializing a Markov chain

Modelling transitions between states

Equilibrium or Stationary Distribution

Generating a Markov Sequence

Emissions: When Markov chains not only produce "states", but also observable data

Gaussian Emissions: When Markov chains emit Gaussian-distributed data.

Emission Distributions can be any valid distribution!

Autoregressive Emissions

Heteroskedastic Autoregressive Emissions

How does the autoregressive coefficient kkk affect the Markov chain emissions?

Homoskedastic Autoregressive Emissions

Non-Autoregressive Homoskedastic Emissions

Summary of MMs all the way to AR-HMMs

Bayesian Inference on Markov Models

Markov Chain Log-Likelihood Calculation

Markov Chain with Gaussian Emissions Log-Likelihood Calculation

Markov Chain with Autoregressive Gaussian Emissions Log-Likelihood Calculation

HMM Distributions in PyMC3

HMM States Distribution

HMM with Gaussian Emissions

Autoregressive HMMs with Gaussian Emissions

Concluding Notes

Nothing in statistics makes sense...

Model structure is important

Keep learning

How does the autoregressive coefficient $k$ affect the Markov chain emissions?