Markov Models From The Bottom Up, with Python
Markov models are a useful class of models for sequential data. Before recurrent neural networks (which can be thought of as an upgraded Markov model) came along, Markov models and their variants were the in thing for processing time series and biological data.
Just recently, I was involved in a project with a colleague, Zach Barry, where we thought autoregressive hidden Markov models (AR-HMMs) might be useful. Beyond our hack session one afternoon, it set off a series of self-study that culminated in this essay. By writing this down for my own memory, my hope is that it gives you a resource to refer back to as well.
You'll notice that I don't talk about inference (i.e. inferring parameters from data) until the end: this is intentional. As I've learned over the years doing statistical modelling and machine learning, nothing makes sense without first becoming deeply familiar with the "generative" story of each model, i.e. the algorithmic steps that let us generate data. It's a very Bayesian-influenced way of thinking that I hope you will become familiar with too.
Markov Models (HMMs): What they are, in mostly plain English and some math
The simplest Markov models assume that we have a system containing a finite set of states, and that the system transitions between these states with some probability at each time step $t$, thus generating a sequence of states over time. Let's call the set of possible states $S$ (distinct from the sequence of states the chain visits over time), where $S = \{s_1, s_2, \ldots, s_n\}$.
To keep things simple, let's stick with unique prime numbers as state labels and go with a three-state model: $S = \{2, 3, 5\}$.
They thus generate a sequence of states, with one possible realization being, say: $(2, 2, 3, 3, 5, 5, 5, 2, \ldots)$
Initializing a Markov chain
Every Markov chain needs to be initialized. To do so, we need an initial state probability vector, which tells us what the distribution of initial states will be. Let's call the vector $p_S$, where the subscript $S$ indicates that it is for the "states".
Semantically, its entries allocate the probabilities of starting the sequence at a given state. For example, we might assume a discrete uniform distribution, which in Python would look like:
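A minimal sketch with NumPy:

```python
import numpy as np

p_S = np.array([1 / 3, 1 / 3, 1 / 3])  # equal chance of starting in each of the 3 states
```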
Alternatively, we might assume a fixed starting point, which can be expressed as the array:
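For example, always starting in state 1 (which state we fix is arbitrary):

```python
p_S = np.array([1.0, 0.0, 0.0])  # the chain always starts in state 1
```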
Alternatively, we might assign non-zero probabilities to each state in a non-uniform fashion:
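The exact values here are illustrative:

```python
p_S = np.array([0.1, 0.7, 0.2])  # non-uniform, but every state has non-zero probability
```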
Finally, we might assume that the system was long-running before we started observing the sequence of states, and as such the initial state was drawn as one realization of some equilibrated distribution of states. Keep this idea in your head, as we'll need it later.
For now, just to keep things concrete, we'll carry the non-uniform probability vector above forward as our initial distribution.
Modelling transitions between states
To know how a system transitions between states, we now need a transition matrix. The transition matrix describes the probability of transitioning from one state to another. (The probability of staying in the same state is semantically equivalent to transitioning to the same state.)
By convention, transition matrix rows correspond to the state at time $t$, while columns correspond to the state at time $t+1$. Hence, row probabilities sum to one, because the probability of transitioning to the next state depends only on the current state, and all possible states are known and enumerated.
Let's call the transition matrix $p_T$. The symbol etymology, which usually gets swept under the rug in mathematically-oriented papers, is as follows:
$T$ doesn't refer to time but simply indicates that it is for transitioning states,
$p$ is used because it is a probability matrix.
Using the transition matrix, we can express that the system likes to stay in the state that it enters into, by assigning larger probability mass to the diagonals. Alternatively, we can express that the system likes to transition out of states that it enters into, by assigning larger probability mass to the off-diagonal.
Alrighty, enough with that now, let's initialize a transition matrix below.
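Here's one possibility; the exact probabilities are illustrative, chosen with heavy diagonals so the chain likes to stay where it is:

```python
import numpy as np

p_T = np.array([
    [0.90, 0.05, 0.05],
    [0.01, 0.90, 0.09],
    [0.07, 0.03, 0.90],
])
```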
And just to confirm with you that each row sums to one:
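A quick check:

```python
assert np.allclose(p_T.sum(axis=1), 1.0)
p_T.sum(axis=1)  # array([1., 1., 1.])
```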
Equilibrium or Stationary Distribution
Now, do you remember how above we talked about the Markov chain being in some "equilibrated" state? Well, the stationary or equilibrium distribution of a Markov chain is the distribution of observed states at infinite time. An interesting property is that, for a well-behaved chain (one that is irreducible and aperiodic, i.e. every state can eventually reach every other and the chain doesn't cycle deterministically), the equilibrium distribution is the same regardless of the initial state.
The math, as it turns out, is nothing more than a sequence of dot products between the state probability vector and the transition matrix: at each step, $p_{t+1} = p_t \cdot p_T$. Let's iterate those updates and watch where they head in the long run.
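A minimal sketch of those updates, starting from the $p_S$ we specified above:

```python
p_state = p_S.copy()
for _ in range(10):
    p_state = p_state @ p_T  # one step of the chain: p_{t+1} = p_t · p_T
    print(p_state)
```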
If you're viewing this notebook on Binder or locally, go ahead and modify the initial state distribution to convince yourself that it doesn't matter where the chain starts: as long as the transition matrix stays the same, the long-run distribution will always be the same.
As it turns out, there's also a way to solve for the equilibrium distribution analytically from the transition matrix. This involves solving a linear algebra problem, which we can do using Python. (Credit goes to this blog post from which I modified the code to fit the variable naming here.)
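Here's a sketch of one standard approach: solve $\pi \, p_T = \pi$ together with the constraint that $\pi$ sums to one, via least squares:

```python
import numpy as np

def equilibrium_distribution(p_transition):
    """Solve pi @ p_transition = pi, subject to sum(pi) == 1."""
    n_states = p_transition.shape[0]
    # Stack the stationarity conditions with the normalization constraint.
    A = np.vstack([p_transition.T - np.eye(n_states), np.ones(n_states)])
    b = np.append(np.zeros(n_states), 1.0)
    p_eq = np.linalg.lstsq(A, b, rcond=None)[0]
    return p_eq
```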
Generating a Markov Sequence
Generating a Markov sequence means we "forward" simulate the chain by:
(1) Optionally drawing an initial state from $p_S$ (let's call that state $s_0$). This is done by drawing from a Multinomial distribution: $s_0 \sim \text{Multinomial}(p_S)$
If we assume (and keep in mind that we don't have to) that the system was equilibrated before we started observing its state sequence, then the initial state distribution is equivalent to the equilibrium distribution. All this means that we don't necessarily have to specify the initial distribution explicitly.
(2) Drawing the next state by indexing into the transition matrix $p_T$ and drawing the new state from the Multinomial distribution given by that row: $s_{t+1} \sim \text{Multinomial}(p_T[i])$, where $i$ is the index of the current state $s_t$.
In Python code:
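A minimal sketch, using np.random.choice with weights; the helper name markov_sequence and its exact signature are what the rest of this essay will assume:

```python
import numpy as np

def markov_sequence(p_init, p_transition, sequence_length):
    """Forward-simulate a sequence of integer state indices from a Markov chain."""
    n_states = len(p_init)
    states = [np.random.choice(n_states, p=p_init)]
    for _ in range(sequence_length - 1):
        states.append(np.random.choice(n_states, p=p_transition[states[-1]]))
    return states
```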
With this function in hand, let's generate a sequence of length 1000.
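Using the sketch above:

```python
states = markov_sequence(p_S, p_T, sequence_length=1000)
```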
As is pretty evident from the transition probabilities, once the Markov chain enters a state, it tends not to move out of it.
If you've opened up this notebook in Binder or locally, feel free to modify the transition probabilities and initial state probabilities above to see how the Markov sequence changes.
Emissions: When Markov chains not only produce "states", but also observable data
So as you've seen above, a Markov chain can produce "states". If we are given direct access to the "states", then a problem that we may have is inferring the transition probabilities given the states.
A more common scenario, however, is that the states are latent, i.e. we cannot directly observe them. Instead, the latent states generate data that are given by some distribution conditioned on the state. We call these Hidden Markov Models.
That all sounds abstract, so let's try to make it more concrete.
Gaussian Emissions: When Markov chains emit Gaussian-distributed data.
With a three-state model, we might say that the emissions are Gaussian-distributed, but the location ($\mu$) and scale ($\sigma$) vary based on which state we are in. In the simplest case:
State 1 gives us data $y_t \sim \mathcal{N}(\mu_1, \sigma_1)$
State 2 gives us data $y_t \sim \mathcal{N}(\mu_2, \sigma_2)$
State 3 gives us data $y_t \sim \mathcal{N}(\mu_3, \sigma_3)$
It turns out we can model this in Python code too!
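A minimal sketch using np.random.normal; the per-state mus and sigmas values below are illustrative:

```python
import numpy as np

def gaussian_emissions(states, mus, sigmas):
    """One Gaussian draw per timestep, with state-indexed location and scale."""
    return [np.random.normal(loc=mus[s], scale=sigmas[s]) for s in states]

# Illustrative parameter values only.
gaussian_ems = gaussian_emissions(states, mus=[1.0, 5.0, 9.0], sigmas=[0.2, 0.5, 0.1])
```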
Let's see what the emissions look like.
Emission Distributions can be any valid distribution!
Nobody said we have to use Gaussian distributions for emissions; we can, in fact, have a ton of fun and start simulating data using other distributions!
Let's try Poisson emissions. Here, the Poisson rate $\lambda$ takes one value per state. In our example below:
State 1 gives us data $y_t \sim \text{Poisson}(\lambda_1)$
State 2 gives us data $y_t \sim \text{Poisson}(\lambda_2)$
State 3 gives us data $y_t \sim \text{Poisson}(\lambda_3)$
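A sketch analogous to the Gaussian case, with illustrative per-state rates:

```python
import numpy as np

def poisson_emissions(states, lambdas):
    """One Poisson draw per timestep, with a state-indexed rate."""
    return [np.random.poisson(lam=lambdas[s]) for s in states]

poisson_ems = poisson_emissions(states, lambdas=[1.0, 10.0, 50.0])
```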
Once again, let's observe the emissions:
Hope the point is made: Take your favourite distribution and use it as the emission distribution!
Autoregressive Emissions
ABD: "Interesting" is kind of a weak transition; as a reader, I am losing the thread of where we are headed. I know you said you wanted to do a thorough tour of generation before inference, but you might need to do more to motivate generation.
Autoregressive emissions make things even more interesting and flexible! The "autoregressive" component tells us that the emission value does not only depend on the current state, but also on previous state(s).
How, though, can we enforce this dependency structure? Well, as implied by the term "structure", we have some set of equations that relate the parameters of the emission distribution to the value of the previous emission.
Heteroskedastic Autoregressive Emissions
Here's a "simple complex" example, where the location of the emission distribution at time depends on , and only the scale depends only on the state.
ABD: Can you give an example of an application?
Here, is an autoregressive coefficient that describes the strength of dependence on the previous state. We might also assume that the initial location . Because the scale varies with state, the emissions are called heteroskedastic, which means "of non-constant variance". In the example below:
State 1 gives us $y_t \sim \mathcal{N}(\mu_t, \sigma_1)$ (kind of small variance).
State 2 gives us $y_t \sim \mathcal{N}(\mu_t, \sigma_2)$ (smaller variance).
State 3 gives us $y_t \sim \mathcal{N}(\mu_t, \sigma_3)$ (very small variance).
In Python code, we would model it this way:
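Here's a sketch; the function name matches the ar_gaussian_heteroskedastic_emissions function referenced later in this essay, while the parameter values in the call are illustrative:

```python
import numpy as np

def ar_gaussian_heteroskedastic_emissions(states, k, sigmas):
    """Location is k times the previous emission; scale is indexed by the state."""
    emissions = []
    prev = 0.0  # assume the initial location mu_0 = 0
    for s in states:
        y = np.random.normal(loc=k * prev, scale=sigmas[s])
        emissions.append(y)
        prev = y
    return emissions

# Illustrative parameter values only.
ar_het_ems = ar_gaussian_heteroskedastic_emissions(states, k=0.9, sigmas=[0.5, 0.1, 0.01])
```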
Contrast that against vanilla Gaussian emissions that are non-autoregressive:
Keep in mind that, regardless of what the emissions are, with heteroskedastic autoregressive emissions it is the variance around the emissions that gives us information about the state, not the location.
How does the autoregressive coefficient affect the Markov chain emissions?
As should be visible, the structure of autoregressiveness can really change how things look! What happens as $k$ changes?
Interesting stuff! As $k \to 1$, we approach a heteroskedastic Gaussian random walk, where only the variance of the observations, rather than their location, gives us information about the state.
Homoskedastic Autoregressive Emissions
What if we instead wanted the variance to remain the same, but desired that the emission location give us information about the state while still being autoregressive? Well, we can bake that into the equation structure!
In Python code:
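A sketch, assuming the location mixes a state-indexed mean with the previous emission while the scale stays constant; the helper name and values are illustrative:

```python
import numpy as np

def ar_gaussian_homoskedastic_emissions(states, k, mus, sigma):
    """Location depends on the previous emission and the state's mean; constant scale."""
    emissions = []
    prev = 0.0
    for s in states:
        y = np.random.normal(loc=k * prev + mus[s], scale=sigma)
        emissions.append(y)
        prev = y
    return emissions

ar_hom_ems = ar_gaussian_homoskedastic_emissions(states, k=0.5, mus=[1.0, 5.0, 9.0], sigma=0.1)
```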
The variance is too small relative to the scale of the data, so it looks like smooth lines.
If we change $k$, however, we get interesting effects.
Notice how we get "smoother" transitions into each state. It's less jumpy. This is extremely useful for modelling motion activity, for example, where people move into and out of states without jumpy switching. (We don't go from sitting to standing to walking by jumping between frames; we ease into each.)
Non-Autoregressive Homoskedastic Emissions
With non-autoregressive homoskedastic emissions, the mean gives us information but the scale doesn't; at the same time, the mean depends only on the current state, not on the previous emission.
As you might intuit from looking at the equations, this is nothing more than a special case of the Heteroskedastic Gaussian Emissions example shown much earlier above.
Summary of MMs all the way to AR-HMMs
There's the plain old Markov Model, in which we generate a sequence of states $(s_1, s_2, \ldots, s_T)$ from some initial distribution and transition matrix.
Then there's the "Hidden" Markov Model, in which we don't observe the states but rather the emissions generated from the states (according to some assumed distribution). Now, there's not only the initial distribution and transition matrix to worry about, but also the distribution of the emissions conditioned on the state. The general case is when we have some arbitrary distribution ABD To me, arbitrary distribution includes nonparametric, but then your examples are all parametric.(i.e. the Gaussian or the Poisson or the Chi-Squared - whichever fits the likelihood of your data best).
Where refers to the parameters for the generic distribution that are indexed by the state . Your distributions probably generally come from the same family (e.g. "Gaussians"), or you can go super complicated and generate them from different distributions.
In special cases, the parameters of the emission distribution can be held constant (i.e. simple random walks), or they can depend on the state (i.e. basic HMMs). If you make the variance of the likelihood distribution vary based on state, you get heteroskedastic HMMs; conversely, if you keep the variance constant, then you have homoskedastic HMMs.
Then, there's the "Autoregressive" Hidden Markov Models, in which the emissions generated from the states have a dependence on the previous states. Here, we have the ultimate amount of flexibility to model our processes.
To keep things simple in this essay, we've only considered a lag of 1 (which is where the $y_{t-1}$ comes from). However, arbitrary numbers of time lags are possible too!
And, as usual, you can make them homoskedastic or heteroskedastic by simply controlling the variance parameter of the distribution.
Bonus point: your emissions don't necessarily have to be one-dimensional; they can be multidimensional too! As long as you write the emission equations in a fashion that handles a multidimensional $y_{t-1}$, you're golden. Moreover, the autoregressive function can be any function you like; it doesn't have to be linear (like ours); it can instead be a neural network if you so choose, giving you a natural progression from Markov models to recurrent neural networks. That, however, is out of scope for this essay.
Bayesian Inference on Markov Models
Now that we've gone through the "data generating process" for Markov sequences with emissions, we can re-examine the entire class of models in a Bayesian light.
If you've been observing the models that we've been "forward-simulating" all this while to generate data, you'll notice that there are a few key parameters that seemed like, "well, if we changed them, then the data would change, right?" If that's what you've been thinking, then bingo! You're on the right track.
Moreover, you'll notice that I've couched everything in the language of probability distributions. The transition probabilities are given by a Multinomial distribution. The emission probabilities are given by some continuous (or discrete) distribution, depending on which one best describes the likelihood of the data. Given that we're working with probability distributions and data, you've probably been thinking about it already: we need a way to calculate the log-likelihoods of the data that we observe!
Markov Chain Log-Likelihood Calculation
(Why log-likelihoods rather than plain likelihoods? Multiplying many small probabilities together quickly underflows floating-point precision, while summing their logarithms stays numerically stable.)
Let's examine how we would calculate the log likelihood of state data given the parameters. This will lead us to the Markov chain log-likelihood.
Since each transition is a draw from a Multinomial distribution, the log-likelihood of a state sequence, given the transition matrix, is the sum of the log probabilities of the individual transitions. This follows from the factorization of a Markov chain: by the Markov property, the joint probability factors as $P(s_1, \ldots, s_T) = P(s_1) \prod_{t=2}^{T} P(s_t \mid s_{t-1})$, so taking logs turns the product into a sum. If this trips you up, don't worry: take a hiatus from the essay and draw it out. Otherwise, take my word for it for now:
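A loop-form sketch using scipy.stats.multinomial (hence the one-hot transformation):

```python
import numpy as np
from scipy import stats

def state_logp(states, p_transition):
    """Sum of log transition probabilities; the first state is handled separately below."""
    n_states = p_transition.shape[0]
    logp = 0.0
    for curr, nxt in zip(states[:-1], states[1:]):
        one_hot = np.eye(n_states)[nxt]  # multinomial logpmf expects one-hot counts
        logp += stats.multinomial(n=1, p=p_transition[curr]).logpmf(one_hot)
    return logp
```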
We will also write a vectorized version of `state_logp`.
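Same calculation, with fancy indexing instead of a Python loop:

```python
import numpy as np

def state_logp_vect(states, p_transition):
    """Vectorized sum of log transition probabilities."""
    states = np.asarray(states)
    return np.sum(np.log(p_transition[states[:-1], states[1:]]))
```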
Now, there is a problem here: we also need the log likelihood of the first state.
Remember that if we don't know what the initial distribution is supposed to be, one possible assumption we can make is that the Markov sequence began by drawing from the equilibrium distribution. Here is where the equilibrium distribution calculation from before comes in handy!
Taken together, we get the following Markov chain log-likelihood:
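A sketch, reusing equilibrium_distribution and state_logp_vect from above:

```python
import numpy as np

def markov_chain_logp(states, p_transition):
    """Equilibrium log-probability of the first state, plus the transition log-likelihood."""
    p_eq = equilibrium_distribution(p_transition)
    return np.log(p_eq[states[0]]) + state_logp_vect(states, p_transition)
```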
Markov Chain with Gaussian Emissions Log-Likelihood Calculation
Now that we know how to calculate the log-likelihood for the Markov chain sequence of states, we can move on to the log-likelihood calculation for the emissions.
Let's first assume that we have emissions that are non-autoregressive, and have a Gaussian likelihood.
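A loop-form sketch:

```python
from scipy import stats

def emission_logp(states, emissions, mus, sigmas):
    """Gaussian logpdf of each emission under its state's parameters, summed."""
    logp = 0.0
    for s, y in zip(states, emissions):
        logp += stats.norm(mus[s], sigmas[s]).logpdf(y)
    return logp
```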
And we'll also make a vectorized version of it:
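```python
import numpy as np
from scipy import stats

def emission_logp_vect(states, emissions, mus, sigmas):
    """Vectorized Gaussian emission log-likelihood."""
    states = np.asarray(states)
    mus, sigmas = np.asarray(mus), np.asarray(sigmas)
    return np.sum(stats.norm(mus[states], sigmas[states]).logpdf(np.asarray(emissions)))
```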
The joint log likelihood of the emissions and states is then given by their summation.
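For example, with the Gaussian emissions from earlier (parameter values again illustrative):

```python
joint_logp = markov_chain_logp(states, p_T) + emission_logp_vect(
    states, gaussian_ems, mus=[1.0, 5.0, 9.0], sigmas=[0.2, 0.5, 0.1]
)
```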
If you're in a Binder or local Jupyter session, go ahead and tweak the values of `mus` and `sigmas`, and verify for yourself that the values passed in are the "maximum likelihood" values. After all, our Gaussian emission data were generated according to this exact set of parameters!
Markov Chain with Autoregressive Gaussian Emissions Log-Likelihood Calculation
I hope the pattern is starting to be clear here: since we have Gaussian emissions, we only have to calculate the parameters of the Gaussian to know what the logpdf would be.
As an example, I will be using the Gaussian with:
State-varying scale
Mean that is dependent on the previously emitted value
This is the AR-HMM with data generated from the `ar_gaussian_heteroskedastic_emissions` function.
Now, we can write the full log likelihood of the entire AR-HMM:
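A sketch, assuming the same AR(1) structure as ar_gaussian_heteroskedastic_emissions:

```python
import numpy as np
from scipy import stats

def ar_emission_logp(states, emissions, k, sigmas):
    """Heteroskedastic AR(1) Gaussian: location = k * previous emission, scale from state."""
    emissions = np.asarray(emissions)
    locs = np.concatenate([[0.0], k * emissions[:-1]])  # mu_0 assumed to be 0
    scales = np.asarray(sigmas)[np.asarray(states)]
    return np.sum(stats.norm(locs, scales).logpdf(emissions))

def ar_hmm_logp(states, emissions, p_transition, k, sigmas):
    """States log-likelihood plus autoregressive emissions log-likelihood."""
    return markov_chain_logp(states, p_transition) + ar_emission_logp(states, emissions, k, sigmas)
```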
For those of you who are familiar with Bayesian inference: as soon as we have a log likelihood that we can calculate, we can add priors and, by Bayes' rule, obtain posterior distributions by running an MCMC sampler.
If this all looks foreign to you, check out my other essay for a first look (or a refresher)!
HMM Distributions in PyMC3
While PyMC4 is in development, PyMC3 remains one of the leading probabilistic programming languages for Bayesian inference. PyMC3 doesn't ship an HMM distribution in the library, but thanks to GitHub user @hstrey posting a Jupyter notebook with HMMs defined in it, many PyMC3 users, myself included, have had a great baseline distribution to study pedagogically and use in their applications.
Side note: I used @hstrey's implementation before setting out to write this essay. Thanks!
HMM States Distribution
Let's first look at the HMM States distribution, which will give us a way to calculate the log probability of the states.
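The original notebook defines a full HMM states distribution class (after @hstrey); rather than reproduce that class here, the following is a minimal sketch of the core log-probability idea, conditioning on the states sequence we simulated earlier. The model and variable names, and the priors, are illustrative:

```python
import numpy as np
import pymc3 as pm

observed_states = np.asarray(states)
n_states = 3

with pm.Model() as states_model:
    # One Dirichlet prior per row of the transition matrix
    # (more on why Dirichlet just below).
    p_transition = pm.Dirichlet(
        "p_transition", a=np.ones((n_states, n_states)), shape=(n_states, n_states)
    )
    # Each observed transition s_t -> s_{t+1} is a categorical draw
    # from the row of the transition matrix indexed by s_t.
    pm.Categorical(
        "transitions",
        p=p_transition[observed_states[:-1]],
        observed=observed_states[1:],
    )
```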
In the code above, the categorical distribution is used for convenience: it can handle integers directly, while the multinomial requires a one-hot transformation.
Now, we stated earlier on that the transition matrix can be treated as a parameter to tweak, or else as a random variable whose value we want to infer. This means there is a natural fit for placing priors on it! Dirichlet distributions are great priors for probability vectors, as they are the generalization of the Beta distribution.
Now let's fit the model!
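For example, with PyMC3's default sampler:

```python
with states_model:
    trace = pm.sample(2000, tune=1000)
```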
It looks like we were able to recover the original transition probabilities!
HMM with Gaussian Emissions
Now let's try out an HMM model with Gaussian emissions.
We are able to recover the parameters, but there is significant intra-chain homogeneity. That is fine, though one way to get around this is to explicitly instantiate prior distributions for each of the parameters instead.
Autoregressive HMMs with Gaussian Emissions
Let's now add in the autoregressive component. The data we will use is the `ar_het_ems` data, which were generated under a heteroskedastic assumption: Gaussian emissions whose mean depends on the previous value, while the variance depends on the state.
As a reminder of what the data look like:
Let's now define the AR-HMM.
Let's now take a look at the key parameters we might be interested in estimating:
$k$: the autoregressive coefficient, or how much previous emissions influence current emissions.
$\sigma$: the scale parameter belonging to each state.
It looks like we were able to obtain the value of $k$ correctly!
It also looks like we were able to obtain the correct sigma values too, except that the chains are mixed up. We would do well to take care when calculating means for each parameter on the basis of chains.
How about the chain states? Did we get them right?
I had to flip the states because they were backwards relative to the original.
Qualitatively, not bad! If we wanted to be a bit more rigorous, we would quantify the accuracy of state identification.
If the transition probabilities were a bit more extreme, we might have an easier time with the identifiability of the states. As it stands, because the variance is the only thing that changes between states, and because the variances of two of the three states are quite similar (one is 0.1 and the other is 0.5), distinguishing between those two states may be more difficult, with the autoregressive component further suppressing the variability of the emissions.
Concluding Notes
Nothing in statistics makes sense...
...unless in light of a "data generating model".
I initially struggled with the math behind HMMs and its variants, because I had never taken the time to think through the "data generating process" carefully. Once we have the data generating process, and in particular, its structure, it becomes trivial to map the structure of the model to the equations that are needed to model it. (I think this is why physicists are such good Bayesians: they are well-trained at thinking about mechanistic, data generating models.)
For example, with autoregressive HMMs, until I sat down and thought through the data generating process step-by-step, nothing made sense. Once I wrote out how the mean of the previous observation influenced the mean of the current observation, then things made a ton of sense.
In fact, now that I look back on my learning journey in Bayesian statistics, if we can define a likelihood function for our data, we can trivially work backwards and design a data generating process.
Model structure is important
While writing out the PyMC3 implementations and conditioning them on data, I remember times when I mismatched the model to the data, thus generating posterior samples that exhibited pathologies: divergences and more. This is a reminder that getting the structure of the model right is very important.
Keep learning
I hope this essay was useful for your learning journey as well. If you enjoyed it, please take a moment to star the repository!