Lab 6: Hidden Markov Model
In this lab we will look into Hidden Markov Models (HMMs) to model sequential data. HMMs use Markov chains to compute probabilities for a sequence of observed events. Moreover, HMMs are based on the Markov assumption, which states that the present state is sufficient to predict the future, so the past can be discarded.
Often, we are in a situation where the states we are interested in are hidden: we cannot observe them directly. As we will see in Exercise 2, the part-of-speech (POS) tags in a text are hidden and we can only observe the words; from these words we have to infer the tags. Similarly, in Exercise 3, where a robot needs to be localised, we cannot observe the robot's position but rather the measurements of its sensors. Both the tags and the robot position are called hidden variables because they are not observed. An HMM is specified by the following components:
- A set of states.
- A transition probability matrix $A$, where each element $a_{ij}$ represents the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$.
- An emission probability distribution $B = b_i(o_t)$, the probabilities of observations $o_t$ being generated from a state $i$.
- An initial probability distribution over states $\pi$. $\pi_i$ is the probability that the Markov chain will start in state $i$. Also, $\sum_{i=1}^{N} \pi_i = 1$.
Additional packages (if using your own machine)
For this lab we need `hmmlearn`, `nltk` and `ipywidgets`; if you are using your own machine, you can install them with conda:
Let's import the libraries:
1) HMM with 2 states
We will start with a warm-up exercise where you have to implement an HMM with 2 states and then analyse how the transition probability and emission noise influence the output. Your task is to create a function `hmm_model` that has three arguments:

- `switch_prob`: the probability of switching to the other state.
- `noise_level`: the variance of the emission distribution, i.e. the measurement/observation noise.
- `startprob`: the probabilities of starting in each state (shape `(2, 1)`).
Your function should create a `hmm.GaussianHMM` model with two (hidden) states and `covariance_type="full"`. Modify the following model attributes (a minimal sketch is given after this list):

- `startprob_`: using the `startprob` argument.
- `transmat_`: implement a transition matrix where the probability of switching is `switch_prob` for both states.
- `means_`: set a mean for each state.
- `covars_`: using the `noise_level` argument (note: the shape should be `(2, 1, 1)`, i.e. two covariance matrices).
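A minimal sketch of such a model, assuming means of +1 and -1 for the two states (placeholder values; use whatever the lab specifies):

```python
import numpy as np
from hmmlearn import hmm

def hmm_model(switch_prob=0.1, noise_level=0.1, startprob=np.array([1.0, 0.0])):
    """Sketch of the 2-state Gaussian HMM described above."""
    model = hmm.GaussianHMM(n_components=2, covariance_type="full")
    model.startprob_ = np.asarray(startprob).reshape(-1)          # flatten in case it arrives as shape (2, 1)
    model.transmat_ = np.array([[1 - switch_prob, switch_prob],   # probability of staying vs. switching
                                [switch_prob, 1 - switch_prob]])
    model.means_ = np.array([[1.0], [-1.0]])                      # assumed 1-D mean per state
    model.covars_ = noise_level * np.ones((2, 1, 1))              # one 1x1 covariance matrix per state
    return model
```

With a small `switch_prob` the chain rarely changes state, so samples stay near one mean for long stretches, while a larger `noise_level` blurs the two states together.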
Next, execute the cell below to run your `hmm_model` and visualise the output with the provided `plot_hmm` function.
Run the cell below and experiment with the interactive widget window. Try changing the values of `switch_prob` and `noise_level`; how does that change the output?
2) HMM Part-of-Speech Tagging
Part-of-speech (POS) tagging enables the extraction of meaningful information about words in a sentence and their relation to neighbouring words. Knowing whether a word is a noun or a verb tells us about its most likely neighbouring words (nouns are often preceded by determiners and adjectives, verbs by nouns) and about syntactic structure (nouns are generally part of noun phrases), making POS tagging a key aspect of parsing. Parts of speech are useful features for labelling named entities like people or organisations in information extraction, or for coreference resolution. A word's part of speech can even play a role in speech recognition or synthesis, e.g., the word "content" is pronounced CONtent when it is a noun and conTENT when it is an adjective.
Part-of-speech tagging is the process of assigning a POS tag to each word token in a given text. The input to a tagging algorithm is a sequence of tokenised words and the output is a sequence of tags, one per token. In particular, many words are ambiguous in that they have more than one possible POS, and the goal is to find the correct tag for each occurrence. For example, "book" can be a verb ("book that flight") or a noun ("hand me that book"). The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
In this section we introduce the use of the Hidden Markov Model for part-of-speech tagging. The HMM can be used as a sequence tagger to assign a tag or class label to each unit in an observed sequence, thus mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
2.1) Brown corpus dataset
The Brown corpus is a common dataset in Natural Language Processing (NLP). It is available from the NLTK library and comes with POS-tag annotations. Run the cell below to download the dataset and the universal tagset.
First we need to get the dataset and split it into training and testing. Run the following cells to achieve that and to prepare the dataset in the correct format.
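For reference, a hedged sketch of what downloading and splitting the corpus can look like; the actual split and format used in the lab are defined by the cells themselves:

```python
import nltk
from nltk.corpus import brown

# Download the corpus and the universal tagset (the lab cell does this for you)
nltk.download('brown')
nltk.download('universal_tagset')

# Hypothetical 95/5 train/test split; the lab's own cells define the real one
tagged_sents = list(brown.tagged_sents(tagset='universal'))
split = int(0.95 * len(tagged_sents))
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]
print(len(train_sents), len(test_sents))
```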
Let's explore the different types of tags by running the next cell.
The next cell explores the structure of the dataset in terms of words and sentences.
2.2) POS tagging model
Two of the main components of an HMM are the transition model and the emission model. You already got a taste of how the two models operate in Exercise 1. Your task now is to implement both models for the POS tagging model.
Specifically, the transition model will estimate the probability of the next tag given the current tag while the emission model will estimate the probability of observing a word given the current tag.
Note: the provided hints link to the `nltk.probability` functions; however, you are free to use other libraries such as `numpy`.
2.2.1) Emission Model
Given the tagged words, the main steps are:

1. Create a list of `(tag, word)` tuples from the input `words`, which will be utilised in step 2 (careful: `words` contains tuples of `(word, tag)`).
2. Calculate the frequency of each word given the corresponding tag (hint: ConditionalFreqDist).
3. Calculate the conditional probability distribution given the frequency distribution of step 2 (hint: ConditionalProbDist with `probdist_factory=MLEProbDist`).
Implement these steps in the cell below in a function `emission_model` that has one argument, `words`, containing a list of `(word, tag)` pairs. A minimal sketch follows for reference.
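A minimal sketch of these steps, assuming the distributions are estimated with the maximum-likelihood factory named in the hint:

```python
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

def emission_model(words):
    """words: a list of (word, tag) pairs."""
    # Step 1: swap each pair so the tag becomes the condition
    tag_word_pairs = [(tag, word) for (word, tag) in words]
    # Step 2: frequency of each word given its tag
    cfd = ConditionalFreqDist(tag_word_pairs)
    # Step 3: maximum-likelihood conditional probability distribution P(word | tag)
    return ConditionalProbDist(cfd, MLEProbDist)
```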
2.2.2) Transition Model
Given the tagged sentences, the main steps are:

1. Create a chain of `(current_tag, next_tag)` tuples to be passed to `ConditionalFreqDist()`. To achieve this efficiently:
   - Create a generator expression (similar to a list comprehension but replacing `[]` with `()`) which iterates over each sentence and, for each sentence, creates the pairs of `(current_tag, next_tag)` (if you are curious about generator expressions, you can read this and this).
   - Create a chain from the output of the generator (hint: itertools.chain.from_iterable()).
2. Calculate the frequency of the next tag given the previous tag (hint: ConditionalFreqDist).
3. Calculate the conditional probability distribution of the next tag given the previous tag (hint: ConditionalProbDist).
Implement these steps in the cell below in a function `transition_model` that has one argument, `sentences`, containing the dataset of sentences. A minimal sketch follows for reference.
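A minimal sketch of these steps; the use of `MLEProbDist` mirrors the emission model and is an assumption, since the hint here only names `ConditionalProbDist`:

```python
from itertools import chain
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

def transition_model(sentences):
    """sentences: a list of tagged sentences, each a list of (word, tag) pairs."""
    # Step 1: generator of (current_tag, next_tag) pairs, chained across all sentences
    tag_pairs = chain.from_iterable(
        ((sent[i][1], sent[i + 1][1]) for i in range(len(sent) - 1))
        for sent in sentences
    )
    # Step 2: frequency of the next tag given the previous tag
    cfd = ConditionalFreqDist(tag_pairs)
    # Step 3: conditional probability distribution P(next tag | previous tag)
    return ConditionalProbDist(cfd, MLEProbDist)
```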
Utilising the previously created models, the cell below trains the emission and transition models, the former on tagged words and the latter on tagged sentences.
Based on the Brown corpus, the trained HMM gives the following probabilities (an example query is shown after this list):

- The probability of observing the word `city` given the tag `NOUN`.
- The probability of the tag `VERB` given the tag `NOUN`.
2.3) Viterbi algorithm
The goal of the Viterbi algorithm is to find the most likely sequence of hidden states for a given sequence of observations. Specifically, we will utilise the Viterbi algorithm to find the tags of a sequence of untagged sentences, a task known as decoding. The benefit of this algorithm comes from its ability to efficiently determine the most probable path amongst the exponentially many possibilities. Doing so, it reduces the complexity from $O(T^N)$ to $O(T^2 N)$, where $N$ is the total number of words and $T$ is the total number of tags.
Your task is to implement the Viterbi algorithm and perform HMM decoding. Complete the missing `to-do` lines; a rough reference sketch is given below.
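If you want a reference point, here is a rough log-space sketch of the algorithm. It assumes `initial[t]`, `transition[t1][t2]` and `emission[t][w]` are plain probability lookups; adapt them to your own model objects (for example, a `ConditionalProbDist` is queried with `.prob()`):

```python
import numpy as np

def viterbi(words, tags, transition, emission, initial):
    """Most likely tag sequence for one sentence (log-space to avoid underflow)."""
    eps = 1e-12                                   # guard against log(0) for unseen events
    N, T = len(words), len(tags)
    v = np.full((N, T), -np.inf)                  # v[i, j]: best log-prob of a path ending in tag j at word i
    back = np.zeros((N, T), dtype=int)            # backpointers to recover the best path

    # Initialisation with the first word
    for j, tag in enumerate(tags):
        v[0, j] = np.log(initial[tag] * emission[tag][words[0]] + eps)

    # Recursion over the remaining words
    for i in range(1, N):
        for j, tag in enumerate(tags):
            scores = [v[i - 1, k] + np.log(transition[tags[k]][tag] * emission[tag][words[i]] + eps)
                      for k in range(T)]
            back[i, j] = int(np.argmax(scores))
            v[i, j] = scores[back[i, j]]

    # Termination: follow backpointers from the best final tag
    path = [int(np.argmax(v[-1]))]
    for i in range(N - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [tags[j] for j in reversed(path)]
```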
Run the cell below to get `n=10` random sentences as test data (untagged words).
Run the cell below to check your implementation of the Viterbi algorithm on the test set and its POS tagging accuracy.
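For reference, the accuracy check amounts to something like the following sketch, where `predictions` and `ground_truth` are hypothetical names for aligned lists of `(word, tag)` pairs:

```python
# Hypothetical names: predictions and ground_truth are aligned lists of (word, tag) pairs
correct = sum(pred_tag == true_tag
              for (_, pred_tag), (_, true_tag) in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)
print(f"POS tagging accuracy: {accuracy:.2%}")
```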
Let's print the first 10 words and their predicted tags,
as well as the mispredicted tags:
How well does the model predict? Can you make any observations regarding the wrongly-predicted tags?
3) Robot localisation -- Optional
Another application of HMMs is robot localisation, which is the focus of this exercise.
We have a robot which has been deployed to Mars to collect some samples. However, during the landing phase the connection was temporarily lost and we are not sure where exactly the robot is. Luckily, the map of the area is available and the robot is equipped with a sensor that can detect rocky surfaces.
The aim is to use an HMM to localise the robot, where the hidden variable is the position of the robot and the observation is given by the sensor measurement. The figure at the top of the lab sheet, taken from Bishop, shows the relation between the hidden variables and the observed variables.
We will achieve this through the discrete case of the Bayes filter, which consists of two main steps: a prediction step and a measurement step. Your task will be to implement these two steps, which are given by the following pseudocode.
First step:

$$\overline{bel}(x_t) = \sum_{x_{t-1}} p(x_t \mid u_t, x_{t-1})\, bel(x_{t-1})$$

where:

- $\overline{bel}(x_t)$: the predicted belief of the new state $x_t$ at the current time, based on the action (control) $u_t$.
- $bel(x_{t-1})$: the belief of state $x_{t-1}$ at the previous time step.
- $p(x_t \mid u_t, x_{t-1})$: the transition probability of moving from state $x_{t-1}$ to state $x_t$ when taking action $u_t$.
In the second step, the knowledge from the measurement model is combined with the predicted belief to obtain the posterior belief of each state. This posterior belief becomes the current belief for the next time step.
$$bel(x_t) = \eta\, p(z_t \mid x_t)\, \overline{bel}(x_t)$$

$p(z_t \mid x_t)$ is also known as the likelihood, i.e. how likely it is to see observation $z_t$ given the robot is in location $x_t$. We need to compute the likelihood before updating the belief. Moreover, $\eta$ is the normalising term such that the probability sums to 1.
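As a concrete illustration of these two steps (not the lab's own implementation, which uses the World and Robot classes provided below), here is a minimal sketch assuming the belief is stored as a flat array over grid cells:

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One discrete Bayes filter update.
    belief:      previous belief over the grid cells, shape (n_cells,)
    transition:  transition[i, j] = probability of moving from cell j to cell i under the chosen action
    likelihood:  likelihood[i] = p(z | x = i) for the current sensor measurement
    """
    # Prediction step: propagate the previous belief through the motion model
    predicted = transition @ belief
    # Measurement step: weight by the likelihood and normalise so the belief sums to 1
    posterior = likelihood * predicted
    return posterior / posterior.sum()
```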
We provide the implementation of both the World and Robot classes:
The robot has 4 possible actions: up, down, left and right. Each action will transition the robot to the new state with probability `p_move`. At each state the robot can measure whether the surface is rocky or normal by calling `measurement()`, where the sensor's noise is given by `p_sense`.
The map is discrete and represented as a grid where each cell indicates a position. As displayed below, yellow corresponds to rocky surface while purple corresponds to normal surface. The sensor measurement outputs a value of 1 for rocky surface and a value of 0 for normal surface.
Run the following two cells to initialise the robot class and show the starting position.
The next cell provides helper functions to visualise the true robot position, the predicted belief, the likelihood and the posterior belief.
The next cell contains the main algorithm; your task is to fill in the missing `to-do` lines (hint: check the provided pseudocode above).
Change the `p_move` and `p_sense` values and re-run the experiment. How do these probabilities affect the performance?
Wrap-up
Congratulations, you have reached the end of this lab. Let's summarise what we have learnt:
- The role of the HMM components.
- Key applications of HMMs, such as POS tagging.
- How to perform POS tagging with a transition model and an emission model.
- How to implement the Viterbi algorithm and the discrete Bayes filter algorithm.
References
COMS30035 Machine Learning
Bishop - Pattern Recognition and Machine Learning: Chapter 13 - Sequential Data
University of Edinburgh's Foundations of Natural Language Processing (FNLP) and Decision Making in Robots and Autonomous Agents (DMR) courses