Path: blob/master/2 - Natural Language Processing with Probabilistic Models/Week 3/C2W3_L2_Building the language model.ipynb
Building the language model
Count matrix
To calculate the n-gram probability, you will need to count frequencies of n-grams and n-gram prefixes in the training dataset. In some of the code assignment exercises, you will store the n-gram frequencies in a dictionary.
In other parts of the assignment, you will build a count matrix that keeps counts of each (n-1)-gram prefix followed by each possible last word in the vocabulary.
The following code shows how to check, retrieve and update counts of n-grams in the word count dictionary.
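Since the original code cell is not shown here, the snippet below is a minimal sketch of the idea: n-grams stored as tuple keys in a plain dictionary, with membership checks, `dict.get` for safe retrieval, and an increment for updates. The variable names are illustrative, not taken from the assignment.

```python
# Illustrative n-gram count dictionary: keys are tuples of words.
n_gram_counts = {
    ("i", "am", "happy"): 2,
    ("am", "happy", "because"): 1,
}

n_gram = ("i", "am", "happy")

# Check whether an n-gram has been seen before.
if n_gram in n_gram_counts:
    print(f"n-gram {n_gram} found")

# Retrieve its count, defaulting to 0 for unseen n-grams.
count = n_gram_counts.get(n_gram, 0)
print(count)

# Update: increment the count, starting from 0 if the n-gram is new.
n_gram_counts[n_gram] = n_gram_counts.get(n_gram, 0) + 1
print(n_gram_counts[n_gram])
```

Using tuples (rather than lists) as keys matters because dictionary keys must be hashable, and tuples of strings are.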
The next code snippet shows how to merge two tuples in Python. That will be handy when creating the n-gram from the prefix and the last word.
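The merge itself is just tuple concatenation with `+`; note that the last word must first be wrapped in a one-element tuple. A short sketch:

```python
prefix = ("i", "am")
last_word = "happy"

# Concatenate the (n-1)-gram prefix with the last word to form the n-gram.
# (last_word,) is a one-element tuple; "+" joins two tuples.
n_gram = prefix + (last_word,)
print(n_gram)  # ('i', 'am', 'happy')
```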
In the lecture, you've seen that the count matrix can be built in a single pass through the corpus. Here is one approach to doing that.
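One way to sketch the single-pass approach for trigrams: slide a window over the corpus, count each (bigram prefix, last word) pair in a dictionary, then convert the dictionary into a matrix whose rows are prefixes and whose columns are vocabulary words. The function and variable names below are illustrative; the assignment's own code may differ.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def single_pass_trigram_count_matrix(corpus):
    """Build a (bigram prefix) x (last word) count matrix in one pass.

    corpus: a list of word tokens.
    Returns a pandas DataFrame indexed by bigram tuples.
    """
    bigrams = []
    vocabulary = []
    count_matrix_dict = defaultdict(int)

    # One pass: slide a window of 3 words over the corpus.
    for i in range(len(corpus) - 2):
        trigram = tuple(corpus[i:i + 3])
        bigram, last_word = trigram[:-1], trigram[-1]
        if bigram not in bigrams:
            bigrams.append(bigram)
        if last_word not in vocabulary:
            vocabulary.append(last_word)
        count_matrix_dict[(bigram, last_word)] += 1

    # Convert the dictionary of counts into a dense matrix.
    count_matrix = np.zeros((len(bigrams), len(vocabulary)))
    for (bigram, last_word), count in count_matrix_dict.items():
        count_matrix[bigrams.index(bigram), vocabulary.index(last_word)] = count

    return pd.DataFrame(count_matrix, index=bigrams, columns=vocabulary)

corpus = ["i", "am", "happy", "because", "i", "am", "learning", "."]
count_matrix = single_pass_trigram_count_matrix(corpus)
print(count_matrix)
```

The `list.index` lookups keep the sketch short but are O(n); for a large vocabulary you would map words to column indices with a dictionary instead.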
Dividing each row of the count matrix by its row sum converts the counts into conditional probabilities. The resulting probability matrix lets you look up the probability of an input trigram: the row is the bigram prefix and the column is the last word.
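A self-contained sketch of the normalization and lookup, using a tiny hand-made count matrix (the values and labels are illustrative):

```python
import pandas as pd

# Toy count matrix: rows are bigram prefixes, columns are last words.
count_matrix = pd.DataFrame(
    [[2.0, 1.0],
     [0.0, 3.0]],
    index=["i am", "am happy"],
    columns=["happy", "because"],
)

# Divide each row by its sum so every row becomes a probability distribution.
row_sums = count_matrix.sum(axis=1)
prob_matrix = count_matrix.div(row_sums, axis=0)

# Probability of the trigram "i am happy":
# row = the bigram prefix, column = the last word.
probability = prob_matrix.loc["i am", "happy"]
print(probability)  # 0.666... = 2 / (2 + 1)
```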
In the code assignment, you will be searching for the most probable words starting with a prefix. You can use the method str.startswith to test if a word starts with a prefix.
Here is a code snippet showing how to use this method.
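Since the original cell isn't shown here, the following is a minimal sketch of filtering a vocabulary with `str.startswith` (the word list is made up):

```python
vocabulary = ["happy", "because", "learning", ".", "am", "have", "i"]
starts_with = "ha"

# Keep only the words that begin with the given prefix.
candidates = [word for word in vocabulary if word.startswith(starts_with)]
print(candidates)  # ['happy', 'have']
```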
Language model evaluation
Train/validation/test split
In the videos, you saw that to evaluate language models, you need to keep some of the corpus data for validation and testing.
The test and validation data should match the distribution of the data expected in the actual application as closely as possible. If nothing but the input corpus is available, random sampling from the corpus is used to define the test and validation subsets.
Here is code similar to what you'll see in the code assignment. The following function randomly shuffles the input data and returns train/validation/test subsets in the proportions given by the function parameters.
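A sketch of such a split function, assuming the proportions are given as percentages; the signature and names are illustrative and may differ from the assignment's version.

```python
import random

def train_validation_test_split(data, train_percent, validation_percent):
    """Shuffle data and split it into train/validation/test subsets.

    train_percent + validation_percent should be at most 100;
    whatever remains becomes the test set.
    """
    random.seed(87)  # fixed seed so the split is reproducible
    data = list(data)
    random.shuffle(data)

    train_size = int(len(data) * train_percent / 100)
    train_data = data[:train_size]

    validation_size = int(len(data) * validation_percent / 100)
    validation_data = data[train_size:train_size + validation_size]

    test_data = data[train_size + validation_size:]
    return train_data, validation_data, test_data

sentences = [f"sentence {i}" for i in range(100)]
train, validation, test = train_validation_test_split(sentences, 80, 10)
print(len(train), len(validation), len(test))  # 80 10 10
```

Fixing the random seed makes the experiment repeatable; in practice you would set it once per run rather than inside the function.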
That's all for the lab for "N-gram language model" lesson of week 3.