Path: blob/master/2 - Natural Language Processing with Probabilistic Models/Week 3/C2W3_L3_Language model generalization.ipynb
Out of vocabulary words (OOV)
Vocabulary
In the video about out-of-vocabulary words, you saw that the first step in dealing with unknown words is to decide which words belong to the vocabulary.
In the code assignment, you will try the method based on minimum frequency: all words appearing in the training set with a frequency greater than or equal to the minimum frequency are added to the vocabulary.
Here is code for the other method, where the target size of the vocabulary is known in advance and the vocabulary is filled with words based on their frequency in the training set.
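A minimal sketch of this approach, using collections.Counter on a made-up two-sentence corpus (the corpus and the target size M below are illustrative, not the assignment's data):

```python
from collections import Counter

# Illustrative training corpus (not the assignment's data).
sentences = [
    "i am happy because i am learning",
    "i am happy i am learning",
]
word_counts = Counter(word for s in sentences for word in s.split())

# Target vocabulary size, known in advance (an illustrative choice).
M = 3

# Sort words by frequency, highest first, and keep the top M.
vocabulary = [word for word, count in word_counts.most_common(M)]
print(vocabulary)  # ['i', 'am', 'happy']
```

Note that ties at the frequency cutoff are broken by the order in which words first appear in the corpus, since Counter.most_common is stable.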
Now that the vocabulary is ready, you can use it to replace the OOV words with a special unknown-word token such as <UNK>, as you saw in the lecture.
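A short sketch of the replacement step; the vocabulary, sentence, and helper name replace_oov are assumed for illustration:

```python
# Assumed vocabulary, e.g. built as in the previous step.
vocabulary = {"i", "am", "happy"}
UNKNOWN_TOKEN = "<UNK>"

def replace_oov(tokens, vocabulary, unknown_token=UNKNOWN_TOKEN):
    """Map every token not found in the vocabulary to the unknown token."""
    return [t if t in vocabulary else unknown_token for t in tokens]

sentence = "i am happy because i am learning".split()
print(replace_oov(sentence, vocabulary))
# ['i', 'am', 'happy', '<UNK>', 'i', 'am', '<UNK>']
```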
When building the vocabulary in the code assignment, you will need to know how to iterate through the word counts dictionary.
Here is an example of a similar task showing how to go through all the word counts and print out only the words with the frequency equal to f.
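A sketch of that iteration; the word counts and the value of f are made up for illustration:

```python
# Illustrative word counts dictionary and target frequency.
word_counts = {"happy": 5, "because": 3, "i": 2, "am": 2, "learning": 3, ".": 1}
f = 3

# Go through all the word counts and keep only words with frequency f.
words_with_frequency_f = []
for word, count in word_counts.items():
    if count == f:
        print(word)
        words_with_frequency_f.append(word)
# prints: because, learning
```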
As mentioned in the videos, if there are many OOV replacements in your training and test sets, you may get a very low perplexity even though the model itself wouldn't be very helpful.
Here is a sample code showing this unwanted effect.
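A sketch of this effect under a simple unigram model (the corpora below are made up): when most tokens have been replaced with <UNK>, the model assigns high probability to <UNK> sequences, and the test-set perplexity looks deceptively low.

```python
import math

# A heavily replaced training set: most tokens became <UNK>.
training_set = ["<UNK>", "<UNK>", "<UNK>", "like", "<UNK>"]
# A test set that, after replacement, is all <UNK>.
test_set = ["<UNK>", "<UNK>", "<UNK>", "<UNK>"]

def unigram_probability(word, corpus):
    """Maximum-likelihood unigram probability estimated from the corpus."""
    return corpus.count(word) / len(corpus)

# Perplexity of the test set under the unigram model.
log_prob_sum = sum(math.log(unigram_probability(w, training_set)) for w in test_set)
perplexity = math.exp(-log_prob_sum / len(test_set))
print(perplexity)  # ~1.25, misleadingly low: <UNK> alone has probability 0.8
```

The low perplexity here says nothing about how well the model handles actual words; it mostly measures how often <UNK> follows <UNK>.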
Smoothing
Add-k smoothing was described as a method for smoothing the probabilities of previously unseen n-grams.
Here is example code that shows how to implement add-k smoothing, but also highlights a disadvantage of this method: n-grams not previously seen in the training dataset are assigned too much probability.
In the code output below, you'll see that a phrase that is in the training set gets the same probability as an unknown phrase.
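A sketch of add-k smoothing with made-up counts (the counts, vocabulary size, and k are illustrative): with k = 3 and a vocabulary of 5 words, a trigram seen twice in training ends up with the same probability as a trigram never seen at all.

```python
def add_k_probability(k, vocabulary_size, ngram_count, prefix_count):
    """Add-k smoothed probability of an n-gram given its (n-1)-gram prefix."""
    return (ngram_count + k) / (prefix_count + k * vocabulary_size)

# Illustrative values.
vocabulary_size = 5
k = 3

# A trigram seen twice in training; its prefix bigram was seen 10 times.
p_seen = add_k_probability(k, vocabulary_size, 2, 10)
# A trigram never seen in training: both counts are zero.
p_unseen = add_k_probability(k, vocabulary_size, 0, 0)

print(p_seen, p_unseen)  # 0.2 0.2 -- the seen phrase gains nothing
```

With a large k relative to the counts, the smoothing term dominates the numerator and denominator, which is exactly why the unseen trigram catches up with the seen one.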
Back-off
Back-off is a model generalization method that leverages information from lower-order n-grams when information about the higher-order n-grams is missing. For example, if the probability of a trigram is missing, use the bigram information, and so on.
Here you can see an example of a simple back-off technique.
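A sketch of one simple variant, "stupid backoff": if the trigram probability is missing, fall back to the bigram probability scaled by a constant (0.4 is a commonly used value), and to the unigram if the bigram is also missing. The probability tables below are made up for illustration.

```python
# Illustrative probability tables; the trigram information is missing.
trigram_probs = {}                          # ('are', 'you', 'happy') not seen
bigram_probs = {("you", "happy"): 0.3}
unigram_probs = {("happy",): 0.4}

BACKOFF_FACTOR = 0.4  # constant discount applied at each back-off step

def backoff_probability(trigram):
    """Return the trigram probability, backing off to lower orders if needed."""
    if trigram in trigram_probs:
        return trigram_probs[trigram]
    bigram = trigram[1:]
    if bigram in bigram_probs:
        return BACKOFF_FACTOR * bigram_probs[bigram]
    unigram = trigram[2:]
    return BACKOFF_FACTOR * BACKOFF_FACTOR * unigram_probs.get(unigram, 0.0)

print(backoff_probability(("are", "you", "happy")))  # 0.4 * 0.3 = 0.12
```

Note that scaling by a constant means the result is a score rather than a true probability distribution; proper back-off methods such as Katz back-off redistribute probability mass instead.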
Interpolation
The other method for using the probabilities of lower-order n-grams is interpolation. In this case, you use the weighted probabilities of n-grams of all orders every time, not just when higher-order information is missing.
For example, you always combine the trigram, bigram, and unigram probabilities. You can see how this works in the following code snippet.
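A sketch of the interpolation, with assumed weights and probability estimates (all values below are illustrative; in practice the lambdas are tuned on held-out data and must sum to 1):

```python
# Interpolation weights for trigram, bigram, and unigram (must sum to 1).
lambda_1, lambda_2, lambda_3 = 0.7, 0.2, 0.1

# Assumed probability estimates for the word "happy" in context.
p_trigram = 0.15   # P(happy | i am)
p_bigram = 0.25    # P(happy | am)
p_unigram = 0.40   # P(happy)

# Weighted combination of all three orders.
p_interpolated = (lambda_1 * p_trigram
                  + lambda_2 * p_bigram
                  + lambda_3 * p_unigram)
print(p_interpolated)  # 0.105 + 0.05 + 0.04 = 0.195
```

Unlike back-off, the lower-order probabilities contribute even when the trigram was seen in training, which smooths the estimate toward more robust lower-order statistics.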
That's it for Week 3. You should now be ready for the code assignment.