Keras RNN (Recurrent Neural Network) - Language Model
Language modeling (LM) is one of the foundational tasks in natural language processing (NLP). At a high level, the goal is to predict the (n + 1)-th token in a sequence given the n tokens preceding it. Well-trained language models are used in applications such as machine translation and speech recognition or, to name a more concrete product, predictive keyboards such as SwiftKey.
A language model can operate at the word level, sub-word level or character level, each with its own set of benefits and challenges. In practice, word-level LMs tend to perform better than character-level LMs, but they suffer from increased computational cost due to large vocabulary sizes. They also require more data preprocessing, such as dealing with infrequent words and out-of-vocabulary words. Character-level LMs, on the other hand, do not face these issues, as the vocabulary only consists of a limited set of characters. This, however, is not without drawbacks. Character-level LMs are more prone to vanishing gradient problems: given the sentence "I am happy", a word-level LM would treat it as 3 time steps (3 words/tokens), while a character-level LM would treat it as 8 time steps (8 characters, or 10 if the spaces are counted). Hence, as the number of words/tokens in a sentence increases, the number of time steps that a character-level LM needs to capture becomes substantially higher than that of a word-level LM. To sum up, the distinction between word-level and character-level LMs means that achieving state-of-the-art results on the two tasks often requires different network architectures, and these architectures are usually not readily transferable.
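To make the time step comparison concrete, the short sketch below counts the steps for the example sentence under each tokenization. The variable names are illustrative only and are not taken from the original notebook.

```python
sentence = "I am happy"

# Word-level: one time step per space-separated token.
word_steps = sentence.split()

# Character-level: one time step per character.
char_steps_without_spaces = [c for c in sentence if c != " "]
char_steps_with_spaces = list(sentence)

print(len(word_steps))                 # 3
print(len(char_steps_without_spaces))  # 8
print(len(char_steps_with_spaces))     # 10
```

Whether spaces count as time steps depends on how the character vocabulary is defined; most character-level models keep them, since they mark word boundaries.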
Implementation
This notebook demonstrates the basic workflow of:
Preparing text for developing a word-level language model.
Training a neural network that contains an embedding layer and an LSTM layer, then using the learned model to generate new text with properties similar to the input text.
As with all text analysis, there are many preprocessing steps that can make the corpus more ready for downstream modeling. Here we'll stick to some really basic ones, as preprocessing is not the main focus. The steps include:
Splitting the text into words/tokens based on spaces. From the first few words, we can see that some words are separated by "--", hence we'll replace that with a space first.
Removing punctuation marks and retaining only alphabetical words.
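The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's original code: the function name `clean_text` and the sample sentence are made up for the example.

```python
import string


def clean_text(text):
    """Apply the basic cleaning steps described above and return a token list."""
    # Some words are joined by "--", so turn that into a space first.
    text = text.replace("--", " ")
    # Split into tokens on whitespace.
    tokens = text.split()
    # Strip punctuation characters from every token.
    table = str.maketrans("", "", string.punctuation)
    tokens = [token.translate(table) for token in tokens]
    # Keep only purely alphabetical tokens (drops numbers and empty strings).
    return [token for token in tokens if token.isalpha()]


sample = "He went down--yesterday, at 5 o'clock--to the harbour."
tokens = clean_text(sample)
print(tokens)
# ['He', 'went', 'down', 'yesterday', 'at', 'oclock', 'to', 'the', 'harbour']
```

Note that stripping punctuation turns "o'clock" into "oclock" and the `isalpha` filter drops the token "5", which is the kind of information loss this simple scheme accepts.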
The next step is to map each distinct word to an integer, so we can convert words into integers and feed them into our model later.
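One way to build such a mapping is with plain dictionaries, as sketched below. The names `word2index` and `index2word` are illustrative, and starting the indices at 1 (leaving 0 free, e.g. for padding) is an assumption of this sketch, not something the text prescribes.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Sort the distinct words so the mapping is deterministic.
vocab = sorted(set(tokens))

# Assumption: indices start at 1 so that 0 stays available for padding.
word2index = {word: index + 1 for index, word in enumerate(vocab)}
index2word = {index: word for word, index in word2index.items()}

# Convert the token sequence into its integer representation.
encoded = [word2index[word] for word in tokens]
print(encoded)  # [5, 1, 4, 3, 5, 2]
```

The reverse mapping `index2word` is what lets us turn the model's predicted indices back into readable words later on.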
Recall that a language model's task is to take a sequence of words and predict the next word, hence a key design decision is how long the input sequence should be. There is no one-size-fits-all solution to this problem. Here, we will split the text into sub-sequences with a fixed length of 40 and map the original words to indices.
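The sub-sequence generation can be sketched as a sliding window over the encoded text. The helper `make_sequences` is hypothetical, and the sketch assumes that each window of `seq_len` input words is paired with the single word that follows it as the target; a short length of 4 is used instead of 40 so the result is easy to inspect.

```python
def make_sequences(encoded, seq_len):
    """Slide a window over the encoded text, pairing each window with the next word."""
    inputs, targets = [], []
    for start in range(len(encoded) - seq_len):
        inputs.append(encoded[start:start + seq_len])
        targets.append(encoded[start + seq_len])
    return inputs, targets


# Stand-in for an integer-encoded corpus of 10 words.
encoded = list(range(1, 11))
X, y = make_sequences(encoded, seq_len=4)

print(len(X))        # 6
print(X[0], y[0])    # [1, 2, 3, 4] 5
print(X[-1], y[-1])  # [6, 7, 8, 9] 10
```

Longer windows give the model more context per prediction but yield fewer training examples from the same corpus and more time steps to propagate gradients through.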
In order to test the trained model, one can compare the model's predicted words against the actual word sequence in the dataset.
Despite not being a perfect match, we can see that there is still a rough correspondence between the predicted tokens and the actual ones. Training a network that performs better at language modeling requires a much larger corpus, more training and more optimization. Hopefully, though, this post has given us a basic understanding of the general process of building a language model.
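As a rough sketch of this workflow, the block below defines a small embedding + LSTM model, fits it briefly, and compares its predicted next-word indices with the actual ones. It assumes TensorFlow with its bundled Keras is installed. The layer sizes, the sparse categorical cross-entropy loss and the random toy data are all assumptions made for illustration, so the resulting predictions carry no meaning; with a real corpus, `X` and `y` would come from the encoded sub-sequences.

```python
import numpy as np
from tensorflow import keras

vocab_size = 20  # hypothetical vocabulary size
seq_len = 5      # shorter than 40 to keep the example fast

# Random toy data standing in for encoded sub-sequences and their next words.
rng = np.random.default_rng(0)
X = rng.integers(0, vocab_size, size=(32, seq_len)).astype("int32")
y = rng.integers(0, vocab_size, size=(32,)).astype("int32")

model = keras.Sequential([
    keras.Input(shape=(seq_len,), dtype="int32"),
    # Map each word index to a dense vector.
    keras.layers.Embedding(input_dim=vocab_size, output_dim=8),
    # Summarize the sequence into the last hidden state.
    keras.layers.LSTM(16),
    # Probability distribution over the vocabulary for the next word.
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=1, batch_size=8, verbose=0)

# Compare the most likely predicted word index with the actual next word.
probs = model.predict(X[:4], verbose=0)
predicted = probs.argmax(axis=1)
actual = y[:4]
accuracy = float((predicted == actual).mean())
print(predicted, actual, accuracy)
```

To read the comparison as words rather than indices, both `predicted` and `actual` can be passed through the index-to-word mapping built during preprocessing.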
The following section lists some ideas worth trying:
Sentence-wise model. When generating the sub-sequences for the language model, we could perform sentence detection first by splitting the documents into sentences, then pad each sentence to a fixed length (the length can be determined by the longest sentence).
Simplify vocabulary. Perform further text preprocessing such as removing stop words or stemming.
Hyperparameter tuning. For example, vary the size of the embedding layer or the LSTM layer, or include dropout, and see whether a different hyperparameter setting leads to a better model. Note that if we wish to build a stacked LSTM using Keras, some changes to the code above are required, as elaborated below:
When stacking LSTM layers, rather than passing only the last hidden state to the next layer (e.g. the Dense layer), all the hidden states are used as input to the subsequent LSTM layer. In other words, every LSTM layer that feeds another LSTM layer produces an output for every time step, as opposed to 1 output across all time steps. The diagram depicts the pattern for what 2 layers would look like:
To illustrate the difference, suppose we have two input examples (a batch size of 2), each with a fixed number of 3 time steps.
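A minimal sketch of this setup is shown below, assuming TensorFlow with its bundled Keras is installed. The single feature per time step and the 4 LSTM units are arbitrary choices for the illustration.

```python
import numpy as np
from tensorflow import keras

# Two examples (batch size 2), 3 time steps each, 1 feature per time step.
x = np.arange(6, dtype="float32").reshape(2, 3, 1)

# An LSTM layer with default settings (return_sequences=False).
model = keras.Sequential([
    keras.Input(shape=(3, 1)),
    keras.layers.LSTM(4),
])

out = model.predict(x, verbose=0)
print(out.shape)  # (2, 4)
```

The output has one vector of size 4 per example: the time dimension is gone because only the hidden state at the last time step is returned.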
Looking at the output of the LSTM layer, we can see that it outputs a single (the last) hidden state for the input sequence. If we are to build a stacked LSTM, we need access to the hidden state output at each time step. This can be done by setting the return_sequences argument to True when defining the LSTM layer.
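The effect of `return_sequences=True` can be checked on the same kind of toy input, as in the sketch below. It again assumes TensorFlow with its bundled Keras; the unit counts (4 and 2) are arbitrary.

```python
import numpy as np
from tensorflow import keras

# Two examples (batch size 2), 3 time steps each, 1 feature per time step.
x = np.arange(6, dtype="float32").reshape(2, 3, 1)

# A single LSTM layer that returns the hidden state at every time step.
seq_model = keras.Sequential([
    keras.Input(shape=(3, 1)),
    keras.layers.LSTM(4, return_sequences=True),
])
seq_out = seq_model.predict(x, verbose=0)
print(seq_out.shape)  # (2, 3, 4)

# A stacked LSTM: the first layer hands all of its hidden states to the
# second layer, which returns only its last hidden state.
stacked_model = keras.Sequential([
    keras.Input(shape=(3, 1)),
    keras.layers.LSTM(4, return_sequences=True),
    keras.layers.LSTM(2),
])
stacked_out = stacked_model.predict(x, verbose=0)
print(stacked_out.shape)  # (2, 2)
```

Without `return_sequences=True` on the first layer, the second LSTM would receive a 2-dimensional input and Keras would raise an error, since an LSTM layer expects a (batch, time steps, features) input.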
When stacking LSTM layers, we should specify return_sequences=True on every LSTM layer that feeds another LSTM layer, so that the next layer has access to all of the previous layer's hidden states.
Reference
Blog: Text Generation With LSTM Recurrent Neural Networks in Python with Keras
Blog: How to Develop a Word-Level Neural Language Model and Use it to Generate Text
Blog: Keras LSTM tutorial – How to easily build a powerful deep learning language model
Blog: Understand the Difference Between Return Sequences and Return States for LSTMs in Keras