Path: blob/master/deep_learning/rnn/1_tensorflow_rnn.ipynb
RNN (Recurrent Neural Network)
The idea behind RNNs is to make use of the sequential information that exists in our dataset. In a feed-forward neural network, we assume that all inputs and outputs are independent of each other. But for some tasks, this might not be the best way to tackle the problem. For example, in Natural Language Processing (NLP) applications, if we wish to predict the next word in a sentence (one business application of this is Swiftkey), we can imagine that knowing the word that comes before it comes in handy. RNNs are called recurrent because they perform the same task for every element of a sequence (sharing the weights), with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far.
The following diagram shows what a typical RNN-type network looks like:
We can think of RNN-type networks as networks with loops. During the forward pass, the RNN is unrolled/unfolded into a full network. By unrolling, we are referring to the fact that we perform the computation for the complete sequence. For example:
If the input sequence is a sentence of 5 words, the network (RNN cell) would be unrolled into 5 copies, one copy for each word.
If we were to consider every row of an image as a sequence of pixels: an MNIST image's shape is 28x28 pixels, so we would be handling 28 time steps, each with a feature size of 28, for every sample (see the reshaping sketch after this list).
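To make the MNIST framing concrete, here is a minimal NumPy sketch (the array names and batch size are our own illustration, not the notebook's code) of reshaping a batch of flattened images into sequences of 28 time steps with 28 features each:

```python
# A minimal NumPy sketch: treating each row of a 28x28 MNIST image as one
# time step gives a sequence of 28 steps, each with 28 features.
import numpy as np

batch_size, n_steps, n_features = 32, 28, 28

# stand-in for a batch of flattened MNIST images
images = np.random.rand(batch_size, 28 * 28)

# reshape each sample into [n_steps, n_features] so the RNN sees one row per step
sequences = images.reshape(batch_size, n_steps, n_features)
print(sequences.shape)  # (32, 28, 28)
```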
The formulas for the computation happening in an RNN cell are as follows:
$x_t$: The input at time step $t$, taking the size of the feature space, e.g. a one-hot vector or embedding of the input word.
$s_t$: The hidden state at time step $t$. This is essentially the "memory" of the network and is calculated based on the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$. The function $f$ is usually a nonlinearity such as tanh or relu. At the first step, $s_{t-1}$ is usually initialized to all zeros in order to calculate the first hidden state.
$o_t$: The output at step $t$. For example, if we wish to predict the most probable word in a sentence, i.e. a classification problem, then the computation can be a linear layer followed by a softmax: $o_t = \text{softmax}(V s_t)$ (a NumPy sketch of this recurrence follows this list).
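To make the recurrence explicit, here is a minimal NumPy sketch of a single RNN cell unrolled over a short sequence. The weight names $U$, $W$, $V$ follow the formulas above; the sizes match the ones used later in this notebook (feature size 28, hidden size 128, 10 classes), and the random inputs are placeholders, not real data:

```python
# A minimal NumPy sketch of s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t);
# an illustration of the formulas, not the notebook's implementation.
import numpy as np

n_features, n_hidden, n_classes = 28, 128, 10
U = np.random.randn(n_hidden, n_features) * 0.01  # input-to-hidden weights
W = np.random.randn(n_hidden, n_hidden) * 0.01    # hidden-to-hidden weights
V = np.random.randn(n_classes, n_hidden) * 0.01   # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)  # new hidden state ("memory")
    o_t = softmax(V @ s_t)               # output distribution at this step
    return s_t, o_t

# unroll over a sequence of 5 inputs, starting from an all-zero hidden state
s_t = np.zeros(n_hidden)
for x_t in np.random.rand(5, n_features):
    s_t, o_t = rnn_step(x_t, s_t)
print(s_t.shape, o_t.shape)  # (128,) (10,)
```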
A few things to note:
Unlike a traditional deep neural network, which uses different parameters at each layer, an RNN shares the same parameters ($U$, $V$, $W$ above) across all steps. This reflects the fact that we are performing the same task at each step, just with different inputs; such a design greatly reduces the total number of parameters we need to learn.
The above diagram has outputs at each time step, but depending on the task this may not be necessary. For example, in basic sequence classification, we can assume the last hidden state has accumulated the information representing the entire sequence. A concrete example: when predicting the sentiment of a sentence, we may only care about the final output, not the sentiment after each word (see the sketch below).
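As a rough sketch of this "last hidden state only" setup (assuming TensorFlow 1.x-style APIs; the notebook's actual implementation follows in the next section), a sequence classifier can feed just the final state into a linear layer:

```python
# A sketch assuming TensorFlow 1.x-style APIs, not the notebook's exact code:
# classify a sequence using only the final hidden state of the RNN.
import tensorflow as tf

n_steps, n_features, n_hidden, n_classes = 28, 28, 128, 10

inputs = tf.placeholder(tf.float32, [None, n_steps, n_features])
cell = tf.nn.rnn_cell.BasicRNNCell(n_hidden)

# outputs: hidden state at every time step, shape [batch, n_steps, n_hidden]
# state:   hidden state after the last time step, shape [batch, n_hidden]
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# a single linear layer on the final state produces the class logits
logits = tf.layers.dense(state, n_classes)
```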
Implementation
We'll use the MNIST dataset as our example dataset, as it requires little preprocessing and lets us focus on the algorithm at hand. Loading the MNIST data via `from tensorflow.examples.tutorials.mnist import input_data` will raise a lot of deprecation warnings, thus we leverage Keras' MNIST data and implement a class to generate batches of data from it (a sketch of such a generator is shown below).
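The following is a sketch of what such a batch-generating class might look like (the class and method names are our own illustration, not the notebook's implementation); it loads MNIST through Keras and serves batches shaped [batch_size, 28 time steps, 28 features]:

```python
# A sketch of a batch generator built on Keras' MNIST data; the class and
# method names are illustrative, not the notebook's exact implementation.
import numpy as np
from tensorflow.keras.datasets import mnist

class MnistBatchGenerator:

    def __init__(self, batch_size=64):
        (x_train, y_train), _ = mnist.load_data()
        self.x = x_train.astype(np.float32) / 255.0  # scale pixels to [0, 1]
        self.y = y_train
        self.batch_size = batch_size
        self.cursor = 0

    def next_batch(self):
        start, end = self.cursor, self.cursor + self.batch_size
        self.cursor = end if end < len(self.x) else 0
        # each image is already 28x28, i.e. 28 time steps of 28 features
        return self.x[start:end], self.y[start:end]

generator = MnistBatchGenerator(batch_size=64)
x_batch, y_batch = generator.next_batch()
print(x_batch.shape, y_batch.shape)  # (64, 28, 28) (64,)
```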
It can be helpful to write down the dimensions of our input and weights. Here our MNIST images' feature size is 28, the number of possible outputs/targets is 10, and assume we've set our hidden layer size to 128 (this is a hyperparameter that we can tune; increasing it makes the "memory" capable of memorizing more complex patterns, but also results in additional computation and raises the risk of overfitting). Then we have: