Word Embeddings: Training the CBOW model
In previous lecture notebooks you saw how to prepare data before feeding it to a continuous bag-of-words model, as well as the model itself: its architecture and activation functions. This notebook will walk you through:
Forward propagation.
Cross-entropy loss.
Backpropagation.
Gradient descent.
These are the concepts you need to understand how the training of the model works.
Let's dive into it!
Forward propagation
Let's dive into the neural network itself, which is shown below with all the dimensions and formulas you'll need.
Figure 2: The CBOW model architecture, with its dimensions and formulas.

Set $N$ equal to 3. Remember that $N$ is a hyperparameter of the CBOW model that represents the size of the word embedding vectors, as well as the size of the hidden layer.

Also set $V$ equal to 5, which is the size of the vocabulary we have used so far.
Initialization of the weights and biases
Before you start training the neural network, you need to initialize the weight matrices and bias vectors with random values.
In the assignment you will implement a function to do this yourself using numpy.random.rand. In this notebook, we've pre-populated these matrices and vectors for you.
Check that the dimensions of these matrices match those shown in the figure above.
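For reference, here is a minimal sketch of what such an initialization could look like, assuming the dimensions shown in the figure (the matrices in this notebook are pre-populated, so your random values would differ):

```python
import numpy as np

N = 3  # size of the word embedding vectors / hidden layer
V = 5  # size of the vocabulary

# Random initialization; shapes assumed to match the figure above
W1 = np.random.rand(N, V)  # N x V
W2 = np.random.rand(V, N)  # V x N
b1 = np.random.rand(N, 1)  # N x 1 column vector
b2 = np.random.rand(V, 1)  # V x 1 column vector
```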
Before moving forward, you will need some functions and variables defined in previous notebooks. They are provided in the next cell. Be sure you understand everything that is going on there; if not, consider revisiting the first lecture notebook.
Training example
Run the next cells to get the first training example, made of the vector representing the context words "i am because i", and the target which is the one-hot vector representing the center word "happy".
You don't need to worry about the Python syntax, but there are some explanations below if you want to know what's happening behind the scenes.
get_training_examples, which uses the yield keyword, is known as a generator. When called, it builds an iterator, which is a special type of object that you can iterate on (using a for loop, for instance) to retrieve the successive values that the function generates. In this case, get_training_examples yields training examples, and iterating on training_examples will return the successive training examples.

next is a built-in function that gets the next available value from an iterator. Here, you'll get the very first value, which is the first training example. If you run this cell again, you'll get the next value, and so on until the iterator runs out of values to return. In this notebook, next is used because you will only be performing one iteration of training. In this week's assignment, with the full training over several iterations, you'll use regular for loops with the iterator that supplies the training examples.
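As a toy illustration of how yield and next behave (this small standalone generator is not part of the notebook's own code):

```python
def count_up_to(n):
    # Each time `yield` runs, the generator hands back one value and pauses
    for i in range(1, n + 1):
        yield i

counter = count_up_to(3)  # builds an iterator; no values produced yet
print(next(counter))      # 1 - first value
print(next(counter))      # 2 - next value, and so on until exhaustion
```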
The vector representing the context words, which will be fed into the neural network, is:
The one-hot vector representing the center word to be predicted is:
Now convert these vectors into matrices (or 2D arrays) to be able to perform matrix multiplication on the right types of objects, as explained in a previous notebook.
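A minimal sketch of this conversion, assuming the training example comes back as two 1-D numpy arrays of length V (the names x_array and y_array are assumptions for illustration):

```python
# Reshape the 1-D arrays into V x 1 column vectors (2-D arrays)
x = x_array.reshape(V, 1)  # context-word vector
y = y_array.reshape(V, 1)  # one-hot vector for the center word
```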
Now you will need the activation functions seen before. Again, if this feels unfamiliar, consider checking the previous lecture notebook.
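As a refresher, here is one possible implementation of ReLU and softmax consistent with the previous notebooks (treat this as a sketch rather than the exact cell contents):

```python
def relu(z):
    # Replace negative entries with 0, leave the rest unchanged
    result = z.copy()
    result[result < 0] = 0
    return result

def softmax(z):
    # Exponentiate and normalize so the entries sum to 1
    e_z = np.exp(z)
    return e_z / np.sum(e_z)
```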
Values of the hidden layer
Now that you have initialized all the variables that you need for forward propagation, you can calculate the values of the hidden layer using the following formulas:

$$\mathbf{z_1} = \mathbf{W_1}\mathbf{x} + \mathbf{b_1}$$

$$\mathbf{h} = \mathrm{ReLU}(\mathbf{z_1})$$

First, you can calculate the value of $\mathbf{z_1}$.
np.dot is numpy's function for matrix multiplication.
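A sketch of this computation, using the variables defined above:

```python
# (N x V) @ (V x 1) + (N x 1) -> N x 1
z1 = np.dot(W1, x) + b1
print(z1)
```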
As expected you get an $N$ by 1 matrix, or column vector with $N$ elements, where $N$ is equal to the embedding size, which is 3 in this example.
You can now take the ReLU of $\mathbf{z_1}$ to get $\mathbf{h}$, the vector with the values of the hidden layer.
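One way to do it, reusing the relu function from above:

```python
h = relu(z1)  # hidden layer values, N x 1
print(h)
```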
Applying ReLU means that the negative element of $\mathbf{z_1}$ has been replaced with a zero.
Values of the output layer
Here are the formulas you need to calculate the values of the output layer, represented by the vector $\mathbf{\hat{y}}$:

$$\mathbf{z_2} = \mathbf{W_2}\mathbf{h} + \mathbf{b_2}$$

$$\mathbf{\hat{y}} = \mathrm{softmax}(\mathbf{z_2})$$

First, calculate $\mathbf{z_2}$.
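A sketch of this step:

```python
# (V x N) @ (N x 1) + (V x 1) -> V x 1
z2 = np.dot(W2, h) + b2
print(z2)
```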
Expected output:
This is a $V$ by 1 matrix, where $V$ is the size of the vocabulary, which is 5 in this example.
Now calculate the value of $\mathbf{\hat{y}}$.
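Using the softmax function defined earlier:

```python
y_hat = softmax(z2)  # probability distribution over the V vocabulary words
print(y_hat)
```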
Expected output:
As you've performed the calculations with random matrices and vectors (apart from the input vector), the output of the neural network is essentially random at this point. The learning process will adjust the weights and biases to match the actual targets better.
That being said, what word did the neural network predict?
The neural network predicted the word "happy": the largest element of $\mathbf{\hat{y}}$ is the third one, and the third word of the vocabulary is "happy".
Here's how you could implement this in Python:
print(Ind2word[np.argmax(y_hat)])
Well done, you've completed the forward propagation phase!
Cross-entropy loss
Now that you have the network's prediction, you can calculate the cross-entropy loss to determine how accurate the prediction was compared to the actual target.
Remember that you are working on a single training example, not on a batch of examples, which is why you are using the loss rather than the cost; the cost is the generalization of the loss to a batch of examples.
First let's recall what the prediction was.
And the actual target value is:
The formula for cross-entropy loss is:

$$J = -\sum_{k=1}^{V} y_k \log \hat{y}_k$$
Try implementing the cross-entropy loss function yourself, so you get more familiar working with numpy.

Here are some hints if you're stuck.
To multiply two numpy matrices (such as y and y_hat) element-wise, you can simply use the * operator.
Once you have a vector equal to the element-wise multiplication of y and y_hat, you can use np.sum to calculate the sum of the elements of this vector.
loss = np.sum(-np.log(y_hat)*y)
Don't forget to run the cell containing the cross_entropy_loss function once it is solved.
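Putting the hints together, one possible definition of the function (a sketch; the parameter names are illustrative):

```python
def cross_entropy_loss(y_predicted, y_actual):
    # y_actual is one-hot, so the element-wise product keeps only the
    # predicted probability of the true center word
    loss = np.sum(-np.log(y_predicted) * y_actual)
    return loss
```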
Now use this function to calculate the loss with the actual values of $\mathbf{y}$ and $\mathbf{\hat{y}}$.
Expected output:
This value is neither good nor bad, which is expected as the neural network hasn't learned anything yet.
The actual learning will start during the next phase: backpropagation.
Backpropagation
The formulas that you will implement for backpropagation are the following:

$$\frac{\partial J}{\partial \mathbf{W_1}} = \mathrm{ReLU}\left(\mathbf{W_2}^\top (\mathbf{\hat{y}} - \mathbf{y})\right) \mathbf{x}^\top$$

$$\frac{\partial J}{\partial \mathbf{W_2}} = (\mathbf{\hat{y}} - \mathbf{y})\, \mathbf{h}^\top$$

$$\frac{\partial J}{\partial \mathbf{b_1}} = \mathrm{ReLU}\left(\mathbf{W_2}^\top (\mathbf{\hat{y}} - \mathbf{y})\right)$$

$$\frac{\partial J}{\partial \mathbf{b_2}} = \mathbf{\hat{y}} - \mathbf{y}$$

Note: these formulas are slightly simplified compared to the ones in the lecture, as you're working on a single training example, whereas the lecture provided the formulas for a batch of examples. In the assignment you'll be implementing the latter.
Let's start with an easy one.
Calculate the partial derivative of the loss function with respect to $\mathbf{b_2}$, and store the result in grad_b2.
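One possible solution, following the formula above:

```python
# dJ/db2 = y_hat - y
grad_b2 = y_hat - y
print(grad_b2)
```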
Expected output:
Next, calculate the partial derivative of the loss function with respect to $\mathbf{W_2}$, and store the result in grad_W2.
Hint: use .T to get a transposed matrix, e.g. h.T returns $\mathbf{h}^\top$.
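A sketch of this computation:

```python
# dJ/dW2 = (y_hat - y) h^T -> (V x 1) @ (1 x N) = V x N
grad_W2 = np.dot(y_hat - y, h.T)
print(grad_W2)
```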
Expected output:
Now calculate the partial derivative with respect to $\mathbf{b_1}$ and store the result in grad_b1.
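Again following the formula above:

```python
# dJ/db1 = ReLU(W2^T (y_hat - y)) -> N x 1
grad_b1 = relu(np.dot(W2.T, y_hat - y))
print(grad_b1)
```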
Expected output:
Finally, calculate the partial derivative of the loss with respect to $\mathbf{W_1}$, and store it in grad_W1.
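One possible solution:

```python
# dJ/dW1 = ReLU(W2^T (y_hat - y)) x^T -> (N x 1) @ (1 x V) = N x V
grad_W1 = np.dot(relu(np.dot(W2.T, y_hat - y)), x.T)
print(grad_W1)
```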
Expected output:
Before moving on to gradient descent, double-check that all the matrices have the expected dimensions.
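A quick way to verify this, as a sketch:

```python
# Each gradient must have the same shape as the parameter it updates
assert grad_W1.shape == W1.shape == (N, V)
assert grad_W2.shape == W2.shape == (V, N)
assert grad_b1.shape == b1.shape == (N, 1)
assert grad_b2.shape == b2.shape == (V, 1)
```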
Gradient descent
During the gradient descent phase, you will update the weights and biases by subtracting $\alpha$ times the gradient from the original matrices and vectors, using the following formulas:

$$\mathbf{W_1} := \mathbf{W_1} - \alpha \frac{\partial J}{\partial \mathbf{W_1}}$$

$$\mathbf{W_2} := \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}}$$

$$\mathbf{b_1} := \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}}$$

$$\mathbf{b_2} := \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}}$$
First, let's set a value for $\alpha$.
The updated weight matrix $\mathbf{W_1}$ will be:
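For example, with a learning rate of 0.03 (the exact value here is an assumption for illustration), the update looks like this:

```python
alpha = 0.03  # example learning rate

# W1_new = W1 - alpha * dJ/dW1
W1_new = W1 - alpha * grad_W1
print(W1_new)
```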
Let's compare the previous and new values of $\mathbf{W_1}$:
The difference is very subtle (hint: take a closer look at the last row), which is why it takes a fair number of iterations to train the neural network until it reaches optimal weights and biases, starting from random values.
Now calculate the new values of $\mathbf{W_2}$ (to be stored in W2_new), $\mathbf{b_1}$ (in b1_new), and $\mathbf{b_2}$ (in b2_new).
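One possible solution, applying the same update rule to the remaining parameters:

```python
W2_new = W2 - alpha * grad_W2
b1_new = b1 - alpha * grad_b1
b2_new = b2 - alpha * grad_b2

print('W2_new')
print(W2_new)
print('b1_new')
print(b1_new)
print('b2_new')
print(b2_new)
```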
Expected output:
Congratulations, you have completed one iteration of training using one training example!
You'll need many more iterations to fully train the neural network, and you can optimize the learning process by training on batches of examples, as described in the lecture. You will get to do this during this week's assignment.
How this practice relates to and differs from the upcoming graded assignment
In the assignment, for each iteration of training you will use batches of examples instead of a single example. The formulas for forward propagation and backpropagation will be modified accordingly, and you will use cross-entropy cost instead of cross-entropy loss.
You will also complete several iterations of training, until you reach an acceptably low cross-entropy cost, at which point you can extract good word embeddings from the weight matrices.