Manipulating word embeddings
In this week's assignment, you are going to use a pre-trained word embedding for finding word analogies and equivalence. This exercise can be used as an Intrinsic Evaluation for the word embedding performance. In this notebook, you will apply linear algebra operations using NumPy to find analogies between words manually. This will help you to prepare for this week's assignment.
Now that the model is loaded, we can take a look at the word representations. First, note that word_embeddings is a dictionary. Each word is a key, and its value is the corresponding vector representation. Remember that square brackets give access to an entry only if the key exists.
It is important to note that we store each vector as a NumPy array. This allows us to apply linear algebra operations to it.
The vectors have a size of 300, while the vocabulary size of Google News is around 3 million words!
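As a minimal sketch of this kind of lookup, the snippet below uses a tiny hypothetical dictionary with 4-dimensional vectors in place of the real 300-dimensional embeddings. The words and values are made up for illustration only.

```python
import numpy as np

# Hypothetical stand-in for the loaded embeddings: word -> NumPy vector.
# The real vectors have 300 components; these have 4 and are invented.
word_embeddings = {
    "country": np.array([0.1, -0.3, 0.5, 0.2]),
    "city": np.array([0.0, -0.2, 0.4, 0.3]),
}

# Square brackets return the vector stored under an existing key.
country_vector = word_embeddings["country"]
print(type(country_vector))
print(country_vector.shape)

# Square brackets raise KeyError for an unknown word;
# dict.get returns None instead, which is safer for lookups.
missing = word_embeddings.get("unicorn")
print(missing)
```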
Operating on word embeddings
Remember that understanding the data is one of the most critical steps in Data Science. Word embeddings are the result of machine learning processes and will be part of the input for further processes. These word embeddings need to be validated, or at least understood, because the performance of any model derived from them depends strongly on their quality.
Word embeddings are high-dimensional vectors, usually with hundreds of attributes, which makes them hard to interpret.
In this notebook, we will visually inspect the word embedding of some words using a pair of attributes. Raw attributes are not the best option for the creation of such charts but will allow us to illustrate the mechanical part in Python.
In the next cell, we make a beautiful plot for the word embeddings of some words. Even if plotting the dots gives an idea of the words, the arrow representations help to visualize the vector's alignment as well.
Note that similar words like 'village' and 'town' or 'petroleum', 'oil', and 'gas' tend to point in the same direction. Also, note that 'sad' and 'happy' look close to each other; however, the vectors point in opposite directions.
In this chart, one can read off both the angles and the distances between the words. Some pairs of words are close under both measures, while others are close in distance but far apart in angle.
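The two measures can also be computed numerically. The sketch below uses made-up 2-dimensional vectors, not the real embedding values, chosen to mimic what the chart shows: 'sad' and 'happy' are close in Euclidean distance but point in opposite directions, while 'town' and 'village' are close in both senses. The helper names are hypothetical.

```python
import numpy as np

def euclidean_distance(u, v):
    """Length of the straight line between the tips of u and v."""
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented 2-D vectors for illustration only.
sad = np.array([-0.2, 0.1])
happy = np.array([0.2, -0.1])
town = np.array([3.0, 1.0])
village = np.array([3.3, 1.2])

print(euclidean_distance(sad, happy), cosine_similarity(sad, happy))
print(euclidean_distance(town, village), cosine_similarity(town, village))
```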
Word distance
Now plot the words 'sad', 'happy', 'town', and 'village'. In this same chart, display the vector from 'village' to 'town' and the vector from 'sad' to 'happy'. Let us use NumPy for these linear algebra operations.
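The vector from one word to another is simply the difference of their embeddings. A minimal sketch with invented 2-dimensional values (the real notebook would use two selected columns of the 300-dimensional vectors):

```python
import numpy as np

# Invented 2-D vectors for illustration only.
town = np.array([3.0, 1.0])
village = np.array([3.3, 1.2])

# Vector that starts at 'village' and ends at 'town'.
village_to_town = town - village

# Adding the difference to the start point recovers the end point.
reconstructed_town = village + village_to_town
print(village_to_town)
```

When drawing this as an arrow, its tail sits at the coordinates of 'village' and its components are those of the difference vector.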
Linear algebra on word embeddings
In the lectures, we saw the analogies between words using algebra on word embeddings. Let us see how to do it in Python with NumPy.
To start, get the norm of a word in the word embedding.
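A minimal sketch of the norm computation, using a made-up 2-dimensional vector so the result is easy to check by hand:

```python
import numpy as np

# Invented vector; a real embedding would have 300 components.
word_vector = np.array([3.0, 4.0])

# Euclidean (L2) norm via NumPy.
norm = np.linalg.norm(word_vector)

# The same quantity written out: square root of the sum of squares.
manual_norm = np.sqrt(np.sum(word_vector ** 2))

print(norm, manual_norm)
```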
Predicting capitals
Now, applying vector difference and addition, one can create a vector representation for a new word. For example, we can say that the vector difference between 'France' and 'Paris' represents the concept of Capital.
One can move from the city of Madrid in the direction of the concept of Capital, and obtain something close to the country of which Madrid is the capital.
Note that the resulting vector 'country', which we expected to equal the vector for 'Spain', does not match it exactly.
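The arithmetic can be sketched as follows. The 3-dimensional vectors here are invented so that the effect is visible; with the real embeddings the same operations apply to 300-dimensional arrays.

```python
import numpy as np

# Invented 3-D vectors for illustration only.
word_embeddings = {
    "France": np.array([1.0, 1.0, 0.0]),
    "Paris": np.array([1.1, 0.0, 1.0]),
    "Spain": np.array([2.0, 1.0, 0.1]),
    "Madrid": np.array([2.1, 0.1, 1.0]),
}

# Direction that leads from a capital city to its country.
capital = word_embeddings["France"] - word_embeddings["Paris"]

# Move from Madrid along that direction.
country = word_embeddings["Madrid"] + capital

# The result lands near 'Spain' but is not identical to it.
gap = np.linalg.norm(country - word_embeddings["Spain"])
print(country, gap)
```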
So, we have to look for the word in the embedding whose vector is closest to the candidate country. If the word embedding works as expected, the most similar word should be 'Spain'. Let us define a function that helps us do this. We will store our word embedding as a DataFrame, which facilitates lookup operations based on the numerical vectors.
Now let us find the name that corresponds to our numerical country:
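A minimal sketch of such a lookup is shown here. It is not the notebook's own function: it uses a plain NumPy matrix in place of the DataFrame mentioned above, the name find_closest_word and its parameters are assumptions, and the 3-dimensional vectors are invented.

```python
import numpy as np

# Invented 3-D vectors for illustration only.
word_embeddings = {
    "France": np.array([1.0, 1.0, 0.0]),
    "Paris": np.array([1.1, 0.0, 1.0]),
    "Spain": np.array([2.0, 1.0, 0.1]),
    "Madrid": np.array([2.1, 0.1, 1.0]),
}

# One row per word, in the same order as the list of words.
words = list(word_embeddings)
matrix = np.stack([word_embeddings[w] for w in words])

def find_closest_word(vector, words=words, matrix=matrix):
    """Return the word whose embedding has the smallest squared
    Euclidean distance to the given vector."""
    diff = matrix - vector                      # broadcast over rows
    squared_distance = np.sum(diff * diff, axis=1)
    return words[int(np.argmin(squared_distance))]

country = word_embeddings["Madrid"] + (
    word_embeddings["France"] - word_embeddings["Paris"]
)
closest = find_closest_word(country)
print(closest)
```

Squared distance is enough here because taking the square root would not change which row is the minimum.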
Predicting other Countries
However, it does not always work.
Represent a sentence as a vector
A whole sentence can be represented as a vector by summing the vectors of all the words that make up the sentence. Let us see.
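A minimal sketch of this summation, with invented 3-dimensional vectors. Skipping words that are missing from the vocabulary is an assumption made here for robustness, and the function name sentence_to_vector is hypothetical.

```python
import numpy as np

# Invented 3-D vectors for illustration only.
word_embeddings = {
    "Canada": np.array([1.0, 0.0, 2.0]),
    "oil": np.array([0.5, 1.0, 0.0]),
    "city": np.array([0.0, 2.0, 1.0]),
}

def sentence_to_vector(sentence, embeddings, dim=3):
    """Sum the vectors of the whitespace-separated words in a sentence.
    Words without an embedding are skipped (an assumption of this sketch)."""
    total = np.zeros(dim)
    for word in sentence.split():
        vector = embeddings.get(word)
        if vector is not None:
            total += vector
    return total

sentence_vector = sentence_to_vector("Canada oil city unknownword", word_embeddings)
print(sentence_vector)
```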
Congratulations! You have finished the introduction to word embeddings manipulation!