Path: blob/master/Natural Language Processing with Classification and Vector Spaces/Week 4 - Machine Translation and Document Search/NLP_C1_W4_lecture_nb_02.ipynb
15840 views
Hash functions and multiplanes
In this lab, we are going to practice the most important concepts related to the hash functions explained in the videos. You will be using these in this week's assignment.
A key point for the lookup using hash functions is the calculation of the hash key or bucket id that we assign for a given entry. In this notebook, we will cover:
Basic hash tables
Multiplanes
Random planes
Basic Hash tables
Hash tables are data structures that allow indexing data to make lookup tasks more efficient. In this part, you will see the implementation of the simplest hash function.
In the next cell, we will define a straightforward hash function for integer numbers. The function will receive a list of integer numbers and the desired amount of buckets. The function will produce a hash table stored as a dictionary, where keys contain the hash keys, and the values will provide the hashed elements of the input list.
The hash function is just the remainder of the integer division between each element and the desired number of buckets.
Now let's see the hash table function in action. The pretty print function (pprint()
) will produce a visually appealing output.
In this case, the bucket key must be the rightmost digit of each number.
Planes
Multiplanes hash functions are other types of hash functions. Multiplanes hash functions are based on the idea of numbering every single region that is formed by the intersection of n planes. In the following code, we show the most basic forms of the multiplanes principle. First, with a single plane:
The first thing to note is that the vector that defines the plane does not mark the boundary between the two sides of the plane. It marks the direction in which you find the 'positive' side of the plane. Not intuitive at all!
If we want to plot the separation plane, we need to plot a line that is perpendicular to our vector P
. We can get such a line using a rotation matrix.
Feel free to change the direction of the plane P
.
Now, let us see what is inside the code that color the points.
The function below checks in which side of the plane P is located the vector v
Hash Function with multiple planes
In the following section, we are going to define a hash function with a list of three custom planes in 2D.
The next function creates a hash value based on a set of planes. The output value is a combination of the side of the plane where the vector is localized with respect to the collection of planes.
We can think of this list of planes as a set of basic hash functions, each of which can produce only 1 or 0 as output.
Random Planes
In the cell below, we create a set of three random planes
The next function is similar to the side_of_plane()
function, but it evaluates more than a plane each time. The result is an array with the side of the plane of v
, for the set of planes P
Get the side of the plane of the vector [2, 2]
for the set of random planes.
Now, let us use the former function to define our multiplane hash function
Print the bucket hash for the vector v = [2, 2]
.
Note
This showed you how to make one set of random planes. You will make multiple sets of random planes in order to make the approximate nearest neighbors more accurate.
Document vectors
Before we finish this lab, remember that you can represent a document as a vector by adding up the word vectors for the words inside the document. In this example, our embedding contains only three words, each represented by a 3D array.
Congratulations! You've now completed this lab on hash functions and multiplanes!