Approximate Nearest Neighbor Search with Navigable Small World
Performing nearest neighbor search on embeddings has become a crucial step in many applications, such as similar image/text search. The ann benchmark contains benchmarks of various approximate nearest neighbor search algorithms/libraries, and in this document, we'll take a look at one of them, the Navigable Small World graph.
Data Preparation and Model
For the embedding, we'll be training a fasttext multi-label text classification model ourselves and using its output embedding for this example. The fasttext library has already been introduced in another post, hence we won't be going over it in detail. Readers can also swap out the data preparation and model section with an embedding of their liking.
Given the output embedding matrix, we would like to compute the nearest neighbors for each of its vectors.
For those more interested in using some other embeddings, replace `index_factors` with the embedding and `query_factors` with a random element from that set of embeddings, and the rest of the document should still function properly.
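For instance, to follow along without training the fasttext model, a random embedding matrix like the sketch below works just as well (the shapes here are arbitrary placeholders):

```python
import numpy as np

# any 2d array of embeddings will do; 1000 vectors of dimension 100 is a placeholder
index_factors = np.random.rand(1000, 100).astype(np.float32)

# use a random element from the embeddings as the query
query_factors = index_factors[np.random.randint(index_factors.shape[0])]
```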
Navigable Small World
We'll start off by formally defining the problem: $k$-nearest neighbor search is a problem where, given a query object $q$, we need to find the $k$ closest objects from a fixed set of objects $S \subseteq D$, where $D$ is the set of all possible objects at hand.
The idea behind navigable small world is to use a graph structure $G(V, E)$ to represent these objects $S$, where every object $o_i$ is represented by a vertex/node $v_i$. The navigable small world graph is constructed by sequentially adding all elements: for every new element, we find the set of its closest neighbors using a variant of the greedy search algorithm, and upon doing so, we introduce bidirectional connections between that set of neighbors and the incoming element.
Upon building the graph, searching for the closest objects to a query $q$ is very similar to adding objects to the graph, i.e. it involves traversing the graph to find the closest vertices/nodes using the same variant of greedy search that's used when constructing the graph.
Another thing worth noting is that determining the closest neighbors depends on a distance function. As the algorithm doesn't make any strong assumptions about the data, it can be used with any distance function of our liking. Here, we'll be using cosine distance as an illustration.
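As a quick reference, a minimal numpy version of the cosine distance we'll be using throughout:

```python
import numpy as np

def cosine_distance(v1, v2):
    """Cosine distance = 1 - cosine similarity; smaller means more similar."""
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```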
In the original paper, the author used the term "friends" for vertices that share an edge, and "friend list" of vertex $v_i$ for the list of vertices that share a common edge with vertex $v_i$.
We'll now introduce the variant of greedy search that the algorithm uses.
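The paper presents this as pseudocode; below is a rough Python sketch of the same procedure, where the `Node` class, the `graph` dict (mapping a node id to its `Node`), and the `distance` callable are our own assumed representation rather than anything prescribed by the paper:

```python
class Node:
    """Graph node holding an embedding vector and its 'friend list' (assumed structure)."""

    def __init__(self, vector):
        self.vector = vector
        self.friends = set()

def greedy_search(graph, query, entry, distance):
    """Move to the closest friend of the current node until no friend is closer."""
    current = entry
    current_dist = distance(query, graph[current].vector)
    improved = True
    while improved:
        improved = False
        for friend in graph[current].friends:
            friend_dist = distance(query, graph[friend].vector)
            if friend_dist < current_dist:
                current, current_dist = friend, friend_dist
                improved = True

    return current, current_dist
```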
Starting from some entry point (chosen at random at the beginning), the greedy search algorithm computes the distance from the input query to each of the current vertex's friends. If the distance between the query and a friend vertex is smaller than the current one, the algorithm moves to that friend vertex and repeats the process, until it can't find a friend vertex that is closer to the query than the current vertex.
This approach can of course get stuck in a local minimum, i.e. the closest vertex/object determined by this greedy search might not be the actual closest element to the incoming query. Hence, the idea to extend this is to pick a series of entry points, denoted by `m` in the sketch below, and return the best result from all those greedy searches. With each additional search, the chance of not finding the true nearest neighbors should decrease exponentially.
The key idea behind the knn search is: given a random entry point, it iterates over the vertices closest to the query that we've never previously visited, and keeps greedily exploring the neighborhood until the nearest elements can't be improved upon. This process then repeats for the next random entry point.
We'll be using the `heapq` module as our priority queue.
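A sketch of this knn search, again over the assumed `Node`/`graph` representation from above (the stopping rule and heap bookkeeping follow the description here, not the paper's exact pseudocode):

```python
import heapq
import random

def knn_search(graph, query, k, m, distance):
    """Greedy knn search with m random restarts.

    graph : dict of node id -> Node (assumed structure from the sketch above)
    k : number of nearest neighbors to return
    m : number of random entry points (restarts)
    """
    visited = set()
    results = []  # max-heap of (negative distance, node id), capped at size k
    node_ids = list(graph)
    for _ in range(m):
        entry = random.choice(node_ids)
        if entry in visited:
            continue

        visited.add(entry)
        dist = distance(query, graph[entry].vector)
        candidates = [(dist, entry)]  # min-heap of nodes to expand
        heapq.heappush(results, (-dist, entry))
        if len(results) > k:
            heapq.heappop(results)

        while candidates:
            dist, node = heapq.heappop(candidates)
            # stop once the closest unexpanded candidate can't improve the top k
            if len(results) == k and dist > -results[0][0]:
                break

            for friend in graph[node].friends:
                if friend in visited:
                    continue

                visited.add(friend)
                friend_dist = distance(query, graph[friend].vector)
                heapq.heappush(candidates, (friend_dist, friend))
                heapq.heappush(results, (-friend_dist, friend))
                if len(results) > k:
                    heapq.heappop(results)

    # unpack the heap into (distance, node id) pairs sorted from closest
    return sorted((-neg_dist, node) for neg_dist, node in results)
```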
Now that we've implemented the knn search algorithm, we can go back, modify the graph building function, and use knn search to implement the actual way of building the navigable small world graph.
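A minimal sketch of that construction, where `f` (number of friends to connect) and `m` (number of search restarts) are hypothetical parameter names rather than the paper's notation:

```python
def build_graph(index_factors, f=5, m=10):
    """Sequentially insert every vector, wiring it to its f nearest existing nodes."""
    graph = {}
    for i, vector in enumerate(index_factors):
        node = Node(vector)
        if graph:
            # reuse knn_search to find the closest nodes already in the graph
            neighbors = knn_search(graph, vector, k=f, m=m, distance=cosine_distance)
            for _, neighbor in neighbors:
                # introduce bidirectional edges between the new node and its neighbors
                node.friends.add(neighbor)
                graph[neighbor].friends.add(i)

        graph[i] = node

    return graph

graph = build_graph(index_factors)
```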
Hnswlib
We can check our result against a more advanced variant of the algorithm, Hierarchical Navigable Small World (HNSW), provided by hnswlib. The idea is very similar to the skip list data structure, except we now replace its linked lists with navigable small world graphs. Although we never formally introduced this new hierarchical variant, hopefully all of its major parameters should look familiar:
- `ef`: The algorithm searches for the `ef` closest neighbors to the inserted element $q$ (this plays the same role as the number of closest neighbors searched for during insertion in the original navigable small world paper). These `ef` closest neighbors then become the candidate/recall set, either for inserting bidirectional edges during the insertion/construction phase (in which case the parameter is termed `ef_construction`), or, after we're done constructing our graph, for finding the actual top k closest elements to the input query object.
- `M`: After choosing the `ef_construction` candidate objects, edges are created between the inserted element and only the `M` closest ones, i.e. it controls the number of bi-directional links.
The actual process of constructing HNSW and performing knn search is a bit more involved than for the vanilla navigable small world; we won't be getting into all the gory details in this post.
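Using the library, however, takes only a few lines. Here's a minimal hnswlib sketch over the embeddings from earlier (the parameter values are arbitrary placeholders, not tuned recommendations):

```python
import hnswlib

num_elements, dim = index_factors.shape

# declare an index over cosine space ('l2' and 'ip' are also supported)
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(index_factors)

# ef should be at least k at query time; higher values trade speed for recall
index.set_ef(50)
labels, distances = index.knn_query(query_factors.reshape(1, -1), k=5)
```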
Based on the ann benchmark, Hierarchical Navigable Small World (HNSW) stood out as one of the top-performing approximate nearest neighbor algorithms at the time of writing this document. Here, we introduced the vanilla variant of that algorithm, Navigable Small World, and matched its results against a more robust implementation from the open-source library hnswlib.