Approximate Nearest Neighbor Search
Approximate nearest neighbor (ANN) search is useful when we have a large dataset with hundreds of thousands, millions, or billions of data points and, for a given data point, we wish to find its nearest neighbors. There are many use cases for this type of method; the one we'll focus on here is finding similar vector representations, so think of algorithms such as matrix factorization or word2vec that compress our original data into embeddings, also called latent factors. Throughout the notebook, "similar" will refer to two vectors' cosine distance.
There are many open-source implementations we can try to see whether one solves our problem, but the question is always: which one is better? The following GitHub repo contains thorough benchmarks of various open-source implementations. GitHub: Benchmarking nearest neighbors.
The goal of this notebook is to show how to run a quicker benchmark ourselves without all that complexity. The repo listed above benchmarks multiple algorithms on multiple datasets using multiple hyperparameters, which can take a very long time. We will pick one open-source implementation that has been identified as a solid choice and walk through the process step by step using one dataset.
Setting Up the Data
The first step is to get our hands on some data and split it into a training and a test set. Here we'll be using the GloVe vector representations trained on a Twitter dataset.
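As a rough sketch of this step (the exact GloVe file name and path depend on which download you use, so treat those as assumptions), we can parse the space-separated GloVe text format into a numpy matrix and carve off a held-out test split:

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe-format lines ("word v1 v2 ...") into words and a float matrix."""
    words, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(' ')
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, np.asarray(vectors, dtype=np.float32)

def train_test_split(X, test_size=0.2, seed=42):
    """Shuffle row indices and split off a held-out test portion."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    return X[idx[n_test:]], X[idx[:n_test]]

# usage with a downloaded GloVe Twitter file (file name is an assumption):
# with open('glove.twitter.27B.100d.txt') as f:
#     words, X = load_glove(f)
# X_train, X_test = train_test_split(X)
```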
Benchmarking an approximate nearest neighbor method involves looking at how much faster it is compared to an exact nearest neighbor method and how much precision/recall we're losing for the speed that was gained. To measure this, we first need to run an exact nearest neighbor method, record how long it takes, and store its results as the ground truth. For example, if our exact method says that for data point 1 the top 3 nearest neighbors, excluding itself, are [2, 4, 1], and our approximate method returns [2, 1, 5], then our precision/recall (depending on which way we're looking at it) would be 66%, since 2 and 1 are both in the ground truth set whereas 5 is not.
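The two ingredients described above, a brute-force exact search under cosine distance and the overlap-based precision metric, can be sketched in a few lines (function names are our own, not from any library):

```python
import numpy as np

def exact_knn(X, query, k):
    """Brute-force top-k neighbors of `query` among rows of X by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize so dot = cosine
    qn = query / np.linalg.norm(query)
    sims = Xn @ qn
    return np.argsort(-sims)[:k]  # indices sorted by descending similarity

def precision(approx_ids, true_ids):
    """Fraction of the ground truth neighbors recovered by the approximate method."""
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# the example from the text: 2 and 1 match, 5 does not -> 2/3
print(precision([2, 1, 5], [2, 4, 1]))
```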
Benchmarking ANN Methods
The library that we'll be leveraging here is nmslib, specifically the algorithm HNSW (Hierarchical Navigable Small World), a graph-based approximate nearest neighbor search method. We will only be using the library and will not be introducing the details of the algorithm in this notebook.
Like a lot of machine learning algorithms, HNSW has hyperparameters that we can tune. We will pick an arbitrary setting for now and look at the influence of each hyperparameter in a later section.
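A minimal sketch of building and querying an HNSW index with nmslib, assuming the library is installed (the specific parameter values here are an arbitrary first pick, as discussed above):

```python
import numpy as np

# an arbitrary first parameter choice; we'll tune these later
index_params = {'M': 30, 'efConstruction': 100, 'post': 0}

def build_hnsw(X, index_params):
    """Build an HNSW index over the rows of X under cosine similarity."""
    import nmslib  # imported lazily so the sketch loads without the library
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(X)
    index.createIndex(index_params, print_progress=False)
    return index

# usage, assuming X_train / X_test from the data-setup step:
# index = build_hnsw(X_train, index_params)
# ids, distances = index.knnQuery(X_test[0], k=10)
```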
We'll first use the first element from the ground truth to showcase what we'll be doing before scaling it to all the data points.
The next few code chunks experiment with different parameters to see which ones work better for this use case.
According to the author of the package, the most influential parameters are M and efConstruction.
efConstruction: Increasing this value improves the quality of the constructed graph and leads to higher search accuracy, at the cost of longer indexing time. The same idea applies to the ef or efSearch parameter that we can pass to query_params. A reasonable range for this parameter is 100-2000.
M: This parameter controls the maximum number of neighbors for each layer. Increasing its value (up to a certain degree) leads to better recall and shorter retrieval times, at the expense of longer indexing time. A reasonable range for this parameter is 5-100.
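To sweep these two parameters, we can enumerate every combination from their recommended ranges and benchmark each one; the particular grid values below are illustrative choices, not prescribed by the library:

```python
import itertools

# hypothetical grid spanning the recommended ranges for M and efConstruction
M_values = [5, 30, 100]
ef_construction_values = [100, 400]

def parameter_grid(Ms, efs):
    """Enumerate every (M, efConstruction) combination to benchmark."""
    return [{'M': m, 'efConstruction': ef, 'post': 0}
            for m, ef in itertools.product(Ms, efs)]

# for each setting we would then build the index, time the build,
# and measure precision against the exact-search ground truth:
# for params in parameter_grid(M_values, ef_construction_values):
#     ... build index with params, record build time and precision ...
```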
Other parameters include indexThreadQty (to explicitly set the number of threads) and post. The post parameter controls the amount of post-processing done to the graph: 0 means no post-processing, and the additional options are 1 and 2 (2 means more post-processing).
Based on the results, we can see that larger values of M and efConstruction do give better precision scores. Another observation is that the result for efConstruction = 100 is on par with efConstruction = 400 while taking only one third of the time to build the index.