Quick Introduction to Okapi BM25
The problem that BM25 (Best Match 25) tries to solve is similar to that of TFIDF (Term Frequency, Inverse Document Frequency), that is representing our text in a vector space (it can be applied to field outside of text, but text is where it has the biggest presence) so we can search/find similar documents for a given document or query.
The gist behind TFIDF is that is relies on two main factors to determine whether a document is similar to our query.
Term Frequency aka tf: how often does the term occur in the document? 3 times? 10 times?
Inverse Document Frequency aka idf: measures how many documents the term appeared in. Inverse document frequency (1/df) then measures how special the term is. Is the term a very rare (occurs in just one doc) word? Or a relatively common one (occurs in nearly all the docs)?
Using these two factors, TFIDF measures the relative concentration of a term in a given piece of document. If the term is common in this article, but relatively rare elsewhere, then the TFIDF score will be high, and documents that have higher TFIDF score would be considered as very relevant to the search term.
BM25 improves upon TFIDF by casting relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant. Instead of going through how the formula was derived, here we'll take a look a the formula and try to digest it to see why it makes some kind of sense.
Gaining Intuition for Okapi BM25
BM25 (Best Match 25) function scores each document in a corpus according to the document's relevance to a particular text query. For a query , with terms , the BM25 score for document is:
where:
is the number of times term occurs in document .
is the number of words in document .
is the average number of words per document.
and are hyperparameters for BM25.
Let's break the formula down into smaller components to see why it makes sense.
First of all, there's and . should match our intuition, as this means the more times the query term appears in the document, the higher the document's score will be. The interesting part is the parameter , which determines the term frequency saturation characteristic. The higher the value, the slower the saturation. And when we say saturation, we are referring to the fact that if terms occurring extra times add extra score. We can observe this diminishing return in term frequency from the graph below.
Then part in the denominator means a document that is longer than the average documents will result in a bigger denominator, resulting in a decrease in the score. The intuition is that the more terms in the document that does not match our input query - the lower the document's score should be. In other words, if a 300 page long document mentions the query term once, it's less likely to have as much to do with the query compared to a short tweet which mentions query once.
From the graph above, we can see that shorter documents hit the asymptote much faster. Hopefully, this resembles our intuition as the more matches we have on shorter documents, the more certain we are about the relevance, whereas, for a lengthy book, it might take us longer to get to a point where we feel confident that the book is indeed relevant to the given query.
The parameter (bound 0.0 ~ 1.0) in the denominator is multiplied by the ratio of the document length we just discussed. If is bigger, the effects of the document length compared to the average length are more amplified. We can imagine if we set to 0, the effect of the length ratio would be completely nullified.
As for the inverse document frequency part, . For a corpus with documents, inverse document frequency for term is computed as follows:
where
is the number of documents in the corpus that contain term .
The inverse document frequency part is very similar to that of TFIDF, whose role is to make sure that rarer words will have a higher score and contribute more to the final score.
Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be . Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.
Add a 1 to the equation.
For term that resulted in a negative IDF value, swap it with an small positive value, usually denoted as
Like all hyperparameters in general, the default ones are usually a good starting point, and we should probably focus on tweaking other stuff before jumping into the rabbit hole of hyperparameter tuning. In the context of search, it might be making sure our ranking scores older documents lower in application such as news ranking. But if we were to start tuning, remember to always measure the performance of various settings and the following questions are general starting points that we can reference to.
For , we should be asking, "when do we think a term is likely to be saturated?" For very long documents like books, it's very likely to have a lot of different terms appear several times in a work, even when the term isn't the primary subject of the document. We may not want terms to be saturated as quickly in this situation, so the suggestion is that should generally trend toward larger numbers when the text is a lot longer and more diverse. On the opposite side of things, it's been suggested to set on the lower side. It's very unlikely that a collection of short tweets would have a term multiple times without being highly related to that term.
For , we should be asking, "when do we think a document is likely to be very long, and when should that hinder its relevance to a term?" Documents which are highly specific like engineering specifications or patents are lengthy in order to be more specific about a subject. Their length is unlikely to be detrimental to the relevance and lower may be more appropriate. On the other end of the spectrum, documents which touch on several different topics in a broad way — news articles (a political article may touch on economics, international affairs, and certain corporations), user reviews, etc. — often benefit by choosing a larger so that irrelevant topics to a user's search, including spam and the like, are penalized.
Implementation
In the code chunk above, we printed each corpus's BM25 relevance score along with the original text, note that this isn't sorted in decreasing order of the relevance score yet, which is usually what we want to do in a real world setting. That is to find the more relevant document, sort them in decreasing order and present them to the user. Also here, we are computing the scores for every document, this becomes computationally expensive when we start have a large corpus size. Thus search engine uses inverted index to speed things up. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears, this allows us to quickly find the documents that contains the term in our query and only then do we compute the relevance score for this smaller recall set. This link contains a good high level description of this concept.
ElasticSearch BM25
We can see BM25 in action to rank documents using ElasticSearch, this notebook isn't an ElasticSearch tutorial, so hopefully, the reader are some what familiar with the tool, if not, each code chunk contains links to some helpful references.
We will follow the standard process of creating the index to store our documents, add some sample documents to the index and provide a query against the index to return the relevant documents sorted in decreasing order based on the relevance score, which will be based on BM25.