Collaborative Filtering for Implicit Feedback Datasets
One common scenario in real-world recommendation systems is that we only have implicit rather than explicit user-item interaction data. To elaborate, a user may be searching for an item on the web, or listening to songs. Unlike rating data, where we have direct access to the user's preference towards an item, these types of actions do not explicitly state or quantify any preference of the user for the item; instead they give us implicit confidence about the user's opinion.
Even when we do have explicit data, it might still be a good idea to incorporate implicit data into the model. Consider, for example, listening to songs. When users listen to music on a streaming service, they rarely rate the songs they like or dislike. More often they skip a song, or listen only halfway through it if they dislike it. If they really like a song, they will often come back and listen to it again. So, to infer a user's musical taste profile, their listens, repeat listens, skips, fraction of tracks listened to, etc. might be far more valuable signals than explicit ratings.
Formulation
Recall from the previous notebook that the loss function for training the recommendation model on explicit feedback data was:

$$L = \sum_{u,i \in S} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \| x_u \|^2 + \sum_i \| y_i \|^2 \right)$$

Where:

- $r_{ui}$ is the true rating given by user $u$ to item $i$
- $x_u$ and $y_i$ are user $u$'s and item $i$'s latent factors, both $d$-dimensional, where $d$ is the number of latent factors that the user can specify
- $S$ is the set of all observed user-item ratings
- $\lambda$ controls the regularization strength that prevents overfitting the user and item vectors
To keep it concrete, let's assume we're working with music data and that the values of $r_{ui}$ are implicit ratings counting the number of times a user has listened to a song (song listen count). The new formulation then becomes:

$$L = \sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \| x_u \|^2 + \sum_i \| y_i \|^2 \right)$$
Recall that with implicit feedback, we no longer have ratings; rather, we have users' preferences for items. Therefore, in the new loss function, the rating $r_{ui}$ has been replaced with a preference $p_{ui}$ indicating the preference of user $u$ for item $i$. $p_{ui}$ is a set of binary variables and is computed by binarizing $r_{ui}$.
We make the assumption that if a user has interacted at all with an item ($r_{ui} > 0$), then we set $p_{ui} = 1$ to indicate that user $u$ has a liking/preference for item $i$. Otherwise, we set $p_{ui} = 0$. However, these assumptions come with varying degrees of confidence. First of all, when $p_{ui} = 0$, we assume that it should be associated with a lower confidence, as there are many reasons beyond disliking the item as to why the user has not interacted with it, e.g. being unaware of its existence. On the other hand, as the amount of implicit feedback, $r_{ui}$, grows, we have a stronger indication that the user does indeed like the item (regardless of whether he/she is buying a gift for someone else). So to measure the level of confidence mentioned above, we introduce another set of variables $c_{ui}$ that measures our confidence in observing $p_{ui}$:

$$c_{ui} = 1 + \alpha r_{ui}$$
Where the 1 ensures we have some minimal confidence for every user-item pair, and as we observe more and more implicit feedback (as $r_{ui}$ gets larger and larger), our confidence in $p_{ui} = 1$ increases accordingly. The term $\alpha$ is a parameter that we have to specify to control the rate of the increase. This formulation makes intuitive sense when we look back at the term $c_{ui} \left( p_{ui} - x_u^T y_i \right)^2$ in the loss function. A larger $c_{ui}$ means that the prediction $x_u^T y_i$ has to be that much closer to $p_{ui}$ so that the term does not contribute too much to the total loss.
The implementation in the later section will be based on the formula above, but note that there are many ways in which we can tune this formulation. For example, we can derive $p_{ui}$ from $r_{ui}$ differently: instead of setting the binary cutoff at 0, we can set it at another threshold that we feel is appropriate for the domain. Similarly, there are many ways to transform $r_{ui}$ into the confidence level $c_{ui}$, e.g. we can use:

$$c_{ui} = 1 + \alpha \log \left( 1 + r_{ui} / \epsilon \right)$$
Regardless of the scheme, it's important to realize that we are transforming the raw observation $r_{ui}$ into two distinct representations: the preference $p_{ui}$ and the confidence level of that preference, $c_{ui}$.
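To make these two representations concrete, here's a minimal sketch of how raw counts could be turned into a preference and a confidence matrix, assuming the counts live in a scipy sparse matrix and using the linear confidence scheme with a hypothetical `alpha` value:

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical listen counts r_ui for 3 users and 4 songs
raw = csr_matrix(np.array([
    [0, 5, 0, 1],
    [2, 0, 0, 0],
    [0, 0, 9, 3],
]), dtype=np.float64)

alpha = 40.0  # confidence scaling parameter, needs tuning for the dataset

# preference p_ui: binarize the raw counts (only stored non-zeros are touched)
preference = raw.copy()
preference.data = np.ones_like(preference.data)

# confidence c_ui = 1 + alpha * r_ui for the observed entries; the baseline
# confidence of 1 for unobserved entries is handled implicitly by the solver
confidence = raw.copy()
confidence.data = 1.0 + alpha * confidence.data
```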
Alternating Least Squares
Let's assume we have $m$ users and $n$ items. Now, to solve the loss function above, we start by treating $Y$ as constant and solving the loss function with respect to $x_u$. To do this, we rewrite and expand the first term in the loss function (excluding the regularization terms), the $\sum_i c_{ui} \left( p_{ui} - x_u^T y_i \right)^2$ part, as:

$$\left( p(u) - Y x_u \right)^T C^u \left( p(u) - Y x_u \right)$$
Where:

- $Y \in \mathbb{R}^{n \times d}$ represents all item row vectors vertically stacked on top of each other
- $p(u) \in \mathbb{R}^{n}$ contains elements $p_{ui}$, all of the preferences of the user
- The diagonal matrix $C^u \in \mathbb{R}^{n \times n}$ consists of $c_{ui}$ in row/column $i$, which is the user's confidence across all the items. e.g. the matrix $C^u$ for user $u$ will look like:

$$C^u = \begin{bmatrix} c_{u1} & 0 & \cdots & 0 \\ 0 & c_{u2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & c_{un} \end{bmatrix}$$
The formula above can also be used to monitor the loss function at each iteration. If we stack all the user factors into $X$ and all the item factors into $Y$, the last two terms can be rewritten as $\lambda \left( \| X \|_F^2 + \| Y \|_F^2 \right)$. As for the first term, we can leverage the fact that $p_{ui}$ is 1 for all positive items and just sum the confidence-weighted error $c_{ui} \left( 1 - x_u^T y_i \right)^2$ over those items.
Now for the derivation of the partial derivative. Setting the gradient of the loss with respect to $x_u$ to zero gives:

$$\frac{\partial L}{\partial x_u} = -2 Y^T C^u \left( p(u) - Y x_u \right) + 2 \lambda x_u = 0$$

$$x_u = \left( Y^T C^u Y + \lambda I \right)^{-1} Y^T C^u p(u)$$
The main computational bottleneck in the expression above is the need to compute $Y^T C^u Y$ for every user. A speedup can be obtained by re-writing the expression as:

$$Y^T C^u Y = Y^T Y + Y^T \left( C^u - I \right) Y$$
Now the term $Y^T Y$ becomes independent of each user and can be computed once per iteration. Next, notice that $C^u - I$ has only $n_u$ non-zero elements, where $n_u$ is the number of items for which $r_{ui} > 0$. Similarly, $C^u p(u)$ contains only $n_u$ non-zero elements since $p(u)$ is a binary transformation of $r_{ui}$. Thus the final formulation becomes:

$$x_u = \left( Y^T Y + Y^T \left( C^u - I \right) Y + \lambda I \right)^{-1} Y^T C^u p(u)$$
After solving for $x_u$, the same procedure can be carried out to solve for $y_i$, giving a similar expression:

$$y_i = \left( X^T X + X^T \left( C^i - I \right) X + \lambda I \right)^{-1} X^T C^i p(i)$$
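As a sanity check of the algebra, here's a sketch of the per-user update under the formulation above. The function name and arguments are made up for illustration; it assumes `Y` is a dense NumPy item-factor matrix and `ratings_u` is a single scipy CSR row of raw counts:

```python
import numpy as np

def user_factor_update(Y, ratings_u, alpha, reg):
    """
    Solve for one user's latent factor x_u given the item factors Y.

    Y         : (n_items, n_factors) item factor matrix
    ratings_u : sparse CSR row of raw counts r_ui for this user
    alpha     : confidence scaling, c_ui = 1 + alpha * r_ui
    reg       : regularization strength lambda
    """
    n_factors = Y.shape[1]
    YtY = Y.T @ Y  # in a full solver this is precomputed once per iteration

    # items the user interacted with, and their confidences
    idx = ratings_u.indices
    c = 1.0 + alpha * ratings_u.data
    Y_obs = Y[idx]

    # A = Y^T Y + Y^T (C^u - I) Y + lambda * I, using only the non-zero rows
    A = YtY + (Y_obs.T * (c - 1.0)) @ Y_obs + reg * np.eye(n_factors)
    # b = Y^T C^u p(u), with p(u) equal to 1 on the observed items
    b = (Y_obs.T * c).sum(axis=1)

    # solve A x_u = b rather than forming the matrix inverse explicitly
    return np.linalg.solve(A, b)
```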
Implementation
We'll use the same movielens dataset as in the previous notebook. The movielens data is not an implicit feedback dataset, as the users did provide explicit ratings, but we will use it for now to test out our implementation. The overall preprocessing procedure of loading the data and doing the train/test split is the same as in the previous notebook, but here we'll do it in a sparse matrix fashion.
The following train and test set split function assumes that you're doing a train/test split within the current dataset. It's probably better, though, to use time to perform the train/test split. For example, using 2016's data for training and the first month of 2017's data for testing.
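As a rough illustration of the in-dataset approach, a masking-based split on a sparse matrix might look like the sketch below (the function name and the per-user loop are only for illustration; a vectorized version would be faster):

```python
import numpy as np
from scipy.sparse import csr_matrix

def create_train_test(ratings, test_size=0.2, seed=1234):
    """
    Mask a fraction of each user's interactions to form the test set.

    ratings : scipy sparse user-item matrix of interaction counts
    returns train/test matrices of the same shape; the test entries
    are zeroed out of the train matrix
    """
    rng = np.random.default_rng(seed)
    train = ratings.copy().tolil()
    test = csr_matrix(ratings.shape).tolil()

    for u in range(ratings.shape[0]):
        interacted = ratings.getrow(u).indices
        n_test = int(np.ceil(test_size * interacted.size))
        if n_test == 0:
            continue
        test_items = rng.choice(interacted, size=n_test, replace=False)
        for i in test_items:
            test[u, i] = ratings[u, i]
            train[u, i] = 0

    return train.tocsr(), test.tocsr()
```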
The following implementation uses some tricks to speed up the procedure. First of all, when we need to solve $Ax = b$, where $A$ is an $n \times n$ matrix, a lot of books might write the solution as $x = A^{-1} b$; however, in practice there is hardly ever a good reason to calculate it that way, as solving the linear system directly is faster than computing $A^{-1}$.
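For example, with NumPy this means preferring `np.linalg.solve` over explicitly forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)

# preferred: solve the linear system directly
x1 = np.linalg.solve(A, b)

# discouraged: explicitly form the inverse, slower and less numerically stable
x2 = np.linalg.inv(A) @ b

print(np.allclose(x1, x2))  # True (up to floating point error)
```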
The next trick is the idea of computing a matrix product of the form $Y^T Y$ as a sum of outer products of the rows of $Y$:

$$Y^T Y = \sum_{i=1}^{n} y_i y_i^T$$
The reason why this can speed things up is that the matrix product is now the sum of the outer products of the rows, where each row's computation is independent of the others and can be carried out in a parallelized fashion, then added back together!
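A quick sanity check of this identity with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 3))  # 6 items, 3 latent factors

# direct matrix product
gram = Y.T @ Y

# same result as the sum of per-row outer products, each of which
# could be computed independently (and in parallel), then summed
gram_outer = sum(np.outer(row, row) for row in Y)

print(np.allclose(gram, gram_outer))  # True
```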
Last but not least is exploiting the properties of scipy's Compressed Sparse Row matrix to access the non-zero elements. For those who are unfamiliar with it, the following link has a pretty decent quick introduction. Blog: Empty rows in sparse arrays.
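For instance, the non-zero column indices and values of a single row can be pulled straight out of the `indptr`, `indices` and `data` arrays without densifying anything:

```python
import numpy as np
from scipy.sparse import csr_matrix

mat = csr_matrix(np.array([
    [0, 2, 0, 5],
    [0, 0, 0, 0],
    [3, 0, 1, 0],
]))

u = 2
# non-zero column indices and values of row u
cols = mat.indices[mat.indptr[u]:mat.indptr[u + 1]]
vals = mat.data[mat.indptr[u]:mat.indptr[u + 1]]
print(cols, vals)  # [0 2] [3 1]
```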
Ranking Metrics
Now that we've built our recommendation engine, the next important thing to do is to evaluate our model's offline performance.
Let's say that there are some users and some items, like movies, songs or jobs. Each user might be interested in some items. The client asks us to recommend a few items (the number is x) for each user. In other words, what we're after is the top-N recommendation for each user, and after recommending these items to the user, we need a way to measure whether the recommendation is useful or not. One key thing to note is that metrics such as RMSE might not be the best at assessing the quality of recommendations, because the training focuses on the items with the most ratings and achieves a good fit for those, while the items with few ratings contribute little to the loss; as a result, predictions for them will be off.
Long story short, we need metrics specifically crafted for ranking evaluation and the two most popular ranking metrics are MAP (Mean Average Precision) and NDCG (Normalized Discounted Cumulative Gain). The main difference between the two is that MAP assumes binary relevance (an item is either of interest or not, e.g. a user clicked a link, watched a video, purchased a product), while NDCG can be used in any case where we can assign relevance score to a recommended item (binary, integer or real). The relationship is just like classification and regression.
Mean Average Precision
A large portion of this section is based on the excellent blog post at the following link. Blog: What you wanted to know about Mean Average Precision. This documentation builds on top of it by carrying out an educational implementation.
As described above, after recommending a few items to each user, we need a way to measure whether the recommendation is useful or not. One way to do this is using MAP@k (Mean Average Precision at k).
The intuition behind this evaluation metric is that:

- We can recommend at most k items for each user (this is essentially the @k part), but we will be penalized for bad recommendations
- Order matters, so it's better to submit more certain recommendations first, followed by recommendations we are less sure about
Diving a bit deeper, we will first ignore the @k and get the M (mean) out of the way. MAP is just the mean of APs, or average precision, over all users. If we have 1000 users, we sum the AP for each user and divide the sum by 1000. This is MAP. So now, what is AP, or average precision?
One way to understand average precision is this: we type something into Google and it shows us 10 results. It's probably best if all of them are relevant. If only some are relevant, say five of them, then it's much better if the relevant ones are shown first. It would be bad if the first five were irrelevant and the good ones only started from the sixth, wouldn't it? The formula for computing AP is:
$$AP@k = \sum_{i=1}^{k} (\text{precision at } i) \times (\text{change in recall at } i)$$
Where precision at i is the percentage of correct items among the first i recommendations. Change in recall is 1/k if the item at position i is correct (for every correct item), otherwise zero. Note that this is based on the assumption that the number of relevant items, r, is greater than or equal to k: r >= k. If not, the change in recall is 1/r for each correct item instead of 1/k.
For example, if the actual items were [1, 2, 3, 4, 5] and we recommend [6, 4, 7, 1, 2], then we get 4, 1 and 2 right, but with some incorrect guesses in between. Now let's say we were to compute AP@2, so only the first two predictions matter: 6 and 4. The first is wrong, so precision@1 is 0. The second is right, so precision@2 is 0.5. The change in recall is 0 and 0.5 (that's 1/k) respectively, so AP@2 = 0 * 0 + 0.5 * 0.5 = 0.25.
After computing the average precision for this individual user, we compute it for every single user and take the mean of these values; that is essentially our MAP@k, mean average precision at k. For this metric, the bigger the better.
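A minimal educational implementation of AP@k and MAP@k that matches the description above (the helper names here are my own, not from any particular library) might look like:

```python
import numpy as np

def apk(actual, predicted, k):
    """Average precision at k for a single user."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, item in enumerate(predicted):
        if item in actual and item not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)  # precision at position i, counted on hits only

    # divide by the number of relevant items, capped at k
    return score / min(len(actual), k)

def mapk(actual_list, predicted_list, k):
    """Mean average precision at k over all users."""
    return np.mean([apk(a, p, k) for a, p in zip(actual_list, predicted_list)])

# the worked example above: actual [1, 2, 3, 4, 5], predicted [6, 4, 7, 1, 2]
print(apk([1, 2, 3, 4, 5], [6, 4, 7, 1, 2], k=2))  # 0.25
```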
Note that it's normal for this metric to be low. We can compare it with a baseline to get a sense of how well the algorithm is performing, and a nice baseline for a recommendation engine is to simply recommend every user the most popular items (the items with the most user interactions).
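A minimal version of that popularity baseline, assuming the training data is a user-item sparse matrix of interaction counts, could be:

```python
import numpy as np

def popularity_recommendations(train, k):
    """Recommend the same k most interacted-with items to every user."""
    item_counts = np.asarray(train.sum(axis=0)).ravel()
    top_items = np.argsort(-item_counts)[:k]
    # a stricter baseline would filter out items each user already interacted with
    return [top_items.tolist() for _ in range(train.shape[0])]
```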
NDCG
Suppose that on a four-point scale, we give a score of 0 for an irrelevant result, 1 for a partially relevant one, 2 for a relevant one, and 3 for a perfect one. Suppose also that a query is judged by one of our judges, and the first four results the search engine returns are assessed as relevant (2), irrelevant (0), perfect (3), and relevant (2).
The idea behind NDCG is: a recommender returns some items and we'd like to compute how good the list is. Each item has a relevance score, usually a non-negative number. That's gain (G). For items we don't have user feedback for, we usually set the gain to zero. Now we add up those scores; that's cumulative gain (CG). So, the cumulative gain for the four results is the sum of the scores for each result: 2 + 0 + 3 + 2 = 7. In mathematical notation, the CG at a particular rank $k$ is defined as:

$$CG_k = \sum_{i=1}^{k} rel_i$$
Where $rel_i$ is the graded relevance of the result at position $i$. As we can see from the formula, the value computed with this function is unaffected by changes in the ordering of the search results, thus DCG is used in place of CG for a more accurate measure of the usefulness of the results' ranking.
When evaluating rankings, we'd prefer to see the most relevant items at the top of the list, i.e. the first result in our search results is more important than the second, the second more important than the third, and so on. Therefore, before summing the scores we divide each by a growing number, which is the discount (D). One simple way to make position one more important than two (and so on) is to divide each score by the rank. So, for example, if the third result is "great", its contribution is $3 / 3 = 1$ (since the score for "great" is 3, and the rank of the result is 3). If "great" were the first result, its contribution would be $3 / 1 = 3$. In practice, though, it's more common to discount using a logarithm of the item position, giving us:

$$DCG_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2 (i + 1)}$$
There's an alternative formulation of DCG that places stronger emphasis on retrieving relevant documents:

$$DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}$$
This formula is commonly used in industry, including by major web search companies and data science competition platforms such as Kaggle.
The final touch to this metric is the Normalized (N) part. It's not fair to compare DCG values across queries, because some queries are easier than others and result lists vary in length depending on the query, so we normalize them: first, figure out the best possible ranking for the result list and compute the DCG for that, then divide the raw DCG by this ideal DCG to get NDCG@k, a number between 0 and 1. In our previous example, the judged scores were 2, 0, 3 and 2. The best arrangement of these same results would have been 3, 2, 2, 0, that is, if the "great" result had been ranked first, followed by the two "relevant" ones, and then the "irrelevant" one. So we compute the DCG score for the ranking 3, 2, 2, 0 to obtain our ideal DCG (IDCG) and simply perform:
$$NDCG_k = \frac{DCG_k}{IDCG_k}$$

to obtain our final NDCG.
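Putting the pieces together, a small sketch of DCG/NDCG at k using the plain log2-discounted version (the function names are only for illustration):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain using the log2 position discount."""
    relevances = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, relevances.size + 2))  # log2(i + 1) for i = 1..k
    return np.sum(relevances / discounts)

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (best possible) ordering of the same scores."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal

# the example from the text: results judged 2, 0, 3, 2
print(ndcg_at_k([2, 0, 3, 2], k=4))  # ~0.83
```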
The next section modifies the function API a little bit so it becomes more suitable for evaluating the recommendation engine.