Calibrated Recommendations
When a user has watched, say, 70% romance movies and 30% action movies in the past, it is reasonable to expect the personalized list of recommended movies to be comprised of roughly 70% romance and 30% action movies as well, since we would like to cover the user's diverse set of interests. A recommendation list that actually reflects most, if not all, of the user's interests is considered a Calibrated Recommendation. But the question is, do our recommendations exhibit this trait?
A recommendation algorithm provides a personalized user experience based on the user's past historical interactions with the product/system/website. However, when serving recommendations, such as recommending the top 10 movies we think the user might be interested in, a recommendation engine that is solely measured on ranking metrics can easily generate recommendations that focus on the user's main area of interest, resulting in the user's other areas of interest being under-represented, or worse, absent from the final recommendation.
To hit the notion home, using the example above, given a user that has watched 70% romance movies and 30% action movies, if we were to solely measure our recommendations by precision, we could achieve the best performance by predicting the majority genre, i.e. we would recommend 100% romance movies and expect the user to interact with those recommendations 70% of the time. On the other hand, if we were to recommend 70% romance movies and 30% action movies, then we would expect our recommendations to only be correct 0.7 * 0.7 + 0.3 * 0.3 = 58% of the time.
Throughout the rest of this notebook, we will take a look at whether this phenomenon of crowding out the user's sub-interests occurs with our recommendations, develop a quantitative metric to measure how severe the issue is, and implement a post-processing logic that is agnostic of the underlying recommendation algorithm to ensure the recommendations become more calibrated.
Preparation
We'll be using the publicly available movielens-20m dataset throughout this experiment. We can download it via the following link. There are multiple files under that folder; we can select download all to make things easier.
The algorithm we will be using to generate our recommendation is Bayesian Personalized Ranking, which is a matrix factorization based collaborative filtering algorithm. Readers don't need to be acquainted with this model per se to continue with this notebook as the discussion is model-agnostic and we'll be explaining the syntax. That said, this link contains some resources on this algorithm if it is of interest.
The preparation work in the next few code chunks involves the following steps:
rating.csv contains each user's rating for each movie. Here, we will focus on implicit data and follow the usual procedure of simulating binary implicit feedback data (i.e. whether the user enjoyed the movie) by retaining only ratings of 4 stars and higher, while dropping lower ratings.
movie.csv contains each movie's genre tags. We will also eliminate movies that have no genre information attached and create a mapping that stores each movie's genre distribution. In this dataset, each movie typically has several genres associated with it, thus we assign equal probabilities to each genre such that $\sum_g p(g|i) = 1$ for each movie $i$. This genre distribution will play a strong role in determining whether our recommendation is well calibrated or not.
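As a rough sketch of what these steps might look like in code (the file and column names follow the movielens-20m csv files; the `build_genre_distribution` helper name is just illustrative):

```python
import pandas as pd

# read in the ratings and keep only the "enjoyed" interactions (4 stars and up)
# to simulate binary implicit feedback data
rating = pd.read_csv('rating.csv')
rating = rating[rating['rating'] >= 4.0].reset_index(drop=True)

# read in the movies' genre tags, dropping movies without genre information
movie = pd.read_csv('movie.csv')
movie = movie[movie['genres'] != '(no genres listed)']

def build_genre_distribution(movie):
    """Map each movieId to a dict of genre -> probability, assigning
    equal probability to every genre tagged on the movie."""
    movie_genre = {}
    for movie_id, genres in zip(movie['movieId'], movie['genres']):
        genre_list = genres.split('|')
        prob = 1.0 / len(genre_list)
        movie_genre[movie_id] = {genre: prob for genre in genre_list}
    return movie_genre

movie_genre = build_genre_distribution(movie)

# keep only ratings for movies that have genre information
rating = rating[rating['movieId'].isin(movie_genre.keys())].reset_index(drop=True)
```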
Given this dataframe, we will use the userId, movieId and rating columns to construct a sparse matrix, perform a random train/test split (we could split based on time if preferred), and feed the training set into a collaborative filtering based algorithm to train the model, so we can generate item recommendations for users.
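A minimal sketch of this step, assuming the implicit library (the hyperparameter values are placeholders, and depending on the implicit version the fit call may expect a user-item or an item-user matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix
from implicit.bpr import BayesianPersonalizedRanking
from implicit.evaluation import train_test_split

# map the raw ids to contiguous indices and build the user-item sparse matrix,
# using 1 to indicate an observed (positive) interaction
rating['user_index'] = rating['userId'].astype('category').cat.codes
rating['item_index'] = rating['movieId'].astype('category').cat.codes
data = np.ones(len(rating))
user_item = csr_matrix((data, (rating['user_index'], rating['item_index'])))

# random train/test split and model fitting; note that older versions of
# implicit expect an item-user matrix in fit, newer versions a user-item matrix
train, test = train_test_split(user_item, train_percentage=0.8)
bpr = BayesianPersonalizedRanking(factors=32, iterations=50)
bpr.fit(train)
```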
We will look at a precision_at_k metric just to make sure our recommender is reasonable. Feel free to tune the model's hyperparameters to squeeze out performance, but that is not the focus here.
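For the sanity check, something along these lines, assuming implicit's built-in evaluation helper:

```python
from implicit.evaluation import precision_at_k

# precision@k averaged across users; k matches the number of
# recommendations we plan to serve per user
k = 10  # placeholder for the topn parameter
precision = precision_at_k(bpr, train, test, K=k)
print('precision@{}: {:.3f}'.format(k, precision))
```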
Deep Dive Into Calibrated Recommendation
We will take the first user as an example to see whether our recommendations are calibrated or not. Once we're familiar with the procedure for one user, we can repeat the process for all of the users if we'd like to.
Let's start off by defining the problem. We are given the genre distribution $p(g|i)$ of each movie $i$; what we are interested in is whether $p(g|u)$ is similar to $q(g|u)$, where:
$p(g|u)$ is the distribution over genres $g$ of the set of movies played by user $u$ in the past.
$q(g|u)$ is the distribution over genres $g$ of the list of movies we recommend to user $u$.
For these distributions, we can use a weighted version if we'd like to get sophisticated, e.g. $p(g|u)$ can be weighted by recency, saying something like a more recent item/movie interaction matters more, indicating that item/movie's genre should also be weighted more, but let's not go there yet.
Let's first look at some code to generate this information.
For the same user, we can use the .recommend method to generate the topn recommendations for him/her. Note that we also pass in the original sparse matrix, and by default, the items/movies that the user has already played will be filtered from the list (controlled by the filter_already_liked_items argument, which defaults to True).
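Concretely, the call might look something like this (the implicit API changed around version 0.5, so the exact call and return format depend on the installed version):

```python
# use the first user as our running example
user_id = 0
topn = 20  # number of recommendations to generate (placeholder value)

# older versions of implicit take the full user-item matrix and return a
# list of (item_index, score) pairs; newer versions take just the user's row
# and return a tuple of (ids, scores) arrays
reco = bpr.recommend(user_id, train, N=topn, filter_already_liked_items=True)

# the items the user has already interacted with in the training set
interacted = train[user_id].nonzero()[1]
```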
The next code chunk defines a function to obtain the genre distribution for a list of items. Given that we now have the list of interacted items and the list of recommended items, we can pass each to the function to obtain the two genre distributions.
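A sketch of such a helper, assuming items are referenced by the matrix indices used above and `movie_genre` is the movieId-to-genre-distribution mapping built during preparation (`index_to_movie_id` converts between the two; both names are illustrative):

```python
def compute_genre_distribution(items, movie_genre, index_to_movie_id):
    """Aggregate the genre distributions of a list of item indices
    into a single normalized genre distribution."""
    distribution = {}
    for item in items:
        movie_id = index_to_movie_id[item]
        for genre, prob in movie_genre[movie_id].items():
            distribution[genre] = distribution.get(genre, 0.0) + prob

    # normalize so the genre probabilities sum to 1
    total = sum(distribution.values())
    return {genre: prob / total for genre, prob in distribution.items()}


# map item indices back to the original movieId
index_to_movie_id = dict(
    enumerate(rating['movieId'].astype('category').cat.categories))

interacted_distr = compute_genre_distribution(
    interacted, movie_genre, index_to_movie_id)
reco_items = [item for item, _ in reco]  # assuming (item, score) pairs
reco_distr = compute_genre_distribution(
    reco_items, movie_genre, index_to_movie_id)
```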
Calibration Metric
Looking at the results above, we can see that according to $p(g|u)$, the user has interacted with genres such as War and Western; however, they are nowhere to be seen in the topn recommendations, hence we can argue based on this output that our recommendation might not be that well calibrated to the user's past interactions.
To scale this type of comparison, we'll now define our calibration metric $C(p, q)$. There are various methods to compare whether two distributions are similar to each other, and one popular choice is the KL-divergence:

$$C(p, q) = KL(p \| \tilde{q}) = \sum_g p(g|u) \log \frac{p(g|u)}{\tilde{q}(g|u)}$$

The denominator in the formula should be $q(g|u)$, but given that the formula would be undefined whenever $q(g|u) = 0$ and $p(g|u) > 0$ for a genre $g$, we instead use:

$$\tilde{q}(g|u) = (1 - \alpha) \cdot q(g|u) + \alpha \cdot p(g|u)$$

with a small $\alpha$ such as 0.01, so that $\tilde{q}(g|u) \approx q(g|u)$.
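A direct translation of the metric into code might look like this (reusing the genre-distribution dictionaries from the previous step):

```python
import numpy as np

def compute_kl_divergence(interacted_distr, reco_distr, alpha=0.01):
    """KL-divergence between the user's historical genre distribution p
    and the recommendation's genre distribution q, where q is smoothed
    towards p so the metric stays defined for genres missing from the
    recommendation."""
    kl_div = 0.0
    for genre, p in interacted_distr.items():
        q = reco_distr.get(genre, 0.0)
        q_tilde = (1 - alpha) * q + alpha * p
        kl_div += p * np.log(p / q_tilde)

    return kl_div
```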
Generating Calibrated Recommendations
Being able to compute the calibration metric between $p(g|u)$ and $q(g|u)$ is all well and good, but how to generate a recommendation list that is more calibrated becomes the next important and interesting question.
Different recommendation algorithms' objective functions might be completely different, thus instead of going down the hard route of incorporating calibration into the objective function right off the bat and spending two weeks writing a customized algorithm in an efficient manner, we will start with an alternative approach: re-ranking the predicted list of a recommender system in a post-processing step.
To determine the optimal set of recommended items $I^*$, we'll be using maximum marginal relevance:

$$I^* = \underset{I, |I|=N}{\text{argmax}} \; (1 - \lambda) \cdot s(I) - \lambda \cdot C(p, q(I))$$

Where
$s(i)$ is the score of item $i$ predicted by the recommender system and $s(I) = \sum_{i \in I} s(i)$, i.e. the sum of all the items' scores in the recommendation list.
$\lambda \in [0, 1]$ is a tuning parameter that determines the trade-off between the score generated by the recommender and the calibration score. Notice that since the calibration score is measured by KL-divergence, which is a lower-is-better metric, we use its negative in the maximization formula.
Finding the optimal set is a combinatorial optimization problem and can be a topic by itself. We won't do a deep dive into it, but will instead leverage a popular greedy algorithm for submodular optimization to solve the problem. The process is as follows:
We start out with the empty set.
Iteratively append one item at a time: at step $n$, when we already have the set $I_{n-1}$ comprised of $n - 1$ items, the item $i$ that maximizes the objective function defined above for the set $I_{n-1} \cup \{i\}$ is added to obtain $I_n$.
Repeat the process to generate the full set $I$ of size $N$.
From a theoretical standpoint, this greedy procedure guarantees a solution whose score is at least $1 - 1/e \approx 0.63$ of the optimal set's score.
With this information at hand, let's look at the implementation part:
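The original code chunk isn't reproduced here, but a minimal sketch of the greedy re-ranking might look like the following, reusing the `compute_genre_distribution` and `compute_kl_divergence` helpers sketched earlier (the function name `calib_recommend` and the candidate-pool setup are purely illustrative):

```python
import numpy as np

def calib_recommend(scored_items, interacted_distr, movie_genre,
                    index_to_movie_id, topn, lmbda=0.5):
    """Greedily build a calibrated recommendation list of size topn,
    at each step adding the candidate item that maximizes
    (1 - lmbda) * s(I) - lmbda * C(p, q).

    scored_items: dict of candidate item index -> score from the recommender.
    """
    calib_reco = []
    total_score = 0.0
    for _ in range(topn):
        best_item, best_objective = None, -np.inf
        for item, score in scored_items.items():
            if item in calib_reco:
                continue

            # genre distribution of the list if we were to add this item
            candidate = calib_reco + [item]
            candidate_distr = compute_genre_distribution(
                candidate, movie_genre, index_to_movie_id)
            kl_div = compute_kl_divergence(interacted_distr, candidate_distr)

            objective = (1 - lmbda) * (total_score + score) - lmbda * kl_div
            if objective > best_objective:
                best_item, best_objective = item, objective

        calib_reco.append(best_item)
        total_score += scored_items[best_item]

    return calib_reco
```

The candidate pool `scored_items` could simply be a larger list (say a few hundred items) returned by the recommender's .recommend call, and setting lmbda close to 1 heavily favors calibration, which is what the comparison below does.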
In the code chunk above, we turned the $\lambda$ knob extremely high to generate the most calibrated recommendation list possible. Let's now compare the calibrated recommendation (which essentially optimizes only for the calibration score), the original recommendation and the user's interaction distribution.
Printing out the genre distribution of the calibrated recommendation list shows that this list covers more genres and its distribution closely resembles the distribution of the user's past historical interactions. Our quantitative calibration metric, KL-divergence, also confirms this, i.e. the calibrated recommendation's KL-divergence is lower than the original recommendation's score.
Thankfully, from the results above, it seems that the re-ranked recommendation list that aims to maximize the calibration score does in fact generate a more calibrated list. But the question is: at what cost? Do other ranking metrics that recommender systems often optimize for drop? Let's take a look at precision_at_k. Here the number for k is the topn parameter that we defined earlier, i.e. the number of recommendations to generate for the user.
Well ..., it's not a surprise that the calibrated recommendation list's precision score is a bit disappointing compared to the original recommendation. But let's see what happens if we try a different value of $\lambda$, this time turning it down a bit to strike a balance between calibration and precision.
Well, well, well. It turns out calibration can be improved considerably while accuracy is reduced only slightly if we find the sweet spot for the tuning parameter $\lambda$.
The following code chunk collects all the code to generate the calibrated recommendation and the original recommendation, and compares them with the user's historical interactions in one place for ease of tracking the flow. This process is outlined for 1 user; feel free to modify the code to perform the comparison across all users. Due to the randomness in the recommendation algorithm, the results might differ across runs, but the underlying trend should remain the same.
End Note
We have calibrated our recommendations here based on movies' genres, but the same idea can be applied if we wish to calibrate our recommendations based on some other trait that matters for the problem at hand.
The approach we took here is from a user-centric perspective, i.e. we are generating calibrated recommendations for each user. We can also tackle the problem from an item-centric perspective to see if items are recommended properly, e.g. whether popular items simply get recommended a lot more often than other items.
Although we used Bayesian Personalized Ranking to generate recommendations for each user, the technique we used to generate the calibrated ranked items is independent of the recommendation algorithm. We're only using the algorithm here to demonstrate the process, so feel free to try this out with a recommendation algorithm of your liking/interest.
The discussion here focused on generating calibrated recommendations anchored on a user's past interactions. So if a user played 70% romance movies and 30% action movies, then when we say our recommendation is well-calibrated, that means the top ranked movies for the user should also consist of 70% romance movies and 30% action movies. We can, however, expand this work to recommend diversified items/movies. In other words, instead of limiting ourselves to only romance and action movies for a user that has only played movies falling under these two genres, we recommend movies from genres outside the user's historical play list.
To elaborate, we were using $p(g|u)$ to represent the user's historical genre interactions; we can extend that distribution to $\tilde{p}(g|u)$, where:

$$\tilde{p}(g|u) = \beta \cdot p_0(g) + (1 - \beta) \cdot p(g|u)$$

Here $p_0(g)$ denotes a prior distribution that takes positive values for all the genres in which we would like to promote diversity in the recommendation. Again we have a tuning parameter $\beta$ to control how "diversified" we would like our recommendations to be. As to how we decide which genres to promote diversity for? Well, that's a topic for another day, but to start with we can use a uniform distribution across all genres or the average over all the users' genre distributions.
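As a tiny worked example of this mixing (the genre set, prior and $\beta$ value below are purely illustrative):

```python
beta = 0.2  # illustrative diversity weight
genres = ['Romance', 'Action', 'War', 'Western']  # hypothetical genre set
prior = {genre: 1.0 / len(genres) for genre in genres}  # uniform prior p0(g)

# p(g|u): the user's historical genre distribution; genres the user never
# played contribute 0 before mixing in the prior
interacted_distr = {'Romance': 0.7, 'Action': 0.3}
p_tilde = {
    genre: beta * prior[genre] + (1 - beta) * interacted_distr.get(genre, 0.0)
    for genre in genres
}
# p_tilde now assigns positive probability to War and Western as well
```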
After reading a lot of tutorials/papers that introduce new algorithms or new libraries/packages that boost model performance in terms of ranking metrics, coming across a paper on a different topic, calibrated recommendation, is a breath of fresh air. It sheds new light on how I think about serving recommendations, so kudos to the original authors who brought this well-written paper to light. I encourage people to read the original paper if interested. Paper: H. Steck - Calibrated Recommendations (2018)
Side Note:
In the generating calibrated recommendations section, we introduced the maximum marginal relevance formula, which has the $\lambda$ knob for us to tune the balance between our original recommendation algorithm's score and the calibration score. This "balancing" notion can actually be applied to many different areas. For example, in the original paper that introduced this idea, the formula was used in the context of a search engine. To elaborate, when issuing a query, a search engine often assembles results that maximize relevance to the user query. By introducing the linear combination "marginal relevance", a result is now said to have high marginal relevance if it is both relevant to the query and contains minimal similarity to previously selected results.
The greedy algorithm we leveraged to actually generate the top-N items that maximize the maximum marginal relevance formula is a discrete optimization method often referred to as submodular optimization. Submodular optimization is a generic optimization method that can also be applied to other areas such as influence maximization; it can be a topic by itself and is discussed in a bit more depth in another notebook. Notebook: Submodular Optimization & Influence Maximization