Bayesian Personalized Ranking
If you are new to the field of recommender systems, please make sure you understand the basics of matrix factorization from this other documentation.
Recall that when doing matrix factorization for implicit feedback data (users' clicks, view times), we start with a user-item matrix, $R$, where the nonzero elements of the matrix are the users' interactions with the items. What matrix factorization does is decompose this large matrix into the product of two smaller ones, namely $R \approx X Y^T$, where the rows of $X$ and $Y$ store the latent user and item factors.
Matrix factorization assumes that:
Each user can be described by $d$ features. For example, feature 1 might be a number referring to how much each user likes Disney movies.
Each item, a movie in this case, can be described by an analogous set of features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to a Disney movie.
With that notion in mind, we can translate it into math by letting a user $u$ take the form of a $d$-dimensional vector $x_u$. Similarly, an item $i$ can be represented by a $d$-dimensional vector $y_i$. And we would predict the interaction that user $u$ will have with item $i$ by taking the dot product of the two vectors:

$$\hat{r}_{ui} = x_u^T y_i = \sum_{k=1}^{d} x_{uk} y_{ik}$$
Where $\hat{r}_{ui}$ represents our prediction for the true interaction $r_{ui}$. Next, we choose an objective function that minimizes the squared difference between all the interactions in our dataset (denoted $S$) and our predictions. This produces an objective function of the form:

$$L = \sum_{u,i \in S} \left( r_{ui} - x_u^T y_i \right)^2$$
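As a quick sanity check of the notation, here's a minimal numpy sketch (with made-up toy sizes and random factors, purely for illustration) of the dot-product predictions and the squared-error objective:

```python
import numpy as np

# toy sizes, purely illustrative
n_users, n_items, d = 5, 7, 3
rng = np.random.default_rng(42)
X = rng.normal(size=(n_users, d))  # user latent factors, one row per user
Y = rng.normal(size=(n_items, d))  # item latent factors, one row per item

# predicted interaction for every user-item pair: r_hat[u, i] = x_u . y_i
r_hat = X @ Y.T

# squared-error objective over the observed (nonzero) entries of a toy matrix R
R = rng.integers(0, 2, size=(n_users, n_items)).astype(float)
observed = R > 0
loss = np.sum((R[observed] - r_hat[observed]) ** 2)
print(loss)
```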
This is all well and good, but a lot of the time, what we wish to optimize is not the difference between the true interaction and the predicted interaction, but instead the ranking of the items. That is, given a user, what are the top-N items that the user prefers most? And this is what Bayesian Personalized Ranking (BPR) tries to accomplish. The idea centers around sampling positive items (items the user has interacted with) and negative items (items the user hasn't interacted with) and running pairwise comparisons.
Formulation
Suppose $U$ is the set of all users and $I$ is the set of all items; our goal is to provide each user with a personalized ranking, denoted by $>_u \subset I \times I$. As mentioned in the last section, the usual approach for item recommendation algorithms is to predict a personalized score $\hat{x}_{ui}$ for an item that reflects the user's preference for that item. The items are then ranked by sorting them according to that score, and the top-N are recommended to the user.
Here we'll take a different approach by using item pairs as training data and optimizing for correctly ranking item pairs. From $S$, the whole dataset, we try to reconstruct parts of $>_u$ for each user. If the user has interacted with item $i$, i.e. $(u, i) \in S$, then we assume that the user prefers this item over all other non-observed items. E.g. in the figure below, user $u_1$ has interacted with item $i_2$ but not item $i_1$, so we assume that this user prefers item $i_2$ over $i_1$: $i_2 >_{u_1} i_1$. We will denote this generally as $i >_u j$, where $i$ stands for the positive item and $j$ for the negative item. For two items that the user has both interacted with, we cannot infer any preference. The same is true for two items that a user has not interacted with yet (e.g. items $i_1$ and $i_4$ for user $u_1$).
Given this information, we can now get to the Bayesian part of this method. Let $\Theta$ be the parameter vector of the model that determines the personalized ranking. BPR's goal is to maximize the posterior probability:

$$p(\Theta \mid >_u) \propto p(>_u \mid \Theta) \, p(\Theta)$$
$p(>_u \mid \Theta)$ is the likelihood function; it captures the individual probability that a user really prefers item $i$ over item $j$. We compute this probability with the form:

$$p(i >_u j \mid \Theta) = \sigma\big(\hat{x}_{uij}(\Theta)\big)$$
Where $\sigma$ is the good old logistic sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
And $\hat{x}_{uij}(\Theta)$ captures the relationship between user $u$, item $i$ and item $j$, which can be further decomposed into:

$$\hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}$$
For convenience we dropped the argument $\Theta$ from $\hat{x}_{uij}$. The formula above is basically saying: what is the difference between the predicted interaction for the positive item $i$ and the negative item $j$? Because of this generic framework, we can apply any standard collaborative filtering model (such as the matrix factorization introduced earlier) that can predict the interaction between a user and an item. Keep in mind that although it may seem like we're using the same models as in other work, here we're optimizing against another criterion, as we do not try to predict a single score $\hat{x}_{ui}$, but instead try to classify the difference of two predictions, $\hat{x}_{ui} - \hat{x}_{uj}$. For those interested in diving deeper, there's a section in the original paper that shows the BPR optimization criterion is actually optimizing AUC (Area Under the ROC curve).
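To make the pairwise score concrete under the matrix factorization model ($\hat{x}_{ui} = x_u^T y_i$), here's a tiny sketch of the likelihood for a single triplet (the function and variable names are mine, not the paper's):

```python
import numpy as np

def bpr_likelihood(x_u, y_i, y_j):
    """p(i >_u j | theta) = sigma(x_uij), with x_uij = x_u.y_i - x_u.y_j."""
    x_uij = x_u @ y_i - x_u @ y_j  # difference of the two predicted interactions
    return 1.0 / (1.0 + np.exp(-x_uij))

# toy example with random factors
rng = np.random.default_rng(0)
x_u, y_i, y_j = rng.normal(size=(3, 10))
print(bpr_likelihood(x_u, y_i, y_j))  # probability the user prefers i over j
```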
So far, we have only discussed the likelihood function. In order to complete the Bayesian modeling approach of the personalized ranking task, we introduce a general prior density $p(\Theta)$, which is a normal distribution with zero mean and variance-covariance matrix $\Sigma_\Theta$:

$$p(\Theta) \sim N(0, \Sigma_\Theta)$$

To reduce the number of unknown hyperparameters, we set:

$$\Sigma_\Theta = \lambda_\Theta I$$
To sum it all up, the full form of the maximum posterior probability optimization (called BPR-Opt in the paper) can be specified as:

$$
\begin{aligned}
BPR\text{-}Opt &= \ln p(\Theta \mid >_u) \\
&= \ln p(>_u \mid \Theta) \, p(\Theta) \\
&= \ln \prod_{(u,i,j) \in D} \sigma(\hat{x}_{uij}) \, p(\Theta) \\
&= \sum_{(u,i,j) \in D} \ln \sigma(\hat{x}_{uij}) + \ln p(\Theta) \\
&= \sum_{(u,i,j) \in D} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \lVert \Theta \rVert^2
\end{aligned}
$$
Where $D$ denotes the set of training triplets, $D = \{(u, i, j) \mid i \in I_u^+ \wedge j \in I \setminus I_u^+\}$, i.e. user $u$ has interacted with item $i$ but not with item $j$ ($I_u^+$ being the set of items user $u$ has interacted with). The last two steps of the derivation deserve a closer look.
We first take the natural log (it's a monotonic transformation, meaning taking the log does not affect the optimization process) to turn the product into a sum and make life easier on us.
As for the $\ln p(\Theta)$ part, recall that we assume each parameter $\theta$ follows a normal distribution with mean zero ($\mu = 0$) and unit variance ($\Sigma_\Theta = I$; we ignore the $\lambda_\Theta$ for now), so the formula for its density is:

$$p(\theta) = \frac{1}{\sqrt{2\pi}} e^{-\frac{\theta^2}{2}}$$
In the formula above, the only thing that depends on $\theta$ is the exponential part on the right; the rest is just a multiplicative constant that we don't need to worry about. Thus, if we take the natural log of that formula, the exponential goes away and our $\ln p(\Theta)$ can be written as $-\lVert \Theta \rVert^2$ (up to an additive constant), and we simply multiply the $\lambda_\Theta$ back in, which can be seen as the model-specific regularization parameter.
Last but not least, in machine learning it's probably more common to try and minimize things, so we simply flip all the signs of the maximization formula above, leaving us with:

$$\underset{\Theta}{\operatorname{argmin}} \; -\sum_{(u,i,j) \in D} \ln \sigma(\hat{x}_{uij}) + \lambda_\Theta \lVert \Theta \rVert^2$$
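This minimization objective translates almost line for line into code. Below is a hedged numpy sketch of it for a batch of triplets (the name `bpr_loss` and the single shared regularization weight `reg` are my choices, not the paper's):

```python
import numpy as np

def bpr_loss(user_factors, item_factors, users, pos_items, neg_items, reg=0.01):
    """Minimization form of BPR-Opt over a batch of (u, i, j) triplets."""
    x_uij = np.sum(user_factors[users]
                   * (item_factors[pos_items] - item_factors[neg_items]), axis=1)
    # ln sigma(x) = -log(1 + exp(-x)); logaddexp keeps this numerically stable
    log_likelihood = -np.logaddexp(0.0, -x_uij).sum()
    reg_term = reg * (np.sum(user_factors[users] ** 2)
                      + np.sum(item_factors[pos_items] ** 2)
                      + np.sum(item_factors[neg_items] ** 2))
    return -log_likelihood + reg_term
```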
Optimization
In the last section, we derived an optimization criterion for personalized ranking. As the criterion is differentiable, gradient descent based algorithms are an obvious choice for optimization. But standard gradient descent is probably not the right choice for our problem, given the enormous number of possible triplets $(u, i, j)$. To solve this issue, we use a stochastic gradient descent algorithm that samples the triplets randomly (uniformly distributed).
To optimize the function using gradient descent, we derive the gradient for the three parameters $x_u$, $y_i$, $y_j$ separately. Just a minor hint when deriving the gradient: remember that the first part of the formula requires the chain rule:

$$\frac{\partial}{\partial \Theta} \ln \sigma(\hat{x}_{uij}) = \frac{\sigma'(\hat{x}_{uij})}{\sigma(\hat{x}_{uij})} \cdot \frac{\partial \hat{x}_{uij}}{\partial \Theta} = \big(1 - \sigma(\hat{x}_{uij})\big) \, \frac{\partial \hat{x}_{uij}}{\partial \Theta}$$

Under the matrix factorization model, $\hat{x}_{uij} = x_u^T (y_i - y_j)$, so the partial derivatives are:

$$\frac{\partial \hat{x}_{uij}}{\partial x_u} = y_i - y_j, \qquad \frac{\partial \hat{x}_{uij}}{\partial y_i} = x_u, \qquad \frac{\partial \hat{x}_{uij}}{\partial y_j} = -x_u$$
After deriving the gradients, the update for each parameter using vanilla stochastic gradient descent is:

$$\Theta \leftarrow \Theta - \alpha \left( -\big(1 - \sigma(\hat{x}_{uij})\big) \frac{\partial \hat{x}_{uij}}{\partial \Theta} + \lambda_\Theta \Theta \right)$$
Where $\alpha$ is the learning rate.
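Putting the chain rule and the updates together, here's what one SGD step could look like for a single sampled triplet (a sketch; the names and in-place update style are mine):

```python
import numpy as np

def sgd_step(X, Y, u, i, j, lr=0.01, reg=0.01):
    """One SGD update for a sampled triplet (u, i, j); X, Y are the factor matrices."""
    # copy so the three updates below all use the parameters' old values
    x_u, y_i, y_j = X[u].copy(), Y[i].copy(), Y[j].copy()
    x_uij = x_u @ (y_i - y_j)
    g = 1.0 / (1.0 + np.exp(x_uij))  # equals 1 - sigma(x_uij), the chain-rule factor
    X[u] += lr * (g * (y_i - y_j) - reg * x_u)
    Y[i] += lr * (g * x_u - reg * y_i)
    Y[j] += lr * (-g * x_u - reg * y_j)
```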
Implementation
We will again use the movielens data as an example.
Because BPR assumes binary implicit feedback (meaning there are only positive and negative items), here we'll treat an item as positive only if the user gave it a rating above 3 (feel free to experiment and change the threshold). The next few code chunks create the sparse interaction matrix and split it into train and test sets.
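A rough sketch of what those chunks might look like, assuming MovieLens' u.data file has been downloaded to the working directory (the split strategy here is a simple random one and may differ from the documented code):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=names)

# binarize: keep only the "positive" items, i.e. rating above 3
df = df[df['rating'] > 3]

# map the raw ids to contiguous indices usable as matrix positions
rows = df['user_id'].astype('category').cat.codes.to_numpy()
cols = df['item_id'].astype('category').cat.codes.to_numpy()
data = np.ones(len(df))
ratings = csr_matrix((data, (rows, cols)))

# a simple random 80/20 split of the interactions
rng = np.random.default_rng(1234)
mask = rng.random(len(df)) < 0.8
train = csr_matrix((data[mask], (rows[mask], cols[mask])), shape=ratings.shape)
test = csr_matrix((data[~mask], (rows[~mask], cols[~mask])), shape=ratings.shape)
```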
The following section provides an implementation of the algorithm from scratch.
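In case the full code is not at hand, here's a condensed from-scratch sketch of BPR with matrix factorization (the class layout, hyperparameter defaults, and batching strategy are my assumptions, not necessarily the documented implementation):

```python
import numpy as np


class BPR:
    """Minimal BPR with matrix factorization, trained by SGD on sampled triplets."""

    def __init__(self, n_factors=15, learning_rate=0.1, reg=0.01,
                 n_iters=1000, batch_size=32, seed=1234):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.n_iters = n_iters
        self.batch_size = batch_size
        self.seed = seed

    def fit(self, ratings):
        """ratings: scipy sparse user-item matrix of positive interactions;
        assumes every user has at least one positive item."""
        indptr, indices = ratings.indptr, ratings.indices
        n_users, n_items = ratings.shape
        rng = np.random.default_rng(self.seed)

        # initialize the latent factors with small random values
        self.user_factors = rng.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = rng.normal(scale=0.1, size=(n_items, self.n_factors))

        for _ in range(self.n_iters):
            # sample users uniformly, then a positive and a negative item each
            u = rng.integers(n_users, size=self.batch_size)
            i = np.empty(self.batch_size, dtype=np.int64)
            j = np.empty(self.batch_size, dtype=np.int64)
            for idx, user in enumerate(u):
                pos = indices[indptr[user]:indptr[user + 1]]
                i[idx] = rng.choice(pos)
                neg = rng.integers(n_items)
                while neg in pos:  # resample until the item is unobserved
                    neg = rng.integers(n_items)
                j[idx] = neg
            self._update(u, i, j)
        return self

    def _update(self, u, i, j):
        """One SGD step over a batch of (u, i, j) triplets."""
        user_u = self.user_factors[u]
        item_i = self.item_factors[i]
        item_j = self.item_factors[j]

        x_uij = np.sum(user_u * (item_i - item_j), axis=1)
        g = (1.0 / (1.0 + np.exp(x_uij))).reshape(-1, 1)  # 1 - sigma(x_uij)

        self.user_factors[u] += self.learning_rate * (g * (item_i - item_j) - self.reg * user_u)
        self.item_factors[i] += self.learning_rate * (g * user_u - self.reg * item_i)
        self.item_factors[j] += self.learning_rate * (-g * user_u - self.reg * item_j)

    def predict(self):
        """Dense matrix of predicted scores x_hat_ui for every user-item pair."""
        return self.user_factors @ self.item_factors.T
```

With the train matrix from the previous chunk, training would then be a matter of `bpr = BPR().fit(train)`.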
Evaluation
In recommender systems, we are often interested in how well the method can rank a given set of items. To do that, we'll use AUC (Area Under the ROC Curve) as our evaluation metric. The best possible value the AUC metric can take is 1, and any non-random ranking that makes sense should have an AUC > 0.5. An intuitive interpretation of AUC is that it specifies the probability that when we draw two examples at random, their predicted pairwise ranking is correct. The following documentation has a more detailed discussion of AUC in case you're not familiar with it.
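A per-user AUC evaluation can be sketched as follows, assuming the model exposes a `predict()` method returning the dense user-item score matrix (as in the sketch above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_score(model, ratings):
    """Mean per-user AUC on a sparse interaction matrix."""
    n_users, n_items = ratings.shape
    predictions = model.predict()
    auc, n_evaluated = 0.0, 0
    for u in range(n_users):
        y_true = np.zeros(n_items)
        y_true[ratings[u].indices] = 1
        # skip degenerate users whose labels are all positive or all negative
        if 0 < y_true.sum() < n_items:
            auc += roc_auc_score(y_true, predictions[u])
            n_evaluated += 1
    return auc / n_evaluated

# e.g. auc_score(bpr, train) and auc_score(bpr, test)
```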
Item Recommendations
Given the trained model, we can get the most similar items by using the get_similar_items method; the number of most similar items to return is specified by the N argument. This can be seen as the "people who like/buy this also like/buy this" functionality, since it recommends similar items for a given item.
On the other hand, we can also generate the top-N recommended items for each user by passing the sparse rating data and N to the recommend method.
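The exact signatures of these two methods come from the implementation; purely as an illustration, here is one plausible way such functionality could be built on top of the learned factors (my own sketch, not necessarily how the documented methods work):

```python
import numpy as np

def get_similar_items(item_factors, N=5):
    """Top-N most similar items per item, by cosine similarity of latent factors."""
    normed = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude the item itself
    return np.argsort(-sims, axis=1)[:, :N]

def recommend(model, ratings, N=5):
    """Top-N items per user, excluding items the user has already interacted with."""
    scores = model.predict()
    for u in range(ratings.shape[0]):
        scores[u, ratings[u].indices] = -np.inf  # mask out items already seen
    return np.argsort(-scores, axis=1)[:, :N]
```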
For these two methods, we can go one step further and look up the actual items behind these indices to see if they make intuitive sense. If we wish to do this, the movielens dataset has a u.item file that contains metadata about each movie.
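For example (a sketch; u.item in the MovieLens 100k release is pipe-separated and latin-1 encoded, with the movie id and title as its first two fields):

```python
import pandas as pd

# read only the movie id and title columns from u.item
movies = pd.read_csv('u.item', sep='|', encoding='latin-1',
                     header=None, usecols=[0, 1], names=['movie_id', 'title'])

# note: movie_id is the raw 1-based id; map it through the same categorical
# encoding used when building the sparse matrix before looking up indices
print(movies.head())
```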