Alternating Least Squares with Weighted Regularization
Recommendation systems have become a popular topic in recent years. Their goal is to predict the "rating" or "preference" that a user would give to an item; given these predicted ratings, a business can then recommend new items the user is likely to purchase. The type of recommendation system we'll be discussing is known as collaborative filtering, where features of the user (e.g. age, gender) or of the item (e.g. perishable or not) play no role in the algorithm. Instead, we rely solely on the ratings that users gave to items.
We start by loading some sample data to make this a bit more concrete. For this introduction, we'll be using the MovieLens dataset. The website hosts datasets of various sizes, but we'll start with the smallest one, the MovieLens 100K Dataset. It consists of 100,000 movie ratings by users on a 1-5 scale. The main data file, u.data, consists of four tab-delimited columns: user id (starting at 1), item id (starting at 1), rating, and timestamp (we won't be using this field).
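As a quick sketch of what loading this might look like (the local path `u.data` and the variable names are assumptions; point the path at wherever the dataset was unzipped):

```python
import pandas as pd

# the file is tab-delimited with no header row, so we supply the column names
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=names)
print(df.head())
```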
As we can see, the only data we have is how each user rated each item. Given this information, collaborative filtering starts by constructing a user-item matrix, where each distinct user is a row, each item is a column, and the value of each cell is the rating that the user gave to the item. Apart from building the matrix, we'll also print out some other information to help us understand the data a bit better.
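Here is a minimal sketch of building that matrix, assuming the `df` DataFrame from the loading step above (again, the variable names are my own):

```python
import numpy as np

# build the user-item matrix; cells without a rating stay 0
n_users = df['user_id'].nunique()
n_items = df['item_id'].nunique()
ratings = np.zeros((n_users, n_items))
for row in df.itertuples():
    # the ids start at 1, so shift them to 0-based indices
    ratings[row.user_id - 1, row.item_id - 1] = row.rating

sparsity = np.count_nonzero(ratings) / (n_users * n_items)
print('number of users:', n_users)
print('number of items:', n_items)
print('percentage of cells with a rating: {:.1%}'.format(sparsity))
```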
From the information above, we know there are 943 unique users and 1682 unique items. Within the rating matrix, only 6.3% of the cells have a value. Although we filled in the empty cells with 0, we should not assume these values are truly zero; they are simply missing entries. This kind of sparsity is extremely common in recommendation systems, since people seldom rate the things they have purchased. One thing to note is that if less than 1% of the cells are filled (a common rule of thumb), the dataset might be too sparse to perform any sort of modeling.
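One way to produce the histogram referenced below, assuming the `ratings` matrix built above (the matplotlib usage here is just an illustration):

```python
import matplotlib.pyplot as plt

# count how many ratings each user has given (nonzero cells per row)
ratings_per_user = np.count_nonzero(ratings, axis=1)
plt.hist(ratings_per_user, bins=50)
plt.xlabel('number of ratings given by a user')
plt.ylabel('number of users')
plt.show()
print('fewest ratings by any user:', ratings_per_user.min())
```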
Next, the histogram tells us that every user has given more than 10 ratings. We will use this fact to perform the train/test split for evaluating the algorithm that we'll build later.
One tricky thing about splitting the data into training and testing sets: in supervised machine learning, we normally build the training and testing holdout sets by randomly splitting the rows. That works because we have a model with features and a target that we are trying to fit a function to. For recommender systems with collaborative filtering (no features), this no longer works, because all of the items and users need to be available when the model is first built. So what we do instead is mask a random sample of the user/item ratings and use those masked values to validate how well the recommender system predicts them. In our case, since we already know each user has given more than 10 ratings, for every user we remove 10 of their item ratings and assign them to the test set. The testing matrix is printed below; hopefully you can see that some of the values are indeed different from the original rating matrix.
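A rough sketch of that masking procedure follows (the function name and the random seed are assumptions for illustration; the 10 masked ratings per user come from the description above):

```python
def create_train_test(ratings, n_masked=10, seed=1234):
    """For each user, move `n_masked` of their observed ratings into the test set."""
    rstate = np.random.RandomState(seed)
    train = ratings.copy()
    test = np.zeros(ratings.shape)
    for user in range(ratings.shape[0]):
        rated_items = ratings[user].nonzero()[0]
        masked = rstate.choice(rated_items, size=n_masked, replace=False)
        train[user, masked] = 0.0
        test[user, masked] = ratings[user, masked]
    return train, test

train, test = create_train_test(ratings)
# sanity check: no rating should appear in both the train and test matrices
assert np.all(train * test == 0)
```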
Matrix Factorization
Now that the data preprocessing part has been taken care of, let's get to the more exciting part, the algorithm. The algorithm that we'll introduce today is Matrix Factorization.
Recall that we have a user-item matrix where the nonzero elements are the ratings that users have given to items. What Matrix Factorization does is decompose this large matrix into the product of two smaller matrices, namely $R \approx X Y^{\top}$, where $X$ holds the user latent vectors and $Y$ holds the item latent vectors. See the picture below (taken from a quick google search) for a better understanding:
Matrix factorization assumes that:
Each user can be described by $k$ features. For example, feature 1 might be a number referring to how much each user likes Disney movies.
Each item, a movie in this case, can be described by an analogous set of $k$ features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to a Disney movie.
After learning the two smaller matrices, all we have to do is perform a matrix multiplication of the two, and the end result is our approximation for the rating the user would give that item (movie).
The cool thing about this is that we do not know what these features are, nor do we have to determine them beforehand, which is why they are often referred to as latent features. We also don't know how many latent features are optimal for the task at hand, so we can use random search, grid search, or other fancier hyperparameter tuning techniques to determine the best value of $k$, the number of latent features.
Given all this information, the next question is: how do we learn the user matrix, $X$, and the item matrix, $Y$? Well, like a lot of machine learning algorithms, by minimizing a loss function.
We start by putting this into math: let each user $u$ take the form of a $k$-dimensional vector $x_u$. These are often referred to as latent vectors or low-dimensional embeddings. Similarly, each item $i$ can be represented by a $k$-dimensional vector $y_i$. The rating that we predict user $u$ will give item $i$ is then simply the dot product of the two vectors:

$$\hat{r}_{ui} = x_u^{\top} y_i$$

where $\hat{r}_{ui}$ represents our prediction for the true rating $r_{ui}$. Next, we choose an objective function that minimizes the squared difference between all the ratings in our dataset (let $S$ denote the set of observed user/item pairs) and our predictions. This produces an objective function of the form:

$$L = \sum_{u,i \in S} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$
Note that we've added two regularization terms at the end, with $\lambda$ controlling their strength, to prevent overfitting of the user and item vectors. $\lambda$ is another hyperparameter whose best value we'll have to search for. The concept of regularization can be a topic of its own; if you're confused by it, consider checking out this documentation that covers it a bit more.
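For concreteness, here is a small sketch of this objective computed with NumPy, assuming `X` and `Y` hold the user and item latent vectors as rows, `ratings` is the user-item matrix with 0 marking missing entries, and a single `reg` value plays the role of $\lambda$ (all names here are my own conventions):

```python
import numpy as np

def regularized_loss(X, Y, ratings, reg):
    """Squared error over the observed (nonzero) ratings plus L2 penalties."""
    mask = ratings > 0                      # only score the observed cells
    pred = X @ Y.T                          # predicted ratings for every user/item pair
    squared_error = np.sum((mask * (ratings - pred)) ** 2)
    regularization = reg * (np.sum(X ** 2) + np.sum(Y ** 2))
    return squared_error + regularization
```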
Now that we have formalized our objective function, we'll introduce the Alternating Least Squares with Weighted Regularization (ALS-WR) method for optimizing it. It works by first treating one set of latent vectors as constant; for this example, we'll pick the item vectors, $y_i$. We then take the derivative of the loss function with respect to the other set of vectors, the user vectors $x_u$, set it to zero, and solve for the user vectors:

$$x_u^{\top} = r_u Y \left( Y^{\top} Y + \lambda I \right)^{-1}$$
To clarify it a bit, let us assume that we have $n$ users and $m$ items, so our ratings matrix is $R \in \mathbb{R}^{n \times m}$.
The row vector $r_u$ represents user $u$'s row from the ratings matrix, i.e. that user's ratings for all the items (so it has dimension $1 \times m$)
We introduce the symbol $Y$, with dimensions $m \times k$, to represent all the item latent vectors stacked vertically as rows
Lastly, $I$ is the $k \times k$ identity matrix, which ensures the matrix multiplication's dimensionality works out when we add the regularization term $\lambda$
Now comes the alternating part: with these newly updated user vectors in hand, in the next round we hold them constant and take the derivative of the loss function with respect to the previously constant vectors (the item vectors). As the derivation for the item vectors is quite similar, we will simply list out the end formula:

$$y_i^{\top} = r_i X \left( X^{\top} X + \lambda I \right)^{-1}$$

where $r_i$ is the $1 \times n$ vector of ratings for item $i$ and $X$, with dimensions $n \times k$, stacks all the user latent vectors vertically as rows.
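The two update formulas are symmetric, so a single helper can perform either half of the alternation. Below is a minimal sketch of such a helper (the name `als_step` and the use of NumPy's `np.linalg.solve` are my own choices, not anything prescribed by the algorithm); it solves the regularized normal equations for every row at once and, like the formulas above, treats missing entries as ratings of 0:

```python
import numpy as np

def als_step(fixed_vectors, ratings, reg):
    """Solve for one set of latent vectors while the other set stays fixed.

    `ratings` must be oriented so that its rows correspond to the vectors
    being solved for: pass the user-item matrix when updating the user
    vectors, and its transpose when updating the item vectors.
    """
    k = fixed_vectors.shape[1]
    # normal-equation matrix: fixed^T fixed + lambda * I, shape (k, k)
    A = fixed_vectors.T @ fixed_vectors + reg * np.eye(k)
    # right-hand sides, one column per latent vector being solved for
    b = fixed_vectors.T @ ratings.T
    return np.linalg.solve(A, b).T

# user update: fix Y and solve for X; item update: fix X and solve for Y
# X = als_step(Y, train, reg)
# Y = als_step(X, train.T, reg)
```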
Then we alternate back and forth, carrying out this two-step process until convergence. The reason we alternate is that optimizing the user latent vectors, $x_u$, and the item latent vectors, $y_i$, simultaneously is hard; if we fix one set and tackle one problem at a time, we turn it into a much easier sub-problem. To summarize, ALS works by:
1. Initialize the user latent vectors, $X$, and the item latent vectors, $Y$, with random values
2. Fix $Y$ and solve for $X$
3. Fix $X$ and solve for $Y$
4. Repeat steps 2 and 3 until convergence
Now that we have our equations, let's program this thing up! We'll set the model to use 20 factors and a regularization value of 0.01 (chosen arbitrarily), train it for 100 iterations, and compute the mean squared error on the train and test sets to assess convergence.
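Below is one possible implementation, sketched as a small `ALS` class (the class and variable names, and the `train`/`test` matrices from the split above, are my own conventions rather than anything fixed by the algorithm). It wraps the update step from earlier in a training loop and records the mean squared error over the observed entries of both matrices after every iteration:

```python
class ALS:
    """Bare-bones alternating least squares for explicit ratings.

    As in the update formulas above, unobserved cells of the training
    matrix are treated as ratings of 0 during the solves.
    """

    def __init__(self, n_factors=20, reg=0.01, n_iters=100, seed=1234):
        self.n_factors = n_factors
        self.reg = reg
        self.n_iters = n_iters
        self.seed = seed

    def fit(self, train, test):
        n_users, n_items = train.shape
        rstate = np.random.RandomState(self.seed)
        # step 1: initialize the user and item latent vectors randomly
        self.X = rstate.random_sample((n_users, self.n_factors))
        self.Y = rstate.random_sample((n_items, self.n_factors))

        self.train_mse, self.test_mse = [], []
        for _ in range(self.n_iters):
            # steps 2 and 3: alternate the two closed-form solves
            self.X = self._als_step(self.Y, train)
            self.Y = self._als_step(self.X, train.T)
            pred = self.X @ self.Y.T
            self.train_mse.append(self._mse(train, pred))
            self.test_mse.append(self._mse(test, pred))
        return self

    def _als_step(self, fixed, ratings):
        A = fixed.T @ fixed + self.reg * np.eye(self.n_factors)
        return np.linalg.solve(A, fixed.T @ ratings.T).T

    @staticmethod
    def _mse(actual, pred):
        # evaluate only on the observed (nonzero) entries
        mask = actual.nonzero()
        return np.mean((actual[mask] - pred[mask]) ** 2)


als = ALS(n_factors=20, reg=0.01, n_iters=100)
als.fit(train, test)
print('final train mse:', als.train_mse[-1])
print('final test mse:', als.test_mse[-1])
```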
Notice that ALS converges after only a few sweeps, which is one of the main reasons for its popularity: it is fast, and thus scales to bigger datasets.