Week 14: Recommender Systems 2
Collaborative Filters
Set Up
Defining Data
In Chapter 6, Banik uses the MovieLens dataset to explore collaborative filtering. We're going to use what's called a "toy" dataset, which is just a very small dataset. This makes it easier to see what's happening at each step, though our predictions will be worse because we have much less data to go on.
With the data loaded, our job is to predict the rating, given a user and a movie. We will do this as a regression problem. In some instances, we could view this as categorical data instead of numerical data, because we have discrete values from 1 to 5. But, since this is ordinal data (the order of the numbers has meaning), we'll treat it as continuous data. We want our regressor to "understand" that a mistaken rating of 1 when it should be 5 is a bigger mistake than a rating of 4 would be. Classification problems don't understand that nuance.
Let's split the data into train and test sets. Banik uses a hack here to stratify on the user. Stratifying on the user ensures that we have some of each user's ratings in both the train and the test set.
Since we have such a small dataset, we can explore what's in our training and test data. You can see that every user is in both the training and test data, though not in equal measure.
The variables y_train and y_test won't actually be used in our code. They're just used as a way to stratify the data. Typically you'd see y as the variable you're trying to predict. That's not how we're doing it here, since our X_train and X_test data are actually dataframes that contain both what we're using to make predictions (user_id and movie_id combination) and what we're predicting (rating). (It's a bit weird. We know.)
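Here's a minimal sketch of what that split can look like, assuming the ratings live in a dataframe called `ratings` with user_id, movie_id, and rating columns (the names and the test size here are assumptions):

```python
from sklearn.model_selection import train_test_split

# X holds the full rows (including the rating we want to predict);
# y exists only so we can stratify on the user.
X = ratings.copy()
y = ratings['user_id']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```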
RMSE Metric
Our metric for evaluation will be the Root Mean Squared Error (RMSE). Banik builds a wrapper function around scikit-learn's mean_squared_error function, but with recent versions of scikit-learn we can simply use root_mean_squared_error.
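For example, with a couple of made-up ratings:

```python
from sklearn.metrics import root_mean_squared_error

# Requires scikit-learn >= 1.4; errors of 0, 1, and 1 give sqrt(2/3) ~ 0.816.
root_mean_squared_error([3, 5, 1], [3, 4, 2])
```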
We're going to hand-code a series of models. All our models will take in a user_id and a movie_id, and attempt to predict the rating. (Generalizing this, we could say that they take in a user_id and an item_id, as movies is just one thing we could use this for.)
Let's define a baseline model. Our hand-coded baseline model always returns the MEDIAN of our ratings scale (not the median of all of our users' ratings). In other words, our baseline model is trying to be as noncommittal as possible. Later you'll see how to do a different baseline model with the Surprise package that uses a random rating based on a normal distribution.
Let's walk through how to get the median of our scale using Numpy.
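For a 1-5 scale, that looks like this:

```python
import numpy as np

# The median of the scale itself (not of the observed ratings) is 3.
np.median(np.arange(1, 6))
```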
We're going to alter Banik's function so that it also accepts optional arguments. We don't need any for this function, but later we will need additional arguments and this keeps our coding consistent.
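A minimal sketch of that altered baseline (the function name is an assumption; the *args are accepted and ignored):

```python
def baseline(user_id, movie_id, *args):
    """Ignore the inputs entirely and predict the midpoint of the scale."""
    return 3.0
```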
Next we need a way to score our model.
Here's where we diverge from Banik's approach just a bit. Instead of relying on global variables, we will explicitly pass in our data for our scoring model. Note we're again using the special parameter *args. This tells our scoring function to accept any optional arguments we might need, and we'll pass those right along to our model.
We are also going to follow the example of the Surprise package and assume that our data has 3 columns in this order:
the user id
the item id
the rating
(This means that the score method will work for any dataframe that's set up that way, regardless of the column names. It's the order that matters, not the names of the columns.)
We'll also use sklearn's built-in RMSE function.
Here's the complete function.
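A sketch of what that complete function can look like (the name `score` is an assumption):

```python
from sklearn.metrics import root_mean_squared_error

def score(cf_model, data, *args):
    """RMSE of a model over a dataframe whose first three columns are
    user id, item id, and rating, in that order."""
    # Predict a rating for every (user, item) pair, forwarding any extras.
    preds = [cf_model(row[0], row[1], *args)
             for row in data.itertuples(index=False)]
    return root_mean_squared_error(data.iloc[:, 2], preds)
```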
Basic Models
Everything we've done so far is just setting us up to be able to use something more than our baseline model to do some real user-based collaborative filtering. Now let's try out some basic approaches and compare them to our baseline model.
Before we can start, though, we need to do yet more data wrangling. We need a matrix that has movies as columns and users as rows, with each user's rating for that movie at the intersection. Note that although we know that every user has rated every movie, we don't have all that data in our training set, so we still end up with some NaN values.
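One way to build that matrix is with a pandas pivot table (the column names are assumptions):

```python
# Users as rows, movies as columns, ratings at the intersections;
# pairs missing from the training data become NaN.
r_matrix = X_train.pivot_table(values='rating', index='user_id',
                               columns='movie_id')
```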
Mean
Note that our mean function requires the ratings_matrix argument. Here's where that *args parameter comes in. We can pass r_matrix to our score function and it gets passed along to our cf_user_mean model.
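A sketch of the mean model, under the assumptions above:

```python
def cf_user_mean(user_id, movie_id, ratings_matrix):
    """Predict the mean of all observed ratings for this movie."""
    if movie_id in ratings_matrix:
        return ratings_matrix[movie_id].mean()  # pandas skips NaNs
    return 3.0  # fall back to the scale midpoint for unseen movies
```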
Let's look at what the cf_user_mean() function would return for movie 15. Movie 15 has 2 ratings: [5, 2]. You can see that it returns the average of those ratings, or 3.5.
If we use our score function to get the predicted rating for the entire matrix, we can get the RMSE.
Weighted Mean
Weighted mean is going to give more weight to the users that are more similar to each other. We'll do this using cosine similarity. Let's look at the function from the book:
What this says is that the rating for each user-item combination will be the dot product of two vectors:
the similarity scores between this user and other users
the ratings of other users
and this will be divided by the sum of the similarity scores. To calculate this value, we need a cosine similarity matrix between our users' ratings.
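Written out, the predicted rating of user $u$ for movie $m$ is

$$\hat{r}_{u,m} = \frac{\sum_{u'} \text{sim}(u, u') \, r_{u',m}}{\sum_{u'} \text{sim}(u, u')}$$

where the sums run over the other users $u'$ who have rated movie $m$.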
We covered this in the video, but here's a breakdown of how we'd calculate the rating for user 4 and movie 12 (2 in the video, but we updated the IDs). We'll fold this into our weighted mean function below, but we're pulling it out here just for clarity.
With the cosine similarity matrix in hand, we can set up the weighted mean function. This function needs 2 additional arguments: the ratings matrix and the cosine similarity matrix (c_sim_matrix).
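A sketch of that function, assuming `c_sim_matrix` is a dataframe indexed by user id on both axes:

```python
import numpy as np

def cf_user_wmean(user_id, movie_id, ratings_matrix, c_sim_matrix):
    """Similarity-weighted mean of other users' ratings for this movie."""
    if movie_id not in ratings_matrix:
        return 3.0  # scale midpoint for unseen movies

    sim_scores = c_sim_matrix[user_id]             # similarity to every user
    m_ratings = ratings_matrix[movie_id].dropna()  # users who rated the movie
    sim_scores = sim_scores[m_ratings.index]

    return np.dot(sim_scores, m_ratings) / sim_scores.sum()
```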
Model-Based Approaches
All of the above models were relatively simple and straightforward calculations, even if the code to call them was a little convoluted.
Machine learning algorithms, on the other hand, can give us a more powerful approach, with more complicated calculations. But the Surprise package makes the code to call them surprisingly simple.
We're providing some sample code below and a walkthrough video to introduce you to using the Surprise package:
Baseline: Normal Predictor
Surprise includes several baseline predictors. We'll take a look at the normal predictor, which predicts a random rating drawn from a normal distribution whose mean and standard deviation are estimated from the training data. If you look at the histogram of our ratings below, you can see that it's unlikely that our ratings follow a normal distribution. In fact, they don't: we generated them from a discrete uniform distribution, in which each of the numbers is equally likely. Given what we know about our actual ratings, we would not expect the normal predictor to be a good baseline for our data. But we'll use it anyway, just to give you a feel for it.
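A minimal sketch of cross-validating the normal predictor, assuming our ratings are in a dataframe named `ratings` (column names are assumptions; Surprise expects user, item, rating order):

```python
from surprise import Dataset, NormalPredictor, Reader
from surprise.model_selection import cross_validate

# Tell Surprise our rating scale, then load the dataframe.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

cross_validate(NormalPredictor(), data, measures=['RMSE'], cv=5, verbose=True)
```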
Calling predictions
We can use our trained model to predict ratings for any user/item combination. Remember that for this particular algorithm, even unknown users or unknown items will get individual rating estimates, since the algorithm is just returning a random number anyway.
We can generate predictions for our entire dataframe by using a lambda function on each row of data. (Note that the first row of the dataframe matches our hand-coded prediction for user 1 and movie 11 above.)
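Sketched out, with `trained_algo` standing in for whichever fitted Surprise model you're using:

```python
# One user/item pair; .est holds the predicted rating.
trained_algo.predict(uid=1, iid=11).est

# Every row of the dataframe, via a lambda (column names are assumptions).
ratings['predicted'] = ratings.apply(
    lambda row: trained_algo.predict(row['user_id'], row['movie_id']).est,
    axis=1)
```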
K Nearest Neighbors
When Banik looked at demographics, he was using explicit data to determine what makes people "similar" and assuming that if they were similar in that respect, their taste in movies would be similar as well. That might be a faulty assumption. Gender, occupation, or other simple characteristics may not have any bearing on how people rate movies. But there might be some underlying trends in the data that do result in commonalities in ratings.
K Nearest Neighbors tries to uncover these commonalities by training a model on some data and identifying clusters of users that are "near" one another.
Specifically, what this algorithm does is:
Finds the k nearest neighbors who have rated movie m
Outputs the similarity-weighted average of those neighbors' ratings for movie m
The documentation for KNNBasic goes over all the parameters you can set when you're setting up the algorithm.
Note that in this toy set, since we only have a handful of neighbors, we will need to decrease the number of neighbors (k) that the algorithm takes into consideration. Otherwise, we'll just be getting the mean of all the considered ratings in each fold.
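For example (the k value here is an assumption; the Surprise defaults are k=40 and min_k=1):

```python
from surprise import KNNBasic
from surprise.model_selection import cross_validate

algo = KNNBasic(k=3)  # consider at most 3 neighbors instead of 40
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)
```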
Interestingly, our RMSE for the K-nearest neighbors algorithm is actually worse than our normal predictor. This is most likely because we have a tiny dataset, or we're using the wrong k value for this data.
We can use grid search to identify the best k for this set of data. We'll set up this grid search to get an unbiased accuracy metric.
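A sketch of that grid search using Surprise's GridSearchCV (the grid values are assumptions):

```python
from surprise.model_selection import GridSearchCV

param_grid = {'k': [2, 3, 4, 5], 'min_k': [1, 2, 3]}
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best cross-validated RMSE
print(gs.best_params['rmse'])  # parameter combination that produced it
```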
Using the grid search, we were able to get a better RMSE. Extracting the best parameters shows us that the best combination was a max k of 5 and a min k of 3. By default, KNNBasic uses a min k of 1, so our grid search found that requiring at least 3 neighbors gave a better result than sometimes predicting from fewer than 3.
If we wanted to use this model for predictions, we'd want to retrain the model, using the best parameters, on all of our available data. We can do that by setting the data.raw_ratings back to the complete raw ratings, setting up a full trainset, instantiating our model with the best params, and fitting it on the full trainset.
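A sketch of that retraining step, assuming `all_raw_ratings` is a copy of data.raw_ratings saved before any splitting:

```python
# Restore every rating, build a trainset from all of it, and refit.
data.raw_ratings = all_raw_ratings
full_trainset = data.build_full_trainset()

knn_gs_algo = KNNBasic(**gs.best_params['rmse'])
knn_gs_algo.fit(full_trainset)
```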
Singular Value Decomposition (SVD)
The theory behind SVD is covered in Banik's book. The very high-level concept is that it's a method that allows you to reduce the dimensions of a sparse matrix and "fill in the blanks" with predictions. Under the hood, the algorithm uses stochastic gradient descent to attempt to minimize errors. We don't expect you to understand all the intricacies. We would like you to understand a couple of the hyperparameters you can tune, which are some of the same hyperparameters in every stochastic gradient descent algorithm.
n_epochs: this is the number of times the minimization steps are performed. The higher the number of times, the longer the algorithm will work to find the minimum error. The default is 20.
learning rate (lr_all in Surprise): this is a number that determines how much to change the model each iteration. Think of it as how big of a step the model takes in each iteration. Too large and you may never find your minimum. Too small and your model will be very slow and could get stuck in a local minimum. The default is 0.005.
regularization (reg_all in Surprise): this is a penalty term applied to prevent model overfitting. The default is 0.02.
We won't demonstrate using these hyperparameters in the lesson, but you'll need them for the homework. Functionally, you'd use them the same way we used k and min_k with the KNNBasic algorithm.
The code to do simple cross-validation with SVD is extremely short, once you've already got a Surprise data object set up. Read the full documentation if you're curious.
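A minimal sketch, reusing the `data` object from earlier:

```python
from surprise import SVD
from surprise.model_selection import cross_validate

# Defaults throughout; the hyperparameters above would be passed as keyword
# arguments, e.g. SVD(n_epochs=20, lr_all=0.005, reg_all=0.02).
cross_validate(SVD(), data, measures=['RMSE'], cv=5, verbose=True)
```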
Hybrid Recommenders
Hybrid recommenders are probably the most common recommenders you'll see in the real world, and there are many approaches to building hybrid recommenders. Banik specifically walks through one type of recommender that combines a content-based recommender with a collaborative filter. This is a relatively simplistic hybrid recommender, but it's a good place to start, since it combines the strengths of two different recommender types.
As you might guess, we have some suggested updates to his code. All that work you did last week to set up functions is really going to help you this week. Let's pull in the functions we used last week.
The first thing we need to do is to decide which data to use for our content recommender. Since we're dealing with a toy dataset anyway, let's keep this simple and just use the genres as our soup. All we'll need to do to create our soup columns is to join the genres into a string. We'll use a simple count vectorizer with 'english' stopwords and no restriction on the max number of features.
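A sketch of that content setup, assuming a `movies` dataframe whose genres column holds lists of genre strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Join each movie's genres into one space-separated "soup" string.
movies['soup'] = movies['genres'].apply(' '.join)

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
```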
With our similarity matrix in hand, we can fetch our most similar movies. Let's fetch the 3 most similar movies to Jumanji. Since we have such a small dataset, that means we're only eliminating one. If you look at the dataframe above, can you predict which one will be eliminated?
We already have a variety of trained algorithms we could use to predict how a certain user would rate these movies. Let's add a prediction for user 1 to our results dataframe using our trained knn_gs_algo, and sort in descending order by those predictions.
Given this data, we'd want to recommend 'Balto' or 'The Wizard of Oz.' Would we want to recommend 'Treasure Island?' It's similar to 'Jumanji,' but our recommender estimates we wouldn't rate it very highly.
Putting it together
Let's wrap up what we did above in a nice hybrid function. There are multiple ways we could do this; we'll show you one of them. In this approach, we're doing our content recommender outside our hybrid function and passing in the resulting recommendations. In the homework, we'll ask you to pass in all the parameters needed to run your content recommender inside your hybrid function.
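One sketch of such a function (the names and column layout are assumptions):

```python
def hybrid(similar_movies, user_id, algo, n=3):
    """Rank content-based candidates by a trained CF model's predicted
    rating for this user, highest first."""
    results = similar_movies.copy()
    results['est_rating'] = results['movie_id'].apply(
        lambda movie_id: algo.predict(user_id, movie_id).est)
    return results.sort_values('est_rating', ascending=False).head(n)
```

Calling it with the similar movies we fetched above and knn_gs_algo should reproduce the ranked dataframe we built by hand.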
Self Assessment
Follow the examples and use the code files provided from chapters 5-7 in Hands-On Recommendation Systems with Python by Rounak Banik to do the following self-assessment exercises.
The self-assessments in this lesson will be using a subset of data from the Book-Crossing dataset. See the Book-Crossing dataset page for more details.
User-Based Collaborative Filter
Self-Assessment: Setting up the File
The file BX-Book-Ratings-3000.csv (found in the presentation download for this lesson) is loaded here for you, though you may need to change the file path unless you create the same folder structure. Note that book ratings have been adjusted so the scale goes from 1 to 11.
Run the cell below to load the file, and then do the following:
display the first 5 lines of the data (get familiar with the data frame)
calculate the mean book rating for all books (just to get an idea)
split the data set so that 70% of each user's ratings are in the training set and 30% are in the testing set
Self-Assessment: Baseline RMSE to Assess Model Performance
Build a baseline model that assigns a neutral rating and compute the RMSE of these simple "predictions" using the testing set. Make sure this model accepts *args so that it aligns with more complicated models.
A neutral rating would occur at the midpoint of the rating scale. Calculate the median of the rating scale to determine what the baseline model should return.
Self-Assessment: Weighted Mean User-Based Filter
Build a ratings matrix from the data frame of users, books, and ratings, and build a user-based collaborative filtering model that weights the mean rating using cosine similarity among users. Fit the model on the training set, compute the RMSE for this model using the test set, and compare it to the RMSE of the baseline model. Is it better than baseline? (i.e., is the RMSE smaller?)
Self-Assessment: Weighted Mean Item-Based Filter
Create a new ratings matrix from the data frame of users, books, and ratings with the rows defined by books (i.e. items) and columns defined by users to build an item-based collaborative filtering model that weights the mean rating using cosine similarity among items. Fit the model on the training set, compute the RMSE for this model on the test set, and compare it to the RMSEs of the baseline and weighted mean user-based models. Is this one better than baseline?
Self-Assessment: kNN-Based Collaborative Filter
Use the Surprise library in Python to build a KNNBasic collaborative filtering model for the BX-Books ratings. Compute the average RMSE for this model from 5 cross-validations, using a k of 5. Do not tune the hyperparameters. Compare it to the RMSEs of the baseline, weighted mean user-based, and weighted mean item-based models previously obtained. Use a seed of 1.
Self-Assessment: KNNBasic Item-based Collaborative Filter
Surprise makes it easy to switch between user-based and item-based collaborative filtering. By default it uses user-based similarities. To switch to item-based, you need to set the sim_options dictionary key "user_based" to False.
Set up a sim_options dictionary that sets user_based to False. Use those sim_options, along with a k of 5, to instantiate your KNNBasic algorithm and run 5-fold cross-validation on all the data.
Consult the documentation for examples.
Self-Assessment: Hybrid Recommender
Another kind of hybrid recommender system is one which uses multiple recommenders, but weights the recommendations of each recommender to achieve a final set of recommendations. For this self-assessment, you're going to write a function that takes a ratings dataframe, a userid, 2 trained Surprise models, a weight for your first recommender (a decimal between 0 and 1 representing how much confidence we place in that model), and a number indicating how many recommendations to return.
Your recommender should do the following:
generate a dataframe of unique items from the ratings dataframe (there are multiple ways to accomplish this)
generate a predicted rating for each combination of the passed-in userid and item using the first Surprise model (add a column to your unique items dataframe)
generate a predicted rating for each combination of the passed-in userid and item using the second Surprise model (add another column to your unique items dataframe)
generate a final rating that multiplies the predicted rating of each model by its weight and adds them. (Use a lambda function and create a finalRating column.) Note that your weights should add up to 1, so the weight for your second recommender will be 1 minus the weight for your first recommender.
sort your unique items by the final rating and return the top N recommendations with their ratings.
To test your function:
Use userid 31315
Use an item-based KNNBasic algorithm. Do not set k (just let it use the default). Train it on the complete dataset (do not cross validate).
Use an SVD algorithm with all the default parameters. Train it on the entire dataset. (Do not cross-validate.)
Weight the KNNBasic algorithm with 0.6.
Return the top 10 book recommendations.