GitHub Repository: probml/pyprobml
Path: blob/master/notebooks/book1/22/matrix_factorization_recommender.ipynb
Kernel: Python 3

Matrix Factorization for MovieLens Recommendations

This notebook is based on code from Nick Becker

https://github.com/beckernick/matrix_factorization_recommenders/blob/master/matrix_factorization_recommender.ipynb

Setting Up the Ratings Data

We read the data directly from the MovieLens website, since they don't allow redistribution. We want to include the metadata (movie titles, etc.), not just the ratings matrix.

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!ls
!unzip ml-100k
folder = "ml-100k"
--2021-04-20 14:51:30-- http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.1’

ml-100k.zip.1 100%[===================>] 4.70M 5.71MB/s in 0.8s

2021-04-20 14:51:31 (5.71 MB/s) - ‘ml-100k.zip.1’ saved [4924029/4924029]

ml-100k ml-100k.zip ml-100k.zip.1 ml-1m ml-1m.zip sample_data
Archive: ml-100k.zip
replace ml-100k/allbut.pl? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
inflating: ml-100k/allbut.pl
inflating: ml-100k/mku.sh
inflating: ml-100k/README
inflating: ml-100k/u.data
inflating: ml-100k/u.genre
inflating: ml-100k/u.info
inflating: ml-100k/u.item
inflating: ml-100k/u.occupation
inflating: ml-100k/u.user
inflating: ml-100k/u1.base
inflating: ml-100k/u1.test
inflating: ml-100k/u2.base
inflating: ml-100k/u2.test
inflating: ml-100k/u3.base
inflating: ml-100k/u3.test
inflating: ml-100k/u4.base
inflating: ml-100k/u4.test
inflating: ml-100k/u5.base
inflating: ml-100k/u5.test
inflating: ml-100k/ua.base
inflating: ml-100k/ua.test
inflating: ml-100k/ub.base
inflating: ml-100k/ub.test
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m
!ls
folder = "ml-1m"
--2021-04-20 14:52:53-- http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’

ml-1m.zip.1 100%[===================>] 5.64M 6.81MB/s in 0.8s

2021-04-20 14:52:54 (6.81 MB/s) - ‘ml-1m.zip.1’ saved [5917549/5917549]

Archive: ml-1m.zip
replace ml-1m/movies.dat? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
inflating: ml-1m/movies.dat
inflating: ml-1m/ratings.dat
inflating: ml-1m/README
inflating: ml-1m/users.dat
ml-100k ml-100k.zip ml-100k.zip.1 ml-1m ml-1m.zip ml-1m.zip.1 sample_data
ratings_list = [
    [int(x) for x in i.strip().split("::")]
    for i in open(os.path.join(folder, "ratings.dat"), "r").readlines()
]
users_list = [i.strip().split("::") for i in open(os.path.join(folder, "users.dat"), "r").readlines()]
movies_list = [
    i.strip().split("::")
    for i in open(os.path.join(folder, "movies.dat"), "r", encoding="latin-1").readlines()
]

ratings_df = pd.DataFrame(ratings_list, columns=["UserID", "MovieID", "Rating", "Timestamp"], dtype=int)
movies_df = pd.DataFrame(movies_list, columns=["MovieID", "Title", "Genres"])
movies_df["MovieID"] = movies_df["MovieID"].apply(pd.to_numeric)
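As an aside, the same "::"-delimited layout can also be parsed by pandas directly. The following is only a sketch of that alternative, not what this notebook does: it reads a two-line in-memory sample (toy values, with the hypothetical name `toy_ratings_df`) instead of the real `ratings.dat`.

```python
import io

import pandas as pd

# Two lines in the same "::"-delimited layout as ml-1m/ratings.dat (toy values).
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

# A multi-character separator requires the slower Python parsing engine.
toy_ratings_df = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)
print(toy_ratings_df)
```

For `movies.dat`, the same call would presumably also need `encoding="latin-1"`, matching the `open` call above.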
movies_df.head()
def get_movie_name(movies_df, movie_id_str):
    ndx = movies_df["MovieID"] == int(movie_id_str)
    name = movies_df["Title"][ndx].to_numpy()[0]
    return name


print(get_movie_name(movies_df, 1))
print(get_movie_name(movies_df, "527"))
Toy Story (1995)
Schindler's List (1993)
def get_movie_genres(movies_df, movie_id_str):
    ndx = movies_df["MovieID"] == int(movie_id_str)
    genres = movies_df["Genres"][ndx].to_numpy()[0]
    return genres


print(get_movie_genres(movies_df, 1))
print(get_movie_genres(movies_df, "527"))
Animation|Children's|Comedy
Drama|War
ratings_df.head()

These look good, but I want the format of my ratings matrix to be one row per user and one column per movie. I'll pivot ratings_df to get that and call the new variable R_df. Movies a user has not rated are filled in with 0.

R_df = ratings_df.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
R_df.head()
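To make the pivot concrete, here is a minimal sketch on a made-up four-row ratings table (the names `toy` and `toy_R_df` and all values are illustrative, not taken from MovieLens). Each (user, movie) pair becomes one cell, and missing pairs become 0.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame(
    {
        "UserID": [1, 1, 2, 3],
        "MovieID": [10, 20, 10, 30],
        "Rating": [5, 3, 4, 2],
    }
)

# One row per user, one column per movie; unrated cells are NaN until filled with 0.
toy_R_df = toy.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
print(toy_R_df)
```

Note that `pivot` raises an error if the same (UserID, MovieID) pair appears twice, so this approach assumes each user rates a movie at most once.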

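The last thing I need to do is de-mean the data (subtract each user's mean rating from that user's row) and convert it from a dataframe to a numpy array. Note that the mean is taken over the whole row, so the zeros that stand in for unrated movies are included in it.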

R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
print(R.shape)
print(np.count_nonzero(R))
(6040, 3706)
1000209
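A small sketch of what the de-meaning step does, on a made-up 2x4 matrix. The first part mirrors the computation above. The second part, a mean over rated entries only, is a hypothetical variant shown purely for contrast and is not used in this notebook.

```python
import numpy as np

# Made-up ratings; 0 marks an unrated movie, as in R above.
toy_R = np.array([[5.0, 0.0, 3.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0]])

# Same computation as above: the mean runs over every column, zeros included.
mean_all = toy_R.mean(axis=1)
toy_demeaned = toy_R - mean_all.reshape(-1, 1)

# Hypothetical variant (not used here): average over rated entries only.
rated = toy_R > 0
mean_rated = toy_R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

print(mean_all)    # row means with zeros counted
print(mean_rated)  # row means over rated movies only
```

Because most entries of the real matrix are zero, the row means computed with zeros included will generally be much smaller than a user's typical star rating.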

Singular Value Decomposition

SciPy and NumPy both have functions to do the singular value decomposition. I'm going to use the SciPy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).

from scipy.sparse.linalg import svds

U, sigma, Vt = svds(R_demeaned, k=50)
sigma = np.diag(sigma)
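The difference between the two routes can be sketched on a small random matrix (made-up data; the names `A`, `A_k_svds`, and `A_k_full` are illustrative). `svds` computes only the k largest singular triplets, while `np.linalg.svd` computes all of them and leaves the truncation to the caller. Both should give the same rank-k reconstruction up to numerical tolerance.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))
k = 2

# Truncated SVD: only the k largest singular triplets are computed.
U_k, s_k, Vt_k = svds(A, k=k)
A_k_svds = U_k @ np.diag(s_k) @ Vt_k

# Full SVD followed by manual truncation to the top k components.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
A_k_full = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]

print(np.allclose(A_k_svds, A_k_full, atol=1e-6))
```

The comparison is made on the reconstructed matrices rather than on the factors themselves, since the ordering and signs of the singular vectors returned by the two functions need not agree.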
latents = [10, 20, 50]
errors = []
for latent_dim in latents:
    U, sigma, Vt = svds(R_demeaned, k=latent_dim)
    sigma = np.diag(sigma)
    Rpred = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    Rpred[Rpred < 0] = 0
    Rpred[Rpred > 5] = 5
    err = np.sqrt(np.sum(np.power(R - Rpred, 2)))
    errors.append(err)
print(errors)
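For a plain rank-k truncation, the Frobenius reconstruction error equals the square root of the sum of the squared discarded singular values, so it can only shrink as k grows. The sketch below checks this on a small random matrix (made-up data). The loop above also adds the user means back and clips to [0, 5], so its errors need not match this identity exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 7))
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

errs = []
tails = []
for k in [1, 3, 5]:
    # Rank-k approximation built from the top k singular triplets.
    A_k = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]
    errs.append(np.linalg.norm(A - A_k, ord="fro"))
    # Energy in the discarded singular values.
    tails.append(np.sqrt(np.sum(s_full[k:] ** 2)))

print(errs)
print(tails)
```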

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and multiplying $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $R$.

I also need to add the user means back to get the actual star ratings prediction.

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
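Written out, the line above computes the following, where $\mu$ is a symbol introduced here for the vector of per-user means (user_ratings_mean) and $\mathbf{1}$ is a vector of ones with one entry per movie:

```latex
\hat{R} = U_k \Sigma_k V_k^{T} + \mu \mathbf{1}^{T},
\qquad
\hat{R}_{ui} = \sum_{f=1}^{k} U_{uf}\, \sigma_f\, V_{if} + \mu_u
```

With the shapes printed earlier, $U_k$ is $6040 \times 50$, $\Sigma_k$ is $50 \times 50$, and $V_k^{T}$ is $50 \times 3706$, so $\hat{R}$ has the same shape as $R$.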

Making Movie Recommendations

Finally, it's time. With the predictions matrix for every user, I can build a function to recommend movies for any user. All I need to do is return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't actually use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

I'll also return the list of movies the user has already rated, for the sake of comparison.

preds_df = pd.DataFrame(all_user_predicted_ratings, columns=R_df.columns)
preds_df.head()

def recommend_movies(preds_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    # Get and sort the user's predictions
    user_row_number = userID - 1  # UserID starts at 1, not 0
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)  # UserID starts at 1

    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.UserID == (userID)]
    user_full = user_data.merge(movies_df, how="left", left_on="MovieID", right_on="MovieID").sort_values(
        ["Rating"], ascending=False
    )

    print("User {0} has already rated {1} movies.".format(userID, user_full.shape[0]))
    print("Recommending highest {0} predicted ratings movies not already rated.".format(num_recommendations))

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (
        movies_df[~movies_df["MovieID"].isin(user_full["MovieID"])]
        .merge(pd.DataFrame(sorted_user_predictions).reset_index(), how="left", left_on="MovieID", right_on="MovieID")
        .rename(columns={user_row_number: "Predictions"})
        .sort_values("Predictions", ascending=False)
        .iloc[:num_recommendations, :-1]
    )

    return user_full, recommendations

already_rated, predictions = recommend_movies(preds_df, 837, movies_df, ratings_df, 10)
User 837 has already rated 69 movies.
Recommending highest 10 predicted ratings movies not already rated.
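The selection step at the heart of recommend_movies (drop what the user has rated, then take the top N by predicted rating) can be sketched without any dataframes. The arrays and the names `toy_pred`, `rated_mask`, and `top_idx` below are made up for illustration.

```python
import numpy as np

# Made-up predicted ratings for one user over five movies.
toy_pred = np.array([4.2, 3.1, 4.8, 2.0, 4.5])
# True where the user has already rated the movie.
rated_mask = np.array([True, False, True, False, False])

# Exclude rated movies by pushing their scores to -inf, then take the top N.
scores = np.where(rated_mask, -np.inf, toy_pred)
top_n = 2
top_idx = np.argsort(-scores)[:top_n]
print(top_idx)  # column positions of the recommended movies
```

The dataframe version above does the same thing, but filters by MovieID and merges in titles and genres so that the result is readable.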

So, how'd I do?

already_rated.head(10)
df = already_rated[["MovieID", "Title", "Genres"]].copy()
df.head(10)
predictions

Pretty cool! These look like pretty good recommendations. It's also good to see that, though I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. I've recommended some film-noirs, crime, drama, and war movies - all of which were genres of some of this user's top rated movies.

Visualizing true and predicted ratings matrix

Rpred = all_user_predicted_ratings
Rpred[Rpred < 0] = 0
Rpred[Rpred > 5] = 5
print(np.linalg.norm(R - Rpred, ord="fro"))
print(np.sqrt(np.sum(np.power(R - Rpred, 2))))
2633.3339245908887
2633.33392459089
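The Frobenius norm above sums squared errors over every cell of R, including the zeros that stand in for unrated movies. A complementary view, sketched here on made-up 2x3 matrices and not part of the original analysis, is the root-mean-square error restricted to cells that actually hold a rating.

```python
import numpy as np

# Made-up true ratings (0 = unrated) and made-up predictions.
toy_R = np.array([[5.0, 0.0, 3.0],
                  [0.0, 4.0, 0.0]])
toy_pred = np.array([[4.0, 1.0, 3.0],
                     [2.0, 4.0, 0.5]])

# Error over all cells, as computed above.
frob_all = np.linalg.norm(toy_R - toy_pred, ord="fro")

# Error over observed ratings only.
observed = toy_R > 0
rmse_observed = np.sqrt(np.mean((toy_R[observed] - toy_pred[observed]) ** 2))

print(frob_all, rmse_observed)
```

In this toy case most of the Frobenius error comes from the unrated cells, which is worth keeping in mind when reading the number printed above.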
import matplotlib.pyplot as plt

nusers = 20
nitems = 20

plt.figure(figsize=(10, 10))
plt.imshow(R[:nusers, :nitems], cmap="jet")
plt.xlabel("item")
plt.ylabel("user")
plt.title("True ratings")
plt.colorbar()

plt.figure(figsize=(10, 10))
plt.imshow(Rpred[:nusers, :nitems], cmap="jet")
plt.xlabel("item")
plt.ylabel("user")
plt.title("Predicted ratings")
plt.colorbar()

[Figures: heatmaps of the true ratings and the predicted ratings for the first 20 users and 20 items; x-axis: item, y-axis: user.]