GitHub Repository: probml/pyprobml
Path: blob/master/notebooks/book1/22/matrix_factorization_recommender.ipynb

Kernel: Python 3

Matrix Factorization for Movie Lens Recommendations

This notebook is based on code from Nick Becker

https://github.com/beckernick/matrix_factorization_recommenders/blob/master/matrix_factorization_recommender.ipynb

Setting Up the Ratings Data

We read the data directly from the MovieLens website, since they don't allow redistribution. We want to include the metadata (movie titles, etc.), not just the ratings matrix.

In [23]:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [24]:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!ls
!unzip ml-100k
folder = "ml-100k"

Out[24]:

--2021-04-20 14:51:30--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.1’

ml-100k.zip.1       100%[===================>]   4.70M  5.71MB/s    in 0.8s    

2021-04-20 14:51:31 (5.71 MB/s) - ‘ml-100k.zip.1’ saved [4924029/4924029]

ml-100k  ml-100k.zip  ml-100k.zip.1  ml-1m  ml-1m.zip  sample_data
Archive:  ml-100k.zip
replace ml-100k/allbut.pl? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         

In [25]:

!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m
!ls
folder = "ml-1m"

Out[25]:

--2021-04-20 14:52:53--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’

ml-1m.zip.1         100%[===================>]   5.64M  6.81MB/s    in 0.8s    

2021-04-20 14:52:54 (6.81 MB/s) - ‘ml-1m.zip.1’ saved [5917549/5917549]

Archive:  ml-1m.zip
replace ml-1m/movies.dat? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         
ml-100k  ml-100k.zip  ml-100k.zip.1  ml-1m  ml-1m.zip  ml-1m.zip.1  sample_data

In [26]:

def read_dat(path, encoding=None):
    # Each line is one "::"-separated record; the context manager closes the file.
    with open(path, "r", encoding=encoding) as f:
        return [line.strip().split("::") for line in f]


ratings_list = [[int(x) for x in row] for row in read_dat(os.path.join(folder, "ratings.dat"))]
users_list = read_dat(os.path.join(folder, "users.dat"))
movies_list = read_dat(os.path.join(folder, "movies.dat"), encoding="latin-1")

In [27]:

ratings_df = pd.DataFrame(ratings_list, columns=["UserID", "MovieID", "Rating", "Timestamp"], dtype=int)
movies_df = pd.DataFrame(movies_list, columns=["MovieID", "Title", "Genres"])
movies_df["MovieID"] = pd.to_numeric(movies_df["MovieID"])

In [28]:

movies_df.head()

Out[28]:

In [29]:

def get_movie_name(movies_df, movie_id_str):
    ndx = movies_df["MovieID"] == int(movie_id_str)
    name = movies_df["Title"][ndx].to_numpy()[0]
    return name


print(get_movie_name(movies_df, 1))
print(get_movie_name(movies_df, "527"))

Out[29]:

Toy Story (1995)
Schindler's List (1993)

In [30]:

def get_movie_genres(movies_df, movie_id_str):
    ndx = movies_df["MovieID"] == int(movie_id_str)
    genres = movies_df["Genres"][ndx].to_numpy()[0]
    return genres


print(get_movie_genres(movies_df, 1))
print(get_movie_genres(movies_df, "527"))

Out[30]:

Animation|Children's|Comedy
Drama|War

In [31]:

ratings_df.head()

Out[31]:

These look good, but I want the format of my ratings matrix to be one row per user and one column per movie. I'll pivot ratings_df to get that and call the new variable R_df. Movies a user has not rated are filled with 0.

In [32]:

R_df = ratings_df.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
R_df.head()

Out[32]:
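To make the pivot concrete, here is a minimal sketch on a made-up three-row ratings table. The names toy_ratings and toy_R and the values are hypothetical, not taken from MovieLens. It shows how a missing user/movie pair becomes NaN and is then filled with 0.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings_df.
toy_ratings = pd.DataFrame(
    {
        "UserID": [1, 1, 2],
        "MovieID": [10, 20, 10],
        "Rating": [5, 3, 4],
    }
)

# One row per user, one column per movie; unrated pairs become NaN, then 0.
toy_R = toy_ratings.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
print(toy_R)
```

Keep in mind that a 0 in this matrix means "not rated", not a zero-star rating, even though the factorization later treats it as an ordinary value.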

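The last thing I need to do is de-mean the data (subtract each user's mean from that user's row) and convert it from a dataframe to a NumPy array. Because unrated entries were filled with 0, the mean here is taken over all movies, not only the ones the user rated.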

In [33]:

R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

print(R.shape)
print(np.count_nonzero(R))

Out[33]:

(6040, 3706)
1000209
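As a rough illustration of the de-meaning step, the sketch below uses a made-up 2x4 matrix (toy_R and the other names are hypothetical). It contrasts the mean used in this notebook, which averages over zeros too, with a rated-only mean, which is an alternative that the notebook does not use.

```python
import numpy as np

# Hypothetical 2-user, 4-movie matrix; 0 marks an unrated movie.
toy_R = np.array(
    [
        [5.0, 0.0, 0.0, 3.0],
        [4.0, 4.0, 0.0, 0.0],
    ]
)

# What the notebook does: average over every column, zeros included.
mean_all = toy_R.mean(axis=1)

# An alternative (not used in the notebook): average only the rated entries.
rated = toy_R > 0
mean_rated = toy_R.sum(axis=1) / rated.sum(axis=1)

# Subtract each user's mean from that user's row via broadcasting.
toy_demeaned = toy_R - mean_all.reshape(-1, 1)

print(mean_all)    # mean over all 4 columns
print(mean_rated)  # mean over the 2 rated columns
print(toy_demeaned)
```

With a matrix as sparse as the MovieLens one, the all-entries mean is much smaller than a typical star rating, so the two choices can differ substantially.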

Singular Value Decomposition

SciPy and NumPy both have functions to do the singular value decomposition. I'm going to use the SciPy function svds because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).

In [34]:

from scipy.sparse.linalg import svds

U, sigma, Vt = svds(R_demeaned, k=50)
sigma = np.diag(sigma)
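To see what "truncating afterwards" would look like, here is a small sketch on a random matrix (A, k, and the other names are made up for illustration). It compares svds with a full np.linalg.svd that is cut down to the k largest singular values. The two should agree up to numerical tolerance.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))  # small random matrix, for illustration only
k = 2

# Truncated SVD: only k singular triplets are computed.
U_k, s_k, Vt_k = svds(A, k=k)
A_k_svds = U_k @ np.diag(s_k) @ Vt_k

# Full SVD, truncated afterwards to the k largest singular values.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
A_k_full = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]

# Both routes should give the same rank-k approximation.
print(np.allclose(A_k_svds, A_k_full, atol=1e-6))

# The singular values agree once both are sorted in descending order.
print(np.allclose(np.sort(s_k)[::-1], s_full[:k], atol=1e-6))
```

The sort is there because the order in which svds returns singular values can differ from the descending order used by np.linalg.svd. The reconstructed product is the same either way.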

In [35]:

latents = [10, 20, 50]
errors = []
for latent_dim in latents:
    U, sigma, Vt = svds(R_demeaned, k=latent_dim)
    sigma = np.diag(sigma)
    Rpred = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
    Rpred[Rpred < 0] = 0
    Rpred[Rpred > 5] = 5
    err = np.sqrt(np.sum(np.power(R - Rpred, 2)))
    errors.append(err)

print(errors)

(Output Hidden)

Making Predictions from the Decomposed Matrices

I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math: matrix-multiply $U$, $\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of the de-meaned $R$.

I also need to add the user means back to get the actual star ratings prediction.

In [36]:

all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
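
Written out, the cell above computes the following. The symbols $\hat{R}$, $\bar{r}$, $\mathbf{1}_n$, $m$, and $n$ are notation introduced here for illustration: $m$ is the number of users, $n$ the number of movies, $\bar{r} \in \mathbb{R}^{m}$ the vector of per-user means, and $\mathbf{1}_n$ a vector of $n$ ones.

$$\hat{R} = U_k \Sigma_k V_k^{T} + \bar{r}\,\mathbf{1}_n^{T}, \qquad U_k \in \mathbb{R}^{m \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k^{T} \in \mathbb{R}^{k \times n}$$

For the data loaded above, $m = 6040$, $n = 3706$, and $k = 50$. The term $\bar{r}\,\mathbf{1}_n^{T}$ corresponds to the broadcast of user_ratings_mean.reshape(-1, 1) across columns.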

Making Movie Recommendations

Finally, it's time. With the predictions matrix for every user, I can build a function to recommend movies for any user. All I need to do is return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't actually use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

I'll also return the list of movies the user has already rated, for the sake of comparison.

In [37]:

preds_df = pd.DataFrame(all_user_predicted_ratings, columns=R_df.columns)
preds_df.head()

Out[37]:

In [38]:

def recommend_movies(preds_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    # Get and sort the user's predictions.
    # UserID starts at 1 while rows start at 0; this assumes UserIDs are contiguous.
    user_row_number = userID - 1
    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)

    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.UserID == (userID)]
    user_full = user_data.merge(movies_df, how="left", left_on="MovieID", right_on="MovieID").sort_values(
        ["Rating"], ascending=False
    )

    print("User {0} has already rated {1} movies.".format(userID, user_full.shape[0]))
    print("Recommending highest {0} predicted ratings movies not already rated.".format(num_recommendations))

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (
        movies_df[~movies_df["MovieID"].isin(user_full["MovieID"])]
        .merge(pd.DataFrame(sorted_user_predictions).reset_index(), how="left", left_on="MovieID", right_on="MovieID")
        .rename(columns={user_row_number: "Predictions"})
        .sort_values("Predictions", ascending=False)
        .iloc[:num_recommendations, :-1]
    )

    return user_full, recommendations
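
The core of the function above is "sort the predictions, then drop what the user has already rated". A stripped-down sketch of just that step, on made-up scores, might look like this. The helper top_n_unrated and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical predicted scores for one user, indexed by MovieID.
toy_preds = pd.Series({10: 4.2, 20: 3.1, 30: 4.8, 40: 2.5})

# Movies this user has already rated.
toy_rated_ids = [30]


def top_n_unrated(preds, rated_ids, n=2):
    """Return the n highest-scoring entries whose MovieID is not in rated_ids."""
    unrated = preds[~preds.index.isin(rated_ids)]
    return unrated.sort_values(ascending=False).head(n)


toy_top = top_n_unrated(toy_preds, toy_rated_ids)
print(toy_top)
```

Movie 30 has the highest score but is excluded because it was already rated, so movies 10 and 20 are returned instead.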

In [39]:

already_rated, predictions = recommend_movies(preds_df, 837, movies_df, ratings_df, 10)

Out[39]:

User 837 has already rated 69 movies.
Recommending highest 10 predicted ratings movies not already rated.

So, how'd I do?

In [40]:

already_rated.head(10)

Out[40]:

In [41]:

df = already_rated[["MovieID", "Title", "Genres"]].copy()
df.head(10)

Out[41]:

In [42]:

predictions

Out[42]:

Pretty cool! These look like pretty good recommendations. It's also good to see that, though I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. I've recommended some film-noirs, crime, drama, and war movies - all of which were genres of some of this user's top rated movies.

Visualizing true and predicted ratings matrix

In [43]:

# Clip predictions to the valid range; np.clip returns a copy,
# so all_user_predicted_ratings is left unchanged.
Rpred = np.clip(all_user_predicted_ratings, 0, 5)

print(np.linalg.norm(R - Rpred, ord="fro"))

print(np.sqrt(np.sum(np.power(R - Rpred, 2))))

Out[43]:

2633.3339245908887
2633.33392459089
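
The number above is a Frobenius norm over every entry of the matrix, including the zeros that stand for unrated movies. As a hedged illustration only, the sketch below uses made-up 2x2 matrices to contrast that with an RMSE restricted to observed entries, which is an alternative metric and not what the notebook reports. The names toy_R, toy_pred, and observed are hypothetical.

```python
import numpy as np

# Hypothetical true ratings (0 = unrated) and predictions, for illustration.
toy_R = np.array([[5.0, 0.0], [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.0], [1.0, 3.0]])

# Frobenius norm over every entry, as in the cell above.
fro = np.linalg.norm(toy_R - toy_pred, ord="fro")

# RMSE over observed (non-zero) entries only; not the notebook's metric.
observed = toy_R > 0
rmse_observed = np.sqrt(np.mean((toy_R[observed] - toy_pred[observed]) ** 2))

print(fro)
print(rmse_observed)
```

The Frobenius value grows with the size of the matrix and penalizes any non-zero prediction for an unrated movie, whereas the masked RMSE is on the scale of a single rating.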

In [44]:

import matplotlib.pyplot as plt

nusers = 20
nitems = 20

plt.figure(figsize=(10, 10))
plt.imshow(R[:nusers, :nitems], cmap="jet")
plt.xlabel("item")
plt.ylabel("user")
plt.title("True ratings")
plt.colorbar()


plt.figure(figsize=(10, 10))
plt.imshow(Rpred[:nusers, :nitems], cmap="jet")
plt.xlabel("item")
plt.ylabel("user")
plt.title("Predicted ratings")
plt.colorbar()

Out[44]:

<matplotlib.colorbar.Colorbar at 0x7f80ff40d850>
