GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 14 - RecSys 2/resources/Hybrid Recommender.ipynb
Kernel: Python 3 (system-wide)

Hybrid Recommenders

import numpy as np
import pandas as pd

We aren't able to include the file for the next cell because it is about 1 GB, which is too large for GitHub and possibly too large for CoCalc. If you want to play with this code, you'll have to download the file from the link in Banik's textbook.
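If you would rather compute the matrix than download it, a minimal sketch is below. It assumes the similarity is cosine similarity between TF-IDF vectors of the movie plot descriptions (the kind of content similarity Banik builds in his content-based chapter) and that metadata_small.csv has an 'overview' text column; the column name is an assumption, so adjust it to match your data.

# Sketch: build cosine_sim from plot overviews instead of downloading the 1 GB file.
# Assumes an 'overview' column in metadata_small.csv -- a hypothetical column name.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

smd = pd.read_csv('./data/metadata_small.csv')
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(smd['overview'].fillna(''))

# linear_kernel on L2-normalized TF-IDF vectors is exactly cosine similarity
cosine_sim = pd.DataFrame(linear_kernel(tfidf_matrix, tfidf_matrix))
# use string column labels so the lookups used later (cosine_sim[str(int(idx))]) still work
cosine_sim.columns = cosine_sim.columns.astype(str)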

#Import or compute the cosine_sim matrix
# this file is large and slow to load!!! get it from Banik's link
cosine_sim = pd.read_csv('./data/cosine_sim.csv')
cosine_sim.head()
#Import or compute the cosine sim mapping matrix
cosine_sim_map = pd.read_csv('./data/cosine_sim_map.csv', header=None)

#Convert cosine_sim_map into a Pandas Series
cosine_sim_map = cosine_sim_map.set_index(0)
cosine_sim_map = cosine_sim_map[1]
cosine_sim_map.head()
0
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Name: 1, dtype: int64

Note: The surprise package has changed a bit since the book was published, so the code to train a model using cross-validation now looks a little different, as shown below. Also, notice that we aren't splitting the data into training and test sets; rather, we're using the whole dataset for illustration.

#Build the SVD based Collaborative filter
from surprise import SVD, Reader, Dataset

reader = Reader()
ratings = pd.read_csv('./data/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

## OLD WAY
#data.split(n_folds=5)
#svd = SVD()
#trainset = data.build_full_trainset()
#svd.train(trainset)

## NEW WAY
from surprise.model_selection import cross_validate

svd = SVD()
cross_validate(svd, data, cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE (testset)    0.9040  0.8966  0.8894  0.8999  0.8988  0.8977  0.0048
MAE (testset)     0.6977  0.6912  0.6884  0.6919  0.6910  0.6920  0.0031
Fit time          3.95    3.97    3.97    3.95    4.04    3.98    0.03
Test time         0.18    0.10    0.10    0.10    0.10    0.12    0.03
{'test_rmse': array([0.90400821, 0.89663056, 0.88938934, 0.8998797 , 0.89880681]),
 'test_mae': array([0.69772989, 0.69119595, 0.68844665, 0.69189097, 0.69095793]),
 'fit_time': (3.94643497467041, 3.9701359272003174, 3.972831964492798, 3.9532768726348877, 4.037956237792969),
 'test_time': (0.17850613594055176, 0.10392928123474121, 0.10480904579162598, 0.10410785675048828, 0.1036827564239502)}
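One caveat with the new approach: cross_validate fits the model fold by fold, so after the call svd is typically left fit on only the last fold's training data rather than on every rating. If you want predictions from a model trained on the full ratings set (the equivalent of the commented-out old code above), you can refit explicitly. A minimal sketch using the current fit() API:

# Optional: refit the SVD on all of the ratings before calling svd.predict below,
# mirroring the old build_full_trainset()/train() pattern with the newer fit() method.
trainset = data.build_full_trainset()
svd.fit(trainset)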
#Build title to ID and ID to title mappings
# this shouldn't be necessary in the homework since we aren't using multiple data sources the same way
id_map = pd.read_csv('./data/movie_ids.csv')
id_to_title = id_map.set_index('id')
title_to_id = id_map.set_index('title')
#Import or compute relevant metadata of the movies
smd = pd.read_csv('./data/metadata_small.csv')
def hybrid(userId, title):
    #Extract the cosine_sim index of the movie
    idx = cosine_sim_map[title]

    #Extract the TMDB ID of the movie
    tmdbId = title_to_id.loc[title]['id']

    #Extract the movie ID internally assigned by the dataset
    movie_id = title_to_id.loc[title]['movieId']

    #Extract the similarity scores and their corresponding index for every movie from the cosine_sim matrix
    sim_scores = list(enumerate(cosine_sim[str(int(idx))]))

    #Sort the (index, score) tuples in decreasing order of similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #Select the top 25 tuples, excluding the first
    #(as it is the similarity score of the movie with itself)
    sim_scores = sim_scores[1:26]

    #Store the cosine_sim indices of the top 25 movies in a list
    movie_indices = [i[0] for i in sim_scores]

    #Extract the metadata of the aforementioned movies
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]

    #Compute the predicted ratings using the SVD filter
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, id_to_title.loc[x]['movieId']).est)

    #Sort the movies in decreasing order of predicted rating
    movies = movies.sort_values('est', ascending=False)

    #Return the top 10 movies as recommendations
    return movies.head(10)

The results are not identical to those in the book since different folds are used in the training procedure.

hybrid(1, 'Avatar')
hybrid(2, 'Avatar')