GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 14 - RecSys 2/resources/Hybrid Recommender.ipynb
Kernel: Python 3 (system-wide)

Hybrid Recommenders

import numpy as np
import pandas as pd

We aren't able to include the file for the next cell because it is about 1 GB, which is too large for GitHub and possibly too large for CoCalc. If you want to play with this code, you'll have to download the file from the link in Banik's textbook.
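If you would rather compute the matrix than download it, a minimal sketch is below. It assumes the similarity is cosine similarity between TF-IDF vectors of the movie plot descriptions (the kind of content similarity Banik builds in his content-based chapter) and that metadata_small.csv has an 'overview' text column; the column name is an assumption, so adjust it to match your data.

# Sketch: build cosine_sim from plot overviews instead of downloading the 1 GB file.
# Assumes an 'overview' column in metadata_small.csv -- a hypothetical column name.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

smd = pd.read_csv('./data/metadata_small.csv')
tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(smd['overview'].fillna(''))

# linear_kernel on L2-normalized TF-IDF vectors is exactly cosine similarity
cosine_sim = pd.DataFrame(linear_kernel(tfidf_matrix, tfidf_matrix))
# use string column labels so the lookups used later (cosine_sim[str(int(idx))]) still work
cosine_sim.columns = cosine_sim.columns.astype(str)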

#Import or compute the cosine_sim matrix
# this file is large and slow to load!!! get it from Banik's link
cosine_sim = pd.read_csv('./data/cosine_sim.csv')
cosine_sim.head()
#Import or compute the cosine sim mapping matrix
cosine_sim_map = pd.read_csv('./data/cosine_sim_map.csv', header=None)

#Convert cosine_sim_map into a Pandas Series
cosine_sim_map = cosine_sim_map.set_index(0)
cosine_sim_map = cosine_sim_map[1]
cosine_sim_map.head()
0
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Name: 1, dtype: int64

Note: The surprise package has changed a bit since the book was published, so the code to train a model using cross-validation now looks a little different, as shown below. Also, notice that we aren't splitting the data into training and test sets; rather, we're using the whole dataset for illustration.

#Build the SVD based Collaborative filter
from surprise import SVD, Reader, Dataset

reader = Reader()
ratings = pd.read_csv('./data/ratings_small.csv')
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

## OLD WAY
#data.split(n_folds=5)
#svd = SVD()
#trainset = data.build_full_trainset()
#svd.train(trainset)

## NEW WAY
from surprise.model_selection import cross_validate

svd = SVD()
cross_validate(svd, data, cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE (testset)    0.9040  0.8966  0.8894  0.8999  0.8988  0.8977  0.0048
MAE (testset)     0.6977  0.6912  0.6884  0.6919  0.6910  0.6920  0.0031
Fit time          3.95    3.97    3.97    3.95    4.04    3.98    0.03
Test time         0.18    0.10    0.10    0.10    0.10    0.12    0.03
{'test_rmse': array([0.90400821, 0.89663056, 0.88938934, 0.8998797 , 0.89880681]),
 'test_mae': array([0.69772989, 0.69119595, 0.68844665, 0.69189097, 0.69095793]),
 'fit_time': (3.94643497467041, 3.9701359272003174, 3.972831964492798, 3.9532768726348877, 4.037956237792969),
 'test_time': (0.17850613594055176, 0.10392928123474121, 0.10480904579162598, 0.10410785675048828, 0.1036827564239502)}
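One caveat with the new approach: cross_validate fits the model fold by fold, so after the call svd is typically left fit on only the last fold's training data rather than on every rating. If you want predictions from a model trained on the full ratings set (the equivalent of the commented-out old code above), you can refit explicitly. A minimal sketch using the current fit() API:

# Optional: refit the SVD on all of the ratings before calling svd.predict below,
# mirroring the old build_full_trainset()/train() pattern with the newer fit() method.
trainset = data.build_full_trainset()
svd.fit(trainset)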
#Build title to ID and ID to title mappings
# this shouldn't be necessary in the homework since we aren't using multiple data sources the same way
id_map = pd.read_csv('./data/movie_ids.csv')
id_to_title = id_map.set_index('id')
title_to_id = id_map.set_index('title')
#Import or compute relevant metadata of the movies
smd = pd.read_csv('./data/metadata_small.csv')
def hybrid(userId, title):
    #Extract the cosine_sim index of the movie
    idx = cosine_sim_map[title]

    #Extract the TMDB ID of the movie
    tmdbId = title_to_id.loc[title]['id']

    #Extract the movie ID internally assigned by the dataset
    movie_id = title_to_id.loc[title]['movieId']

    #Extract the similarity scores and their corresponding index for every movie from the cosine_sim matrix
    sim_scores = list(enumerate(cosine_sim[str(int(idx))]))

    #Sort the (index, score) tuples in decreasing order of similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #Select the top 25 tuples, excluding the first
    #(as it is the similarity score of the movie with itself)
    sim_scores = sim_scores[1:26]

    #Store the cosine_sim indices of the top 25 movies in a list
    movie_indices = [i[0] for i in sim_scores]

    #Extract the metadata of the aforementioned movies
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]

    #Compute the predicted ratings using the SVD filter
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, id_to_title.loc[x]['movieId']).est)

    #Sort the movies in decreasing order of predicted rating
    movies = movies.sort_values('est', ascending=False)

    #Return the top 10 movies as recommendations
    return movies.head(10)

The results are not identical to those in the book since different folds are used in the training procedure.

hybrid(1, 'Avatar')
hybrid(2, 'Avatar')