Week 14: Recommender Systems 2
Collaborative Filters
Set Up
Defining Data
In Chapter 6, Banik uses the MovieLens dataset to explore collaborative filtering. We're going to use what's called a "toy" dataset, which is just a very small dataset. This makes it easier to see what's happening at each step, though our predictions will be worse because we have much less data to go on.
With the data loaded, our job is to predict the rating, given a user and a movie. We will do this as a regression problem. In some instances, we could view this as categorical data instead of numerical data, because we have discrete values from 1 to 5. But, since this is ordinal data (the order of the numbers has meaning), we'll treat it as continuous data. We want our regressor to "understand" that a mistaken rating of 1 when it should be 5 is a bigger mistake than a rating of 4 would be. Classification problems don't understand that nuance.
Let's split the data into train and test sets. Banik uses a hack here to stratify on the user. Stratifying on the user ensures that we have some of each user's ratings in both the train and the test set.
Since we have such a small dataset, we can explore what's in our training and test data. You can see that every user is in both the training and test data, though not in equal measure.
The variables y_train and y_test won't actually be used in our code. They're just used as a way to stratify the data. Typically you'd see y as the variable you're trying to predict. That's not how we're doing it here, since our X_train and X_test data are actually dataframes that contain both what we're using to make predictions (user_id and movie_id combination) and what we're predicting (rating). (It's a bit weird. We know.)
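Here's a minimal sketch of what that split can look like, assuming the ratings live in a dataframe called `ratings` with user_id, movie_id, and rating columns (the names and the test size here are assumptions):

```python
from sklearn.model_selection import train_test_split

# X holds the full rows (including the rating we want to predict);
# y exists only so we can stratify on the user.
X = ratings.copy()
y = ratings['user_id']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
```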
RMSE Metric
Our metric for evaluation will be the Root Mean Squared Error (RMSE). Banik builds a wrapper function around scikit-learn's mean_squared_error function, but with recent versions of scikit-learn we can simply use root_mean_squared_error.
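For example, with a couple of made-up ratings:

```python
from sklearn.metrics import root_mean_squared_error

# Requires scikit-learn >= 1.4; errors of 0, 1, and 1 give sqrt(2/3) ~ 0.816.
root_mean_squared_error([3, 5, 1], [3, 4, 2])
```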
We're going to hand-code a series of models. All our models will take in a user_id and a movie_id, and attempt to predict the rating. (Generalizing this, we could say that they take in a user_id and an item_id, as movies is just one thing we could use this for.)
Let's define a baseline model. Our hand-coded baseline model always returns the MEDIAN of our ratings scale (not the median of all of our users' ratings). In other words, our baseline model is trying to be as noncommittal as possible. Later you'll see how to do a different baseline model with the Surprise package that uses a random rating based on a normal distribution.
Let's walk through how to get the median of our scale using Numpy.
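For a 1-5 scale, that looks like this:

```python
import numpy as np

# The median of the scale itself (not of the observed ratings) is 3.
np.median(np.arange(1, 6))
```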
We're going to alter Banik's function so that it also accepts optional arguments. We don't need any for this function, but later we will need additional arguments and this keeps our coding consistent.
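A minimal sketch of that altered baseline (the function name is an assumption; the *args are accepted and ignored):

```python
def baseline(user_id, movie_id, *args):
    """Ignore the inputs entirely and predict the midpoint of the scale."""
    return 3.0
```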
Next we need a way to score our model.
Here's where we diverge from Banik's approach just a bit. Instead of relying on global variables, we will explicitly pass in our data for our scoring model. Note we're again using the special parameter *args. This tells our scoring function to accept any optional arguments we might need, and we'll pass those right along to our model.
We are also going to follow the example of the Surprise package and assume that our data has 3 columns in this order:
the user id
the item id
the rating
(This means that the score method will work for any dataframe that's set up that way, regardless of the column names. It's the order that matters, not the names of the columns.)
We'll also use sklearn's built-in RMSE function.
Here's the complete function.
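A sketch of what that complete function can look like (the name `score` is an assumption):

```python
from sklearn.metrics import root_mean_squared_error

def score(cf_model, data, *args):
    """RMSE of a model over a dataframe whose first three columns are
    user id, item id, and rating, in that order."""
    # Predict a rating for every (user, item) pair, forwarding any extras.
    preds = [cf_model(row[0], row[1], *args)
             for row in data.itertuples(index=False)]
    return root_mean_squared_error(data.iloc[:, 2], preds)
```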
Basic Models
Everything we've done so far is just setting us up to be able to use something more than our baseline model to do some real user-based collaborative filtering. Now let's try out some basic approaches and compare them to our baseline model.
Before we can start, though, we need to do yet more data wrangling. We need a matrix that has movies as columns and users as rows, with each user's rating for that movie at the intersection. Note that although we know that every user has rated every movie, we don't have all that data in our training set, so we still end up with some NaN values.
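One way to build that matrix is with a pandas pivot table (the column names are assumptions):

```python
# Users as rows, movies as columns, ratings at the intersections;
# pairs missing from the training data become NaN.
r_matrix = X_train.pivot_table(values='rating', index='user_id',
                               columns='movie_id')
```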
Mean
Note that our mean function requires the ratings_matrix argument. Here's where that *args parameter comes in. We can pass r_matrix to our score function and it gets passed along to our cf_user_mean model.
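A sketch of the mean model, under the assumptions above:

```python
def cf_user_mean(user_id, movie_id, ratings_matrix):
    """Predict the mean of all observed ratings for this movie."""
    if movie_id in ratings_matrix:
        return ratings_matrix[movie_id].mean()  # pandas skips NaNs
    return 3.0  # fall back to the scale midpoint for unseen movies
```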
Let's look at what the cf_user_mean() function would return for movie 15. Movie 15 has 2 ratings: [5, 2]. You can see that it returns the average of those ratings, or 3.5.
If we use our score function to get the predicted rating for the entire matrix, we can get the RMSE.
Weighted Mean
Weighted mean is going to give more weight to the users that are more similar to each other. We'll do this using cosine similarity. Let's look at the function from the book:
What this says is that the rating for each user-item combination will be the dot product of two vectors:
the similarity scores between this user and other users
the ratings of other users
and this will be divided by the sum of the similarity scores. To calculate this value, we need a cosine similarity matrix between our users' ratings.
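Written out, the predicted rating of user $u$ for movie $m$ is

$$\hat{r}_{u,m} = \frac{\sum_{u'} \text{sim}(u, u') \, r_{u',m}}{\sum_{u'} \text{sim}(u, u')}$$

where the sums run over the other users $u'$ who have rated movie $m$.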
We covered this in the video, but here's a breakdown of how we'd calculate the rating for user 4 and movie 12 (2 in the video, but we updated the IDs). We'll fold this into our weighted mean function below, but we're pulling it out here just for clarity.
With the cosine similarity matrix in hand, we can set up the weighted mean function. This function needs 2 additional arguments: the ratings matrix and the cosine similarity matrix (c_sim_matrix).
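A sketch of that function, assuming `c_sim_matrix` is a dataframe indexed by user id on both axes:

```python
import numpy as np

def cf_user_wmean(user_id, movie_id, ratings_matrix, c_sim_matrix):
    """Similarity-weighted mean of other users' ratings for this movie."""
    if movie_id not in ratings_matrix:
        return 3.0  # scale midpoint for unseen movies

    sim_scores = c_sim_matrix[user_id]             # similarity to every user
    m_ratings = ratings_matrix[movie_id].dropna()  # users who rated the movie
    sim_scores = sim_scores[m_ratings.index]

    return np.dot(sim_scores, m_ratings) / sim_scores.sum()
```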
Model-Based Approaches
All of the above models were relatively simple and straightforward calculations, even if the code to call them was a little convoluted.
Machine learning algorithms, on the other hand, can give us a more powerful approach, with more complicated calculations. But the Surprise package makes the code to call them surprisingly simple.
We're providing some sample code below and a walkthrough video to introduce you to using the Surprise package:
Baseline: Normal Predictor
Surprise includes several baseline predictors. We'll take a look at the normal predictor, which predicts a random rating drawn from a normal distribution whose mean and standard deviation are estimated from the training data. If you look at the histogram of our ratings below, you can see that it's unlikely that our ratings follow a normal distribution. In fact, they don't: we generated them from a discrete uniform distribution, in which each of the numbers is equally likely. Given what we know about our actual ratings, we would not expect the normal predictor to be a good baseline for our data. But we'll use it anyway, just to give you a feel for it.
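A minimal sketch of cross-validating the normal predictor, assuming our ratings are in a dataframe named `ratings` (column names are assumptions; Surprise expects user, item, rating order):

```python
from surprise import Dataset, NormalPredictor, Reader
from surprise.model_selection import cross_validate

# Tell Surprise our rating scale, then load the dataframe.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

cross_validate(NormalPredictor(), data, measures=['RMSE'], cv=5, verbose=True)
```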
Calling predictions
We can use our trained model to predict ratings for any user/item combination. Remember that for this particular algorithm, even unknown users or unknown items will get individual rating estimates, since the algorithm is just returning a random number anyway.
We can generate predictions for our entire dataframe by using a lambda function on each row of data. (Note that the first row of the dataframe matches our hand-coded prediction for user 1 and movie 11 above.)
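Sketched out, with `trained_algo` standing in for whichever fitted Surprise model you're using:

```python
# One user/item pair; .est holds the predicted rating.
trained_algo.predict(uid=1, iid=11).est

# Every row of the dataframe, via a lambda (column names are assumptions).
ratings['predicted'] = ratings.apply(
    lambda row: trained_algo.predict(row['user_id'], row['movie_id']).est,
    axis=1)
```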
K Nearest Neighbors
When Banik looked at demographics, he was using explicit data to determine what makes people "similar" and assuming that if they were similar in that respect, their taste in movies would be similar as well. That might be a faulty assumption. Gender, occupation, or other simple characteristics may not have any bearing on how people rate movies. But there might be some underlying trends in the data that do result in commonalities in ratings.
K Nearest Neighbors tries to uncover these commonalities by training a model on some data and identifying clusters of users that are "near" one another.
Specifically, what this algorithm does is:
Finds the k nearest neighbors who have rated movie m
Outputs the similarity-weighted average of those neighbors' ratings for movie m
The documentation for KNNBasic goes over all the parameters you can set when you're setting up the algorithm.
Note that in this toy set, since we only have a handful of neighbors, we will need to decrease the number of neighbors (k) that the algorithm takes into consideration. Otherwise, we'll just be getting the mean of all the considered ratings in each fold.
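For example (the k value here is an assumption; the Surprise defaults are k=40 and min_k=1):

```python
from surprise import KNNBasic
from surprise.model_selection import cross_validate

algo = KNNBasic(k=3)  # consider at most 3 neighbors instead of 40
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)
```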
Interestingly, our RMSE for the K-nearest neighbors algorithm is actually worse than our normal predictor. This is most likely because we have a tiny dataset, or we're using the wrong k value for this data.
We can use grid search to identify the best k for this set of data. We'll set up this grid search to get an unbiased accuracy metric.
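A sketch of that grid search using Surprise's GridSearchCV (the grid values are assumptions):

```python
from surprise.model_selection import GridSearchCV

param_grid = {'k': [2, 3, 4, 5], 'min_k': [1, 2, 3]}
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best cross-validated RMSE
print(gs.best_params['rmse'])  # parameter combination that produced it
```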
Using the grid search, we were able to get a better RMSE. Extracting the best parameters shows us that the best combination was a max k of 5 and a min k of 3. By default, KNNBasic uses a min k of 1, so our grid search found that requiring at least 3 neighbors gave a better result than sometimes predicting from fewer than 3.
If we wanted to use this model for predictions, we'd want to retrain the model, using the best parameters, on all of our available data. We can do that by setting the data.raw_ratings back to the complete raw ratings, setting up a full trainset, instantiating our model with the best params, and fitting it on the full trainset.
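A sketch of that retraining step, assuming `all_raw_ratings` is a copy of data.raw_ratings saved before any splitting:

```python
# Restore every rating, build a trainset from all of it, and refit.
data.raw_ratings = all_raw_ratings
full_trainset = data.build_full_trainset()

knn_gs_algo = KNNBasic(**gs.best_params['rmse'])
knn_gs_algo.fit(full_trainset)
```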
Singular Value Decomposition (SVD)
The theory behind SVD is covered in Banik's book. The very high-level concept is that it's a method that allows you to reduce the dimensions of a sparse matrix and "fill in the blanks" with predictions. Under the hood, the algorithm uses stochastic gradient descent to attempt to minimize errors. We don't expect you to understand all the intricacies. We would like you to understand a couple of the hyperparameters you can tune, which are some of the same hyperparameters in every stochastic gradient descent algorithm.
n_epochs: this is the number of times the minimization steps are performed. The higher the number of times, the longer the algorithm will work to find the minimum error. The default is 20.
learning rate (lr_all in Surprise): this is a number that determines how much to change the model each iteration. Think of it as how big of a step the model takes in each iteration. Too large and you may never find your minimum. Too small and your model will be very slow and could get stuck in a local minimum. The default is 0.005.
regularization (reg_all in Surprise): this is a penalty term applied to prevent model overfitting. The default is 0.02.
We won't demonstrate using these hyperparameters in the lesson, but you'll need them for the homework. Functionally, you'd use them the same way we used k and min_k with the KNNBasic algorithm.
The code to do simple cross-validation with SVD is extremely short, once you've already got a Surprise data object set up. Read the full documentation if you're curious.
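A minimal sketch, reusing the `data` object from earlier:

```python
from surprise import SVD
from surprise.model_selection import cross_validate

# Defaults throughout; the hyperparameters above would be passed as keyword
# arguments, e.g. SVD(n_epochs=20, lr_all=0.005, reg_all=0.02).
cross_validate(SVD(), data, measures=['RMSE'], cv=5, verbose=True)
```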
Hybrid Recommenders
Hybrid recommenders are probably the most common recommenders you'll see in the real world, and there are many approaches to building hybrid recommenders. Banik specifically walks through one type of recommender that combines a content-based recommender with a collaborative filter. This is a relatively simplistic hybrid recommender, but it's a good place to start, since it combines the strengths of two different recommender types.
As you might guess, we have some suggested updates to his code. All that work you did last week to set up functions is really going to help you this week. Let's pull in the functions we used last week.
The first thing we need to do is to decide which data to use for our content recommender. Since we're dealing with a toy dataset anyway, let's keep this simple and just use the genres as our soup. All we'll need to do to create our soup columns is to join the genres into a string. We'll use a simple count vectorizer with 'english' stopwords and no restriction on the max number of features.
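A sketch of that content setup, assuming a `movies` dataframe whose genres column holds lists of genre strings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Join each movie's genres into one space-separated "soup" string.
movies['soup'] = movies['genres'].apply(' '.join)

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['soup'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)
```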
With our similarity matrix in hand, we can fetch our most similar movies. Let's fetch the 3 most similar movies to Jumanji. Since we have such a small dataset, that means we're only eliminating one. If you look at the dataframe above, can you predict which one will be eliminated?
We already have a variety of trained algorithms we could use to predict how a certain user would rate these movies. Let's add a prediction for user 1 to our results dataframe using our trained knn_gs_algo, and sort in descending order by those predictions.
Given this data, we'd want to recommend 'Balto' or 'The Wizard of Oz.' Would we want to recommend 'Treasure Island?' It's similar to 'Jumanji,' but our recommender estimates we wouldn't rate it very highly.
Putting it together
Let's wrap up what we did above in a nice hybrid function. There are multiple ways we could do this; we'll show you one of them. In this approach, we're doing our content recommender outside our hybrid function and passing in the resulting recommendations. In the homework, we'll ask you to pass in all the parameters needed to run your content recommender inside your hybrid function.
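One sketch of such a function (the names and column layout are assumptions):

```python
def hybrid(similar_movies, user_id, algo, n=3):
    """Rank content-based candidates by a trained CF model's predicted
    rating for this user, highest first."""
    results = similar_movies.copy()
    results['est_rating'] = results['movie_id'].apply(
        lambda movie_id: algo.predict(user_id, movie_id).est)
    return results.sort_values('est_rating', ascending=False).head(n)
```

Calling it with the similar movies we fetched above and knn_gs_algo should reproduce the ranked dataframe we built by hand.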
Self Assessment
Follow the examples and use the code files provided from chapters 5-7 in Hands-On Recommendation Systems with Python by Rounak Banik to do the following self-assessment exercises.
The self-assessments in this lesson will be using a subset of data from the Book-Crossing dataset. See the Book-Crossing dataset page for more details.
User-Based Collaborative Filter
Self-Assessment: Setting up the File
The file BX-Book-Ratings-3000.csv (found in the presentation download for this lesson) is loaded here for you, though you may need to change the file path unless you create the same folder structure. Note that book ratings have been adjusted so the scale goes from 1 to 11.
Run the cell below to load the file, and then do the following:
display the first 5 lines of the data (get familiar with the data frame)
calculate the mean book rating for all books (just to get an idea)
split the data set so that 70% of each user's ratings are in the training set and 30% are in the testing set
Self-Assessment: Baseline RMSE to Assess Model Performance
Build a baseline model that assigns a neutral rating and compute the RMSE of these simple "predictions" using the testing set. Make sure this model accepts *args so that it aligns with more complicated models.
A neutral rating would occur at the midpoint of the rating scale. Calculate the median of the rating scale to determine what the baseline model should return.
Self-Assessment: Weighted Mean User-Based Filter
Build a ratings matrix from the data frame of users, books, and ratings, and build a user-based collaborative filtering model that weights the mean rating using cosine similarity among users. Fit the model on the training set, compute the RMSE for this model using the test set, and compare it to the RMSE of the baseline model. Is it better than baseline? (i.e., is the RMSE smaller?)
Self-Assessment: Weighted Mean Item-Based Filter
Create a new ratings matrix from the data frame of users, books, and ratings with the rows defined by books (i.e. items) and columns defined by users to build an item-based collaborative filtering model that weights the mean rating using cosine similarity among items. Fit the model on the training set, compute the RMSE for this model on the test set, and compare it to the RMSEs of the baseline and weighted mean user-based models. Is this one better than baseline?
Self-Assessment: kNN-Based Collaborative Filter
Use the Surprise library in Python to build a KNNBasic collaborative filtering model for the BX-Books ratings. Compute the average RMSE for this model from 5 cross-validations, using a k of 5. Do not tune the hyperparameters. Compare it to the RMSEs of the baseline, weighted mean user-based, and weighted mean item-based models previously obtained. Use a seed of 1.
Self-Assessment: KNNBasic Item-based Collaborative Filter
Surprise makes it easy to switch between user-based and item-based collaborative filtering. By default it uses user-based similarities. To switch to item-based, you need to set the sim_options dictionary key "user_based" to False.
Set up a sim_options dictionary that sets user_based to False. Use those sim_options, along with a k of 5, to instantiate your KNNBasic algorithm and run 5-fold cross-validation on all the data.
Consult the documentation for examples.
Self-Assessment: Hybrid Recommender
Another kind of hybrid recommender system is one which uses multiple recommenders, but weights the recommendations of each recommender to achieve a final set of recommendations. For this self-assessment, you're going to write a function that takes a ratings dataframe, a userid, 2 trained Surprise models, a weight for your first recommender (a decimal between 0 and 1 representing how much confidence we place in that model), and a number indicating how many recommendations to return.
Your recommender should do the following:
generate a dataframe of unique items from the ratings dataframe (there are multiple ways to accomplish this)
generate a predicted rating for each combination of the passed-in userid and item using the first Surprise model (add a column to your unique items dataframe)
generate a predicted rating for each combination of the passed-in userid and item using the second Surprise model (add another column to your unique items dataframe)
generate a final rating that multiplies the predicted rating of each model by its weight and adds them. (Use a lambda function and create a finalRating column.) Note that your weights should add up to 1, so the weight for your second recommender will be 1 minus the weight for your first recommender.
sort your unique items by the final rating and return the top N recommendations with their ratings.
To test your function:
Use userid 31315
Use an item-based KNNBasic algorithm. Do not set k (just let it use the default). Train it on the complete dataset (do not cross validate).
Use an SVD algorithm with all the default parameters. Train it on the entire dataset. (Do not cross-validate.)
Weight the KNNBasic algorithm with 0.6.
Return the top 10 book recommendations.