GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Homework/Lesson 14 HW - RecSys 2/Homework_14.ipynb

Kernel: Python 3 (system-wide)

In [1]:

# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
pd.set_option('display.html.use_mathjax', False)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, KNNBasic, NormalPredictor, BaselineOnly, KNNWithMeans, KNNBaseline
from surprise import SVD, SVDpp, NMF, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import accuracy

import random
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
plt.style.use('ggplot')

# display imports
from IPython.display import display, IFrame
from IPython.display import HTML

Lesson 14 Homework: Recommender Systems 2

When asking questions about homework in Piazza please use a tag in the subject line like HW14.7 to refer to Homework 14, Question 7. So the subject line might be HW14.7 question. Note there are no spaces in "HW14.7". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

Question 1 (2 points)

Which of the following recommenders are based on the user/item ratings? (Check all that apply.)

SVD item-based collaborative filter
KNN user-based collaborative filter
Content recommender
Knowledge-based recommender
Chart

Question 2 (2 points)

Which Surprise algorithm reduces the size of the problem space through matrix factorization?

NormalPredictor
KNNBasic
KNNWithMeans
BaselineOnly
SVD
KNNWithZScores

Data Exploration

(Note: This section is not included in the quiz and is ungraded.)

The file restaurant_ratings.csv (found in the presentation download for this lesson) contains user ratings for various New York City restaurants. You can read a little more about the data at Kaggle. We have modified the data to generate user ratings that match the star columns in this file.

Do the following:

read the data into a variable called "ratings"
display the first 5 lines of the data (get familiar with the data frame)
find the minimum restaurant rating
find the maximum restaurant rating
adjust the rating scale by shifting up 1 if 0 is included
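
The steps above can be sketched on a small stand-in data frame. This is only an illustration: the toy values and the column names `placeID` and `rating` are assumptions, and in the notebook the frame would come from `pd.read_csv('restaurant_ratings.csv')` instead.

```python
import pandas as pd

# Toy stand-in for the real file; in the notebook you would instead use
# ratings = pd.read_csv('restaurant_ratings.csv')
ratings = pd.DataFrame({
    'userID':  [1, 1, 2, 2, 3],
    'placeID': [10, 11, 10, 12, 11],   # hypothetical item column name
    'rating':  [0, 4, 2, 3, 1],        # hypothetical rating column name
})

# Get familiar with the data frame
print(ratings.head())

# Inspect the ends of the rating scale
min_rating = ratings['rating'].min()
max_rating = ratings['rating'].max()

# Shift the scale up by 1 only when 0 is part of it
if min_rating == 0:
    ratings['rating'] = ratings['rating'] + 1

new_min, new_max = ratings['rating'].min(), ratings['rating'].max()
print(new_min, new_max)
```

Checking the minimum before shifting keeps the cell safe to re-run on data that already starts at 1.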

In [1]:

#Add your code here

Question 3 (2 points)

What is the minimum restaurant rating?

In [2]:

#Add your code here

Question 4 (2 points)

What is the maximum restaurant rating?

In [3]:

#Add your code here

Question 5 (2 points)

What is the mean restaurant rating for all restaurants (rounded to 2 significant digits)?

In [4]:

#Add your code here

Question 6 (2 points)

What is the median of the restaurant rating scale?

In [5]:

#Add your code here

Train/Test Split and Score Setup

(Note: this section is not included in the quiz and is not graded.)

We've provided code below for a scoring function and for splitting the data into train and test sets. Use the train and test sets generated by this code to answer the next questions. Do not change this code, or your answers will not match the expected ones.

In [7]:

#This section not included in quiz/solutions.

#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model, X_test, *args):
    
    #Construct a list of user-item tuples from the testing dataset
    id_pairs = zip(X_test[X_test.columns[0]], X_test[X_test.columns[1]])
    
    #Predict the rating for every user-item tuple
    y_pred = np.array([cf_model(user, item, *args) for (user, item) in id_pairs])
    
    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test[X_test.columns[2]])
    
    #Return the final RMSE score (np.sqrt also works on scikit-learn releases that no longer accept squared=False)
    return np.sqrt(mean_squared_error(y_true, y_pred))

#Assign X as the original ratings dataframe and y as the userID column of ratings.
X = ratings.copy()
y = ratings['userID']

#Split into training and test datasets, stratified along userID
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.20, random_state=14)

Question 7 (2 points)

Compute a baseline model that always returns the median of the rating scale (rounded to 2 significant digits). What is the RMSE on this model?
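
A baseline of this kind might look like the sketch below. It assumes a 1 to 5 rating scale and uses a toy test frame with hypothetical `placeID` and `rating` columns; the RMSE is computed by hand here so the sketch stands alone, whereas the notebook would pass the baseline to the provided `score` function.

```python
import numpy as np
import pandas as pd

# Assumed rating scale of 1..5 (check the real data's min and max first)
scale_median = float(np.median(np.arange(1, 6)))

def baseline(user, item):
    """Ignore the inputs and always predict the median of the scale."""
    return scale_median

# Toy test set with the column order the score function expects:
# user, item, rating
X_toy = pd.DataFrame({
    'userID':  [1, 1, 2, 3],
    'placeID': [10, 11, 10, 12],   # hypothetical item column name
    'rating':  [5, 3, 1, 3],       # hypothetical rating column name
})

y_true = X_toy['rating'].to_numpy()
y_pred = np.array([baseline(u, i)
                   for u, i in zip(X_toy['userID'], X_toy['placeID'])])

# Root mean squared error of the constant prediction
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```

With the provided helper, the equivalent call would be `score(baseline, X_test)`.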

In [6]:

#Add your code here

Question 8 Build a Weighted Mean User-Based Filter (manually graded) (4 points)

From data in the file restaurant_ratings.csv, build a ratings matrix from the data frame of users, restaurants, and ratings, and build a user-based collaborative filtering model that computes a mean rating weighted by the cosine similarity among users.
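
As a rough sketch of the idea (not the lesson's exact code), a similarity-weighted mean can be built from a pivoted ratings matrix. The toy data, the column names `placeID` and `rating`, the default of 3.0, and the choice to exclude the target user's own rating are all assumptions. Cosine similarity is computed with NumPy to keep the sketch dependency-light.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'userID':  [1, 1, 2, 2, 3, 3],
    'placeID': [10, 11, 10, 12, 11, 12],   # hypothetical item column name
    'rating':  [5, 3, 4, 2, 3, 1],         # hypothetical rating column name
})

# Ratings matrix: one row per user, one column per restaurant, NaN if unrated
r_matrix = toy.pivot_table(values='rating', index='userID', columns='placeID')

# Cosine similarity between user rows, treating missing ratings as 0
filled = r_matrix.fillna(0).to_numpy()
norms = np.linalg.norm(filled, axis=1, keepdims=True)
cos_sim = pd.DataFrame((filled @ filled.T) / (norms * norms.T),
                       index=r_matrix.index, columns=r_matrix.index)

def cf_user_wmean(user, item, default=3.0):
    """Similarity-weighted mean of other users' ratings for `item`."""
    if item not in r_matrix.columns or user not in cos_sim.index:
        return default                       # unseen user or item
    item_ratings = r_matrix[item].dropna()   # users who rated the item
    item_ratings = item_ratings.drop(user, errors='ignore')
    weights = cos_sim.loc[user, item_ratings.index]
    if weights.sum() == 0:
        return default                       # no similar raters to lean on
    return float(np.dot(weights.to_numpy(), item_ratings.to_numpy())
                 / weights.sum())

pred = cf_user_wmean(1, 12)
print(pred)
```

In the notebook the matrix would be built from `X_train` only, so that `score(cf_user_wmean, X_test)` measures performance on unseen ratings.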

In [9]:

# Add your code here

Question 9 (2 points)

What is the RMSE (rounded to 2 significant digits) of the Weighted Mean algorithm?

In [7]:

#Add your code here

Question 10 User-Based SVD - Hyperparameter tuning (Manually Graded) (4 points)

From data in the file restaurant_ratings.csv, use the surprise library in Python to build an SVD user-based collaborative filtering model for the restaurant ratings. Use gridsearch to tune the hyperparameters, reserving 15% of the data to get an unbiased estimate of the accuracy. For the grid, use the following options:

'n_epochs': [15, 20, 25] (The number of iterations of the Stochastic Gradient Descent minimization procedure.)
'lr_all': [.005, .025, .001] (The learning rate.)
'reg_all': [.01, .02, .05] (The penalty for complex models.)

Additionally, use the following:

3 folds for cross validation
a seed of 14

Use the example from the lesson and be sure to set the seed in the appropriate place. Note: this code will take several minutes to run.

In [8]:

#Add your code here

Question 11 (2 points)

What is the biased accuracy (rounded to 2 significant digits) of the algorithm?

In [9]:

# Add your code here

Question 12 (2 points)

What is the unbiased accuracy (rounded to 2 significant digits) of the algorithm?

In [10]:

#Add your code here

Question 13 (2 points)

What number of stochastic gradient descent iterations ('n_epochs') did the grid search choose?

In [11]:

#Add your code here

Question 14 (2 points)

What is the learning rate ('lr_all') chosen by the grid search?

In [12]:

#Add your code here

Question 15 (2 points)

What is the regularization ('reg_all') chosen by the grid search?

In [13]:

#Add your code here

Question 16 (2 points)

Now that we know what our best parameters should be, we need to train our SVD model on all the available data. Do the following:

set the seeds for reproducibility
reset the data.raw_ratings to all of the ratings OR reload the data from the dataframe
use the build_full_trainset() method to build a full trainset
set up an SVD algorithm using the best parameters
fit the data to the trainset
predict the estimated rating for user 1061 and restaurant 347

What is the predicted estimated rating (rounded to 2 digits) for user 1061 and restaurant 347?

In [14]:

#Add your code here

Hybrid Filter Setup

(Note: This section is not included in the quiz/solutions.)

From data in the files restaurant_ratings.csv and restaurants.csv build a recommender system that is a hybrid of a metadata content-based recommender and the SVD user-based collaborative filter that you just trained.

To set up your hybrid filter:

read in the restaurants.csv into a variable called rest
review the data in the dataframe (Note that we have pre-cleaned the data for you, including using TextBlob to extract just the relevant descriptors from the description. Not all restaurants have a description.)
make a soup from the following columns, which are all simple strings (Hint: the soup for the first item in the rest dataframe should be: 'Contemporary American Average_price rustic airy adorable classic most distinguished uncommon innovative American proud only world-class week.IMPORTANT special welcome'):
- restaurant_type
- price_range
- ambiance
- descriptors
Instantiate a CountVectorizer with no stopwords (use stop_words = None). (We shouldn't have much in the way of stopwords, since it's all keywords.)
Use the provided fetchSimilarityMatrix function to get a CountVectorizer similarity matrix using the soup column. (Hint: the similarity at [0,2] should be 0.2849014411490949.)
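
The soup and similarity steps above could be sketched as follows on a toy frame. The restaurant rows and the `name` column are made up for illustration, and `cosine_similarity` is called directly here so the sketch is self-contained; the notebook would pass the vectorizer to the provided function with `vectorType='Count'`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for restaurants.csv (values are illustrative only)
rest_toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],   # hypothetical name column
    'restaurant_type': ['Italian', 'Italian', 'Japanese'],
    'price_range': ['Average_price', 'High_price', 'Average_price'],
    'ambiance': ['rustic cozy', 'elegant', 'modern'],
    'descriptors': ['fresh classic', None, 'fresh innovative'],
})

# Join the keyword columns into one space-separated string per restaurant,
# treating missing values as empty strings
cols = ['restaurant_type', 'price_range', 'ambiance', 'descriptors']
rest_toy['soup'] = rest_toy[cols].fillna('').agg(' '.join, axis=1).str.strip()

# Count word occurrences without removing any stop words
count = CountVectorizer(stop_words=None)
count_matrix = count.fit_transform(rest_toy['soup'])

# Pairwise cosine similarity between the count vectors
sim = cosine_similarity(count_matrix, count_matrix)
print(sim.round(3))
```

Restaurants with no shared keywords end up with a similarity of 0, which is why a missing description lowers a restaurant's scores without breaking the computation.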

In [18]:

# Not Included in Quiz/Solutions
def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
    '''
    Parameters
    df: the dataframe containing a soup column to transform
    soupCol: the string title of the soup column
    vectorizer: an initialized vectorizer, with all pre-processing you desire
    vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.

    Returns
    Dense similarity matrix (NumPy array)
    '''

    # make sure the soup has no NaN (note: this modifies df in place)
    df[soupCol] = df[soupCol].fillna('')
    nmatrix = vectorizer.fit_transform(df[soupCol])

    # apply the similarity measure that matches the vectorizer
    if vectorType == 'Tfidf':
        print('Using Linear Kernel (Tfidf)')
        sim = linear_kernel(nmatrix, nmatrix)
    else:
        print('Using Cosine_similarity')
        sim = cosine_similarity(nmatrix, nmatrix)
    return sim

def content_recommender(df, seed, seedCol, sim_matrix, topN=5):
    # Map each value in seedCol to its row index
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

    # Obtain the index of the item that matches our seed
    idx = indices[seed]

    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Delete the item that was passed in
    del sim_scores[idx]

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the top-n most similar items
    sim_scores = sim_scores[:topN]

    # Get the item indices
    item_indices = [i[0] for i in sim_scores]

    # Look up the matching rows and attach their similarity scores
    snip = df.iloc[item_indices].copy()
    snip['sim_score'] = [i[1] for i in sim_scores]

    # Return the topN most similar items
    return snip

Question 17 Use The Content Recommender (2 points)

Using the provided content recommender function and the code you've prepared, get the top 5 recommendations for 'Tao Uptown'. (Hint: the top restaurant for 'Becco' should be 'Scampi'.)

Which of these restaurants is the top recommendation?

Haru Sushi - Amsterdam Ave
Bistrot Leo
Rice & Gold
Zengo - NYC
Restaurant Nippon

In [15]:

#Your code here

Question 18 - Build the Hybrid Function (manually graded) (4 points)

Sometimes recommendation designers are less focused on recommending things that have the highest rating and more focused on recommending things that will have an acceptable rating but are very similar to items the user has previously liked. For the homework, we're going to build a hybrid recommender that first identifies the most similar restaurants content-wise, then estimates the ratings, and finally returns the most highly rated restaurants. We'll follow the example used in the lesson, in which we pre-fetch the content recommendations and pass them into the hybrid function.

The full list of parameters needed will be:

user: the userid for which we are making predictions
contentRecs: the dataframe that contains the content recommendations, with similarity scores (this is returned for you in the content_recommender function we provided)
algo: the trained algorithm to use for collaborative filtering
predCol: the column in your contentRecs that can be used for predictions
minRating: the minimum rating we'll accept (estimated ratings should be greater than or equal to this number)
N: the final number of recommendations to return

Your function should return a dataframe that contains all of the information that was in your contentRecs plus the estimated rating for the "N" number of rows.
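
A function with that parameter list might be sketched as below. To keep the example self-contained, `FakeAlgo` is a made-up stand-in for a trained Surprise model: it only mimics the `predict(uid, iid)` call and the `.est` attribute of the result. The toy recommendations and the `placeID` column name are likewise assumptions.

```python
import pandas as pd
from collections import namedtuple

# Stand-in for a trained Surprise algorithm (hypothetical, for illustration):
# anything with .predict(uid, iid) returning an object with .est works here
Prediction = namedtuple('Prediction', ['uid', 'iid', 'est'])

class FakeAlgo:
    def __init__(self, estimates):
        self.estimates = estimates

    def predict(self, uid, iid):
        return Prediction(uid, iid, self.estimates.get(iid, 3.0))

def hybrid(user, contentRecs, algo, predCol, minRating, N):
    """Estimate ratings for pre-fetched content recs and keep the best N."""
    recs = contentRecs.copy()
    # Estimated rating of each candidate for this user
    recs['est'] = recs[predCol].apply(lambda item: algo.predict(user, item).est)
    # Keep only candidates that meet the minimum acceptable rating
    recs = recs[recs['est'] >= minRating]
    # Highest estimated rating first, limited to N rows
    return recs.sort_values('est', ascending=False).head(N)

# Toy content recommendations, shaped like content_recommender's output
content_recs = pd.DataFrame({
    'placeID': [10, 11, 12, 13],      # hypothetical item id column
    'name': ['A', 'B', 'C', 'D'],
    'sim_score': [0.9, 0.8, 0.7, 0.6],
})
algo = FakeAlgo({10: 4.2, 11: 4.8, 12: 4.6, 13: 4.9})
top = hybrid(1, content_recs, algo, 'placeID', 4.5, 2)
print(top)
```

Filtering before sorting means fewer than N rows can come back when too few candidates reach `minRating`.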

In [36]:

#Your code here

Question 19 - Calling the Hybrid Function (2 points)

Use your hybrid function to find recommendations for user 1235 and restaurant 'Lido'.

Remember, you will need to call your content_recommender function first to get the similarity scores.
Pass the top 25 restaurants with the highest sim_scores from the content recommender to the collaborative recommender.
Use the SVD algorithm you trained in Question 10 to predict ratings.
The minimum allowed rating is 4.5.
Return the top 3 recommendations.

Which answer shows the top 3 recommendations, in order?

Hint: If you make recommendations for user 1061 and 'Schilling', with everything else the same, the top recommendation should be Trattoria Italienne.

Naples 45 Ristorante E Pizzeria, Obica Mozzarella Bar Pizza e Cucina, La Pecora Bianca - NoMad
Il Mulino New York - Downtown, Bocca di Bacco, Felice 64 Wine Bar
Becco, La Pecora Bianca - Midtown, Stella 34 Trattoria
La Pecora Bianca - NoMad, La Pecora Bianca - Midtown, Stella 34 Trattoria
Esca, Lincoln Ristorante, La Pecora Bianca - Midtown

In [16]:

#Add your code here

Question 20 KNNWithMeans item-based collaborative filter (manually graded) (4 points)

Train a KNNWithMeans Surprise collaborative filter. We ran a gridsearch already and learned that the best k for this is 3, and we get the best results using an item-based similarity measure. You should:

Set seeds of 14
Read in the data and set up your reader
Set up a data object
Build a full trainset
set up a KNNWithMeans algorithm using the following parameters:
- k of 3
- set the sim_options 'user_based' to False (this switches it to an item-based similarity measure, instead of a user-based).
fit the algorithm using the full trainset
predict the rating for user 1000 and restaurant 300

Hint: the predicted rating for user 1000 and restaurant 300 should be 4.32

In [17]:

#Add your code here

Use your hybrid function again with user 1243 and restaurant 'Lido'.

Remember, you will need to call your content_recommender function first to get the similarity scores.
Pass the top 25 restaurants with the highest sim_scores from the content recommender to the collaborative recommender.
Use the KNN algorithm you just trained to predict ratings.
The minimum allowed rating is 4.5.
Return the top 3 recommendations.

Hint: If you call your function with user 1001 and Becco, the top recommendation should be Gran Morsi.

What are the top 3 restaurants, in order?

Bar Primi, Naples 45 Ristorante E Pizzeria, La Pecora Bianca - NoMad
Il Mulino New York - Uptown, Naples 45 Ristorante E Pizzeria, Bar Primi
Felice 64 Wine Bar, Lincoln Ristorante, Scampi
Trattoria Italienne, Taralluci e Vino Union Square, Felice 64 Wine Bar
La Pecora Bianca - Midtown, La Pecora Bianca - NoMad, Naples 45 Ristorante E Pizzeria

In [1]:

#Your code here
