GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Homework/Lesson 14 HW - RecSys 2/Homework_14.ipynb
Kernel: Python 3 (system-wide)
# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
pd.set_option('display.html.use_mathjax', False)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Reader, Dataset, KNNBasic, NormalPredictor, BaselineOnly, KNNWithMeans, KNNBaseline
from surprise import SVD, SVDpp, NMF, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import accuracy
import random
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
plt.style.use('ggplot')  # go through plt; the bare name `matplotlib` is not imported

# display imports
from IPython.display import display, IFrame, HTML

Lesson 14 Homework: Recommender Systems 2

When asking questions about homework in Piazza please use a tag in the subject line like HW14.7 to refer to Homework 14, Question 7. So the subject line might be HW14.7 question. Note there are no spaces in "HW14.7". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

Question 1 (2 points)

Which of the following recommenders is based on the user/item ratings? (Check all that apply.)

  • SVD item-based collaborative filter

  • KNN user-based collaborative filter

  • Content recommender

  • Knowledge-based recommender

  • Chart

Question 2 (2 points)

Which Surprise algorithm reduces the size of the problem space through matrix factorization?

  • NormalPredictor

  • KNNBasic

  • KNNWithMeans

  • BaselineOnly

  • SVD

  • KNNWithZScores

Data Exploration

(Note: This section is not included in the quiz and is ungraded.)

The file restaurant_ratings.csv (found in the presentation download for this lesson) contains user ratings for various New York City restaurants. You can read a little more about the data at Kaggle. We have modified the data to generate user ratings that match the star columns in this file.

Do the following:

  • read the data into a variable called "ratings"

  • display the first 5 lines of the data (get familiar with the data frame)

  • find the minimum restaurant rating

  • find the maximum restaurant rating

  • adjust the rating scale by shifting up 1 if 0 is included
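The exploration steps above can be sketched on a small made-up frame. This is only an illustration: the toy values and the `placeID` and `rating` column names are assumptions, not the contents of restaurant_ratings.csv.

```python
import pandas as pd

# Toy stand-in for restaurant_ratings.csv; values and column names are assumed
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "placeID": [10, 11, 10, 12, 11],
    "rating": [0, 2, 1, 2, 0],
})

print(ratings.head())  # first 5 rows

min_rating = ratings["rating"].min()
max_rating = ratings["rating"].max()

# Shift the whole scale up by 1 only when 0 is part of it
if min_rating == 0:
    ratings["rating"] = ratings["rating"] + 1

shifted_min = ratings["rating"].min()
shifted_max = ratings["rating"].max()
```

With the real file, the frame would come from `pd.read_csv` instead of the literal dictionary.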

#Add your code here

Question 3 (2 points)

What is the minimum restaurant rating?

#Add your code here

Question 4 (2 points)

What is the maximum restaurant rating?

#Add your code here

Question 5 (2 points)

What is the mean restaurant rating for all restaurants (rounded to 2 significant digits)?

#Add your code here

Question 6 (2 points)

What is the median of the restaurant rating scale?

#Add your code here

Train/Test Split and Score Setup

(Note: this section is not included in the quiz and is not graded.)

We've provided code below for a scoring function and for splitting the data into train and test sets. Use the train and test sets generated by this code to answer the next questions. Do not change this code if you want to get the correct answers.

#This section not included in quiz/solutions.

#Function to compute the RMSE score obtained on the testing set by a model
def score(cf_model, X_test, *args):

    #Construct a list of user-item tuples from the testing dataset
    id_pairs = zip(X_test[X_test.columns[0]], X_test[X_test.columns[1]])

    #Predict the rating for every user-item tuple
    y_pred = np.array([cf_model(user, item, *args) for (user, item) in id_pairs])

    #Extract the actual ratings given by the users in the test data
    y_true = np.array(X_test[X_test.columns[2]])

    #Return the final RMSE score
    return mean_squared_error(y_true, y_pred, squared=False)

#Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['userID']

#Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.20, random_state=14)

Question 7 (2 points)

Compute a baseline model that always returns the median of the rating scale (rounded to 2 significant digits). What is the RMSE on this model?
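As a rough illustration of what a constant baseline does, the sketch below scores a model that ignores its inputs on a few made-up test triples. The constant 3.0 and the toy ratings are assumptions chosen for the example, not values from the homework data.

```python
import numpy as np

def baseline(user, item, constant=3.0):
    """Ignore the user and the item and always predict the same constant."""
    return constant

# Toy (user, item, rating) triples standing in for X_test
test_rows = [(1, 10, 2.0), (1, 11, 4.0), (2, 10, 5.0), (3, 12, 3.0)]

y_true = np.array([r for _, _, r in test_rows])
y_pred = np.array([baseline(u, i) for u, i, _ in test_rows])

# Root mean squared error, written out explicitly
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

A function with this `(user, item)` signature can be passed directly as `cf_model` to the provided `score` function.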

#Add your code here

Question 8 Build a Weighted Mean User-Based Filter (manually graded) (4 points)

From data in the file restaurant_ratings.csv, build a ratings matrix from the data frame of users, restaurants, and ratings, then build a user-based collaborative filtering model that predicts a rating as the mean of other users' ratings weighted by the cosine similarity among users.
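One way to picture the weighted-mean idea is the NumPy sketch below, which works on a tiny hand-made ratings matrix. It is a sketch under assumptions, not the lesson's implementation: the matrix values, the `default` fallback of 3.0, and the choice to treat missing ratings as 0 when computing similarity are all illustrative.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are restaurants, NaN = not rated
R = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 2.0],
    [1.0, 2.0, 5.0],
])

# Cosine similarity between users, with missing ratings treated as 0
R0 = np.nan_to_num(R, nan=0.0)
norms = np.linalg.norm(R0, axis=1, keepdims=True)
user_sim = (R0 @ R0.T) / (norms @ norms.T)

def weighted_mean_rating(user, item, default=3.0):
    """Similarity-weighted mean of the ratings other users gave to `item`."""
    rated = ~np.isnan(R[:, item])
    rated[user] = False            # never let the user vote for themselves
    weights = user_sim[user, rated]
    if weights.sum() == 0:
        return default             # no comparable user has rated the item
    return float(weights @ R[rated, item] / weights.sum())

pred = weighted_mean_rating(user=0, item=2)
```

Users who are more similar to the target user pull the prediction toward their own rating, which is why the estimate here lands closer to user 1's rating of 2 than a plain average would.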

# Add your code here

Question 9 (2 points)

What is the RMSE (rounded to 2 significant digits) of the Weighted Mean algorithm?

#Add your code here

Question 10 User-Based SVD - Hyperparameter tuning (Manually Graded) (4 points)

From data in the file restaurant_ratings.csv, use the Surprise library in Python to build an SVD user-based collaborative filtering model for the restaurant ratings. Use a grid search to tune the hyperparameters, reserving 15% of the data to get an unbiased estimate of the accuracy. For the grid, use the following options:

  • 'n_epochs': [15, 20, 25] (The number of iterations of the Stochastic Gradient Descent minimization procedure.)

  • 'lr_all': [.005, .025, .001] (The learning rate.)

  • 'reg_all': [.01, .02, .05] (The penalty for complex models.)

Additionally, use the following:

  • 3 folds for cross validation

  • a seed of 14

Use the example from the lesson and be sure to set the seed in the appropriate place. Note: this code will take several minutes to run.
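To see why the search takes a while, and what "reserving 15%" amounts to, here is a library-free sketch. It only enumerates the grid and splits a toy list of rating triples; it does not call Surprise, and the toy triples are made up. With Surprise, the analogous split is typically applied to the dataset's raw_ratings list before the grid search is fit.

```python
import random
from itertools import product

param_grid = {
    "n_epochs": [15, 20, 25],
    "lr_all": [0.005, 0.025, 0.001],
    "reg_all": [0.01, 0.02, 0.05],
}

# Every hyperparameter combination the grid search will try
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * 3  # one fit per combination per fold (3 folds)

# Shuffle with a fixed seed, then hold out the last 15% for the unbiased estimate
random.seed(14)
raw_ratings = [(u, i, 3.0) for u in range(20) for i in range(5)]  # toy triples
random.shuffle(raw_ratings)
cut = round(0.85 * len(raw_ratings))
tuning_ratings, holdout_ratings = raw_ratings[:cut], raw_ratings[cut:]
```

The accuracy measured on the tuning portion is optimistic because the hyperparameters were chosen on it; the held-out portion was never seen during tuning, so its error is the unbiased figure.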

#Add your code here

Question 11 (2 points)

What is the biased accuracy (rounded to 2 significant digits) of the algorithm?

# Add your code here

Question 12 (2 points)

What is the unbiased accuracy (rounded to 2 significant digits) of the algorithm?

#Add your code here

Question 13 (2 points)

What is the number of iterations of the stochastic gradient descent ('n_epochs') value chosen by the grid search?

#Add your code here

Question 14 (2 points)

What is the learning rate ('lr_all') chosen by the grid search?

#Add your code here

Question 15 (2 points)

What is the regularization ('reg_all') chosen by the grid search?

#Add your code here

Question 16 (2 points)

Now that we know what our best parameters should be, we need to train our SVD model on all the available data. Do the following:

  • set the seeds for reproducibility

  • reset the data.raw_ratings to all of the ratings OR reload the data from the dataframe

  • use the build_full_trainset() method to build a full trainset

  • set up an SVD algorithm using the best parameters

  • fit the data to the trainset

  • predict the estimated rating for user 1061 and restaurant 347

What is the predicted estimated rating (rounded to 2 digits) for user 1061 and restaurant 347?
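For intuition about what the fitted model computes when it predicts, the sketch below evaluates the biased matrix-factorization estimate, global mean plus user bias plus item bias plus the dot product of the factor vectors, and clips it to the rating scale. All parameter values and the 1 to 5 scale are hypothetical; a trained model learns its own.

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, scale=(1.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(max(raw, scale[0]), scale[1])

# Hypothetical learned parameters for one user and one restaurant
mu = 3.5                        # global mean rating
b_u, b_i = 0.2, -0.1            # user bias and item bias
p_u = np.array([0.5, -0.2])     # user factors
q_i = np.array([0.4, 0.1])      # item factors

est = svd_estimate(mu, b_u, b_i, p_u, q_i)
```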

#Add your code here

Hybrid Filter Setup

(Note: This section is not included in the quiz/solutions.)

From data in the files restaurant_ratings.csv and restaurants.csv build a recommender system that is a hybrid of a metadata content-based recommender and the SVD user-based collaborative filter that you just trained.

To set up your hybrid filter:

  • read in the restaurants.csv into a variable called rest

  • review the data in the dataframe (Note that we have pre-cleaned the data for you, including using TextBlob to extract just the relevant descriptors from the description. Not all restaurants have a description.)

  • make a soup from the following columns, which are all simple strings (Hint: the soup for the first item in the rest dataframe should be: 'Contemporary American Average_price rustic airy adorable classic most distinguished uncommon innovative American proud only world-class week.IMPORTANT special welcome'):

    • restaurant_type

    • price_range

    • ambiance

    • descriptors

  • Instantiate a CountVectorizer with no stopwords (use stop_words = None). (We shouldn't have much in the way of stopwords, since it's all keywords.)

  • Use the provided fetchSimilarityMatrix function to get a CountVectorizer similarity matrix using the soup column. (Hint: the similarity at [0,2] should be 0.2849014411490949.)
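The soup-and-similarity steps above can be illustrated without any libraries. The sketch below joins the four metadata columns into one string per restaurant and computes cosine similarity between word-count vectors. The two rows are made up, and the whitespace tokenization is a simplification: scikit-learn's CountVectorizer tokenizes differently by default, so numbers from this sketch will not match the hints.

```python
import math
from collections import Counter

# Two toy rows; the values are made up for illustration
rows = [
    {"restaurant_type": "Italian", "price_range": "Average_price",
     "ambiance": "rustic cozy", "descriptors": "classic fresh"},
    {"restaurant_type": "Italian", "price_range": "High_price",
     "ambiance": "rustic", "descriptors": None},
]
soup_cols = ["restaurant_type", "price_range", "ambiance", "descriptors"]

def make_soup(row):
    """Join the metadata columns with spaces, skipping missing values."""
    return " ".join(row[c] for c in soup_cols if row[c])

def count_cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

soups = [make_soup(r) for r in rows]
sim = count_cosine(soups[0], soups[1])
```

The two toy restaurants share two of their keywords, so their similarity sits between 0 (nothing in common) and 1 (identical soups).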

# Not Included in Quiz/Solutions

def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
    '''
    Parameters
        df: the dataframe containing a soup column to transform
        soupCol: the string title of the soup column
        vectorizer: an initialized vectorizer, with all pre-processing you desire
        vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.
    Returns
        Similarity matrix (dense NumPy array)
    '''
    # make sure the soup has no NaN
    df[soupCol] = df[soupCol].fillna('')
    nmatrix = vectorizer.fit_transform(df[soupCol])

    # apply the similarity measure that matches the vectorizer
    if(vectorType=='Tfidf'):
        print('Using Linear Kernel (Tfidf)')
        sim = linear_kernel(nmatrix, nmatrix)
    else:
        print('Using Cosine_similarity')
        sim = cosine_similarity(nmatrix, nmatrix)
    return(sim)

def content_recommender(df, seed, seedCol, sim_matrix, topN=5):
    # get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

    # Obtain the index of the item that matches our seed
    idx = indices[seed]

    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))

    # delete the item that was passed in
    del sim_scores[idx]

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the top-n most similar items.
    sim_scores = sim_scores[:topN]

    # Get the item indices
    movie_indices = [i[0] for i in sim_scores]
    snip = df.iloc[movie_indices].copy()
    snip['sim_score'] = [i[1] for i in sim_scores]

    # Return the topN most similar items
    return snip

Question 17 Use The Content Recommender (2 points)

Using the provided content recommender function and the code you've prepared, get the top 5 recommendations for 'Tao Uptown'. (Hint: the top restaurant for 'Becco' should be 'Scampi'.)

Which of these restaurants is the top recommendation?

  • Haru Sushi - Amsterdam Ave

  • Bistrot Leo

  • Rice & Gold

  • Zengo - NYC

  • Restaurant Nippon

#Your code here

Question 18 - Build the Hybrid Function (manually graded) (4 points)

Sometimes recommendation designers are less focused on recommending the items with the highest ratings and more focused on recommending items that will have an acceptable rating but are very similar to items the user has previously liked. For the homework, we're going to build a hybrid recommender that first identifies the most similar restaurants content-wise, estimates the ratings, and returns the most highly rated restaurants. We'll follow the example used in the lesson, in which we pre-fetch the content recommendations and pass them into the hybrid function.

The full list of parameters needed will be:

  • user: the userid for which we are making predictions

  • contentRecs: the dataframe that contains the content recommendations, with similarity scores (this is returned for you in the content_recommender function we provided)

  • algo: the trained algorithm to use for collaborative filtering

  • predCol: the column in your contentRecs that can be used for predictions

  • minRating: the minimum rating we'll accept (estimated ratings should be greater than or equal to this number)

  • N: the final number of recommendations to return

Your function should return a dataframe that contains all of the information that was in your contentRecs plus the estimated rating for the "N" number of rows.
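A minimal sketch of a function with this parameter list follows. It is one possible shape, not the lesson's solution. `StubAlgo` is a hypothetical stand-in for a trained model whose `predict(uid, iid)` result exposes an `est` attribute, and the `placeID`, `name`, and `est` column names are assumptions.

```python
from collections import namedtuple
import pandas as pd

Prediction = namedtuple("Prediction", ["uid", "iid", "est"])

class StubAlgo:
    """Hypothetical stand-in for a trained model exposing predict(uid, iid).est."""
    def __init__(self, estimates):
        self.estimates = estimates

    def predict(self, uid, iid):
        return Prediction(uid, iid, self.estimates.get(iid, 3.0))

def hybrid(user, contentRecs, algo, predCol, minRating, N):
    """Estimate ratings for pre-fetched content recs, filter, and keep the top N."""
    recs = contentRecs.copy()
    recs["est"] = [algo.predict(user, item).est for item in recs[predCol]]
    recs = recs[recs["est"] >= minRating]
    return recs.sort_values("est", ascending=False).head(N)

# Toy content recommendations, as if returned by content_recommender
contentRecs = pd.DataFrame({
    "placeID": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "sim_score": [0.9, 0.8, 0.7, 0.6],
})
algo = StubAlgo({1: 4.2, 2: 4.8, 3: 4.6, 4: 4.9})
top = hybrid(user=7, contentRecs=contentRecs, algo=algo,
             predCol="placeID", minRating=4.5, N=2)
```

Copying `contentRecs` before adding the estimate column keeps the pre-fetched recommendations unchanged, so the same frame can be reused with a different user or algorithm.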

#Your code here

Question 19 - Calling the Hybrid Function (2 points)

Use your hybrid function to find recommendations for user 1235 and restaurant 'Lido'.

  • Remember, you will need to call your content_recommender function first to get the similarity scores.

  • Pass the top 25 restaurants with the highest sim_scores from the content recommender to the collaborative recommender.

  • Use the SVD algorithm you trained in Question 10 to predict ratings.

  • The minimum allowed rating is 4.5.

  • Return the top 3 recommendations.

Which answer shows the top 3 recommendations, in order?

Hint: If you make recommendations for user 1061 and 'Schilling', with everything else the same, the top recommendation should be Trattoria Italienne.

  • Naples 45 Ristorante E Pizzeria, Obica Mozzarella Bar Pizza e Cucina, La Pecora Bianca - NoMad

  • Il Mulino New York - Downtown, Bocca di Bacco, Felice 64 Wine Bar

  • Becco, La Pecora Bianca - Midtown, Stella 34 Trattoria

  • La Pecora Bianca - NoMad, La Pecora Bianca - Midtown, Stella 34 Trattoria

  • Esca, Lincoln Ristorante, La Pecora Bianca - Midtown

#Add your code here

Question 20 KNNWithMeans item-based collaborative filter (manually graded) (4 points)

Train a KNNWithMeans Surprise collaborative filter. We ran a gridsearch already and learned that the best k for this is 3, and we get the best results using an item-based similarity measure. You should:

  • Set seeds of 14

  • Read in the data and set up your reader

  • Set up a data object

  • Build a full trainset

  • set up a KNNWithMeans algorithm using the following parameters:

  • fit the algorithm using the full trainset

  • predict the rating for user 1000 and restaurant 300

Hint: the predicted rating for user 1000 and restaurant 300 should be 4.32
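To make the "with means" part concrete, the sketch below computes an item-based, mean-centered k-nearest-neighbor estimate by hand: start from the target item's mean rating and add the similarity-weighted average of how far the user's ratings of the k most similar items sit from those items' means. The function name, the similarities, and the ratings are all made up for illustration; this is not the library's code.

```python
def knn_with_means_item(user_ratings, item_means, sims, target, k=3):
    """
    user_ratings: {item: rating} for one user
    item_means:   {item: mean rating of that item}, including the target
    sims:         {item: similarity between that item and the target}
    """
    # k most similar items among those the user has already rated
    neighbors = sorted(user_ratings, key=lambda j: sims[j], reverse=True)[:k]
    neighbors = [j for j in neighbors if sims[j] > 0]

    den = sum(sims[j] for j in neighbors)
    if den == 0:
        return item_means[target]  # fall back to the item's own mean
    num = sum(sims[j] * (user_ratings[j] - item_means[j]) for j in neighbors)
    return item_means[target] + num / den

user_ratings = {"a": 5.0, "b": 3.0, "c": 4.0, "d": 2.0}
item_means = {"a": 4.0, "b": 3.5, "c": 4.5, "d": 3.0, "x": 4.0}
sims = {"a": 0.9, "b": 0.5, "c": 0.3, "d": 0.1}

est = knn_with_means_item(user_ratings, item_means, sims, target="x", k=3)
```

With k = 3, item "d" is ignored because it is the least similar of the four rated items.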

#Add your code here

Use your hybrid function again with user 1243 and restaurant 'Lido'.

  • Remember, you will need to call your content_recommender function first to get the similarity scores.

  • Pass the top 25 restaurants with the highest sim_scores from the content recommender to the collaborative recommender.

  • Use the KNN algorithm you just trained to predict ratings.

  • The minimum allowed rating is 4.5.

  • Return the top 3 recommendations.

Hint: If you call your function with user 1001 and Becco, the top recommendation should be Gran Morsi.

What are the top 3 restaurants, in order?

  • Bar Primi, Naples 45 Ristorante E Pizzeria, La Pecora Bianca - NoMad

  • Il Mulino New York - Uptown, Naples 45 Ristorante E Pizzeria, Bar Primi

  • Felice 64 Wine Bar, Lincoln Ristorante, Scampi

  • Trattoria Italienne, Taralluci e Vino Union Square, Felice 64 Wine Bar

  • La Pecora Bianca - Midtown, La Pecora Bianca - NoMad, Naples 45 Ristorante E Pizzeria

#Your code here