DataScienceUWL
GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Homework/Lesson 13 HW - RecSys1/Homework_13.ipynb
Kernel: Python 3 (system-wide)
#Not included in Quiz/Solutions
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML

# computational imports
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import string

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!

Week 13 Homework - Recommender Systems 1

When asking questions about homework in Piazza please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be HW1.3 question. Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

General Multiple Choice Questions

Question 1 2 points

When would you use the dot-product similarity function?

  • To calculate the similarity matrix for a Tfidf Vector Matrix

  • To calculate the similarity matrix for a Count Vector Matrix

  • To standardize text to root words

  • To combine columns of text before vectorization

Question 2 2 points

What is lemmatization?

  • Shortening words by removing suffixes and prefixes

  • Standardizing text to their root words

  • Generating a matrix of word counts

  • Chunking text into multi-word phrases

Build a Knowledge-Based Recommender

You will be using the data set tmdb-simplified.csv to build a simple knowledge-based recommender system. This data set can be found in the data folder in the same folder as this notebook.

You will need to use the option encoding = "ISO-8859-1" in the read_csv function in order to open this file.

  • Read in the file to a variable called "movies" and review the data.

  • Apply literal_eval to the genres, keywords, and production_companies columns. (They are already lists, not dictionaries.)

  • Filter out movies that have a missing (NaN) or zero budget.

  • Determine how many rows are in this dataframe.

Note: This code is ungraded.
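
The preparation steps above can be sketched on a toy dataframe. This is a minimal illustration, not the graded solution: the dataframe `toy` and its values are made up, and only the column names `genres` and `budget` come from the assignment. It shows that list columns arrive as strings after `read_csv` and that `literal_eval` turns them back into Python lists.

```python
from ast import literal_eval

import pandas as pd

# Toy stand-in for the CSV (made-up values): list columns are read in as strings.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['action', 'crime']", "['comedy']", "[]"],
    "budget": [1000000.0, 0.0, None],
})

# Turn each stringified list back into a real Python list.
toy["genres"] = toy["genres"].apply(literal_eval)

# Keep only rows whose budget is present and non-zero.
toy = toy[toy["budget"].notna() & (toy["budget"] != 0)]

print(type(toy["genres"].iloc[0]))  # <class 'list'>
print(len(toy))                     # 1
```

For the real file, the same two steps would follow a `pd.read_csv(..., encoding="ISO-8859-1")` call, as described above.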

Question 3 - How many rows of data are there? 1 point

You won't get the right answer if you don't filter out rows with NaNs and zero values for budget.

#Add your code here

Question 4 - Prep Work & Building a Filter Function (manually graded) 5 points

Before we build the recommender function that allows for user input, we're going to write a filter function that takes in manual (coded) input and filters our dataframe. Your function should take in parameters for the dataframe, two genres, a production company, and max budget. The filter should identify movies that meet the following criteria:

  • Have either genre

  • Are NOT made by the production company (the production company is not in the list of production companies)

  • Have a budget that is less than or equal to the max budget.

The function should return the filtered dataframe.

We've given you the function definition. Fill in the code.

Use the examples given in the lesson and Banik's book as a guide. (Do not explode. Use the lesson approach.)
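
One way to think about the three criteria is as three boolean masks combined with `&`. The sketch below uses a made-up dataframe and made-up company names ('Studio X', 'Studio Y'); it assumes the list columns have already been parsed into Python lists, and it is not necessarily the approach used in the lesson.

```python
import pandas as pd

# Made-up data; the list columns are assumed to be real Python lists already.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["action"], ["crime", "drama"], ["comedy"], ["action", "crime"]],
    "production_companies": [["Studio X"], ["Studio Y"], ["Studio Y"],
                             ["Studio X", "Studio Y"]],
    "budget": [500, 900, 100, 5000],
})

# Mask 1: the movie has at least one of the two genres.
has_genre = toy["genres"].apply(lambda g: "action" in g or "crime" in g)
# Mask 2: the unwanted company is not among the production companies.
not_company = toy["production_companies"].apply(lambda c: "Studio X" not in c)
# Mask 3: the budget does not exceed the maximum.
within_budget = toy["budget"] <= 1000

filtered = toy[has_genre & not_company & within_budget]
print(filtered["title"].tolist())  # ['B']
```

Only movie B survives: A and D involve 'Studio X', and C has neither genre.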

def filterMovies(df, genre1, genre2, company, budgetmax):
    '''
    Parameters:
    df: The pandas dataframe to filter
    genre1: A possible genre
    genre2: Another possible genre
    company: A production company that can not be in the production company column
    budgetmax: The maximum budget allowed

    Returns: a filtered dataframe
    '''
    ##############
    #write your code here
    ##############

Hint: If you call your function with the following parameters, you should be left with 27 movies:

  • genres of 'action' and 'adventure'

  • production company: 'Beijing New Picture Film Co. Ltd.'

  • max budget: 1000000

Question 5 Calling Your Filter Function 2 points

Call your function using the following parameters:

  • genres of 'action' and 'crime'

  • the production company 'Columbia Pictures'

  • max budget of 2 million (2000000).

Report how many movies are left.

#Add your code here

Question 6 - Fetch the List of Unique Genres (multiple choice) 2 points

Using the examples from the lesson, generate a string of unique genres. Sort the genres alphabetically. Note: for Questions 5-7 you should be using the dataframe you produced in Question 3.

What is the 3rd word in the sorted string list?

  • fantasy

  • animation

  • comedy

  • adventure

  • crime

#Add your code here

Question 7 - Count the Number of Unique Production Companies 2 points

Using the examples from the lesson, generate a numpy array of production companies and determine the length of that array. How many unique production companies are there?
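
As a rough illustration of collecting unique values from a column of lists, the sketch below flattens a made-up `genres` column and passes it to `np.unique`, which both deduplicates and sorts. The lesson may use a different technique; this is only one possible approach, shown on toy data.

```python
import numpy as np
import pandas as pd

# Made-up data: each cell holds a list of genre strings.
toy = pd.DataFrame({
    "genres": [["action", "crime"], ["comedy"], ["crime", "drama"]],
})

# Flatten the list column into one long list of strings.
all_genres = [g for row in toy["genres"] for g in row]

# np.unique removes duplicates and returns the values in sorted order.
unique_genres = np.unique(all_genres)

print(", ".join(unique_genres))  # action, comedy, crime, drama
print(len(unique_genres))        # 4
```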

#Add your code here

Question 8 - Creating the User Input Function (Manually graded) 5 points

We finally have all the pieces to create a function that returns the top N movies based on the IMDB score and the filter you wrote. We're going to modify and expand on the build_chart function from the lesson. Once again, we'll give you the function definition in the cell below. We are also giving you the weighted_rating function. Be sure to run that cell.

Your build_chart function should take in:

  • the dataframe to filter

  • the filter function (you've already written this)

  • the rater function (provided below)

  • a parameter called "filter_location" which should be either the string 'before' or the string 'after' (filter before or after computing m and C and scoring)

  • a number of movies to return.

  • use the 80th percentile to compute m

The function should return the top 'n' rows of a dataframe sorted in descending order of the score column. It will return whatever columns you pass in.

Unlike the lesson, we do not want you to filter out the movies with low vote counts!
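
To make the scoring step concrete, here is a small sketch on made-up vote data. It repeats the provided weighted_rating function so the block runs on its own, takes m as the 80th percentile of `vote_count`, and takes C as the mean of `vote_average`. Because m and C are computed from whatever rows are present, filtering before or after this step can change the scores.

```python
import pandas as pd

def weighted_rating(x, m, C):
    """IMDB weighted rating, as in the provided cell."""
    v = x["vote_count"]
    R = x["vote_average"]
    return (v / (v + m) * R) + (m / (m + v) * C)

# Made-up vote data for five movies.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [10, 20, 30, 40, 1000],
    "vote_average": [9.0, 6.0, 7.0, 8.0, 7.5],
})

m = toy["vote_count"].quantile(0.80)  # 80th percentile of vote counts (232.0 here)
C = toy["vote_average"].mean()        # mean rating across all rows (7.5 here)

# Score every row, then sort in descending order of score.
toy["score"] = toy.apply(weighted_rating, axis=1, args=(m, C))
top = toy.sort_values("score", ascending=False).head(3)
print(top[["title", "score"]])
```

Movie A has the highest raw average, but its score is pulled toward C because it has only 10 votes.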

There are two approaches to writing the build_chart function presented in the lesson.

The first prompts the user to input the values used for filtering; the second allows the values to be passed as arguments to the build_chart function.

We recommend the second approach as it's much easier for testing and development.

#not included in quiz/solutions
#################
# Provided code. Run this cell
#################

# Function to compute the IMDB weighted rating for each movie
def weighted_rating(x, m, C):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)
def build_chart(df, filter_func, rater, filter_location, n=10):
    '''
    Parameters
    df: the dataframe that will be filtered, scored, sorted (not necessarily in that order)
    filter_func: the function that's being used to filter (in this case, it would be the filterMovies function)
    rater: the function used to rate or score each movie (in this case, it would be the weighted_rating function)
    filter_location: either the string 'before' or the string 'after'. If 'before' is passed, the filter will be applied before scoring. If 'after', it will be applied after.
    n: the number of rows to return. Defaults to 10

    Returns
    The top n rows of the sorted dataframe
    '''
    #Add your code here

Hint: if you run the cell below, the first movie returned should be Monty Python and the Holy Grail

#not included in quiz/solutions
#Use the following: 'action', 'adventure', 'Beijing New Picture Film Co. Ltd.', 1000000, filtered after
build_chart(movies, filterMovies, weighted_rating, 'after', 5) # modify to pass filtering values ...

Question 9 - Testing Your Function 2 points

Feel free to modify your build_chart function to allow the inputs to be passed to the function as we did in the lesson. It makes for easier testing.

Run your build_chart with the following parameters:

  • genre 1 = horror

  • genre 2 = mystery

  • production company = Paramount Pictures

  • max budget = 1500000

  • n = 7

  • filter before scoring

What is the final movie in your chart?

  • The Evil Dead

  • Night of the Living Dead

  • Saw

  • Eraserhead

  • Rebecca

#add your code here

Question 10 - Filter After 2 points

Now use the same parameters, but perform the filter after you apply the scores.

What is the final movie in your chart?

  • The Evil Dead

  • Night of the Living Dead

  • Insidious

  • Eraserhead

  • Rebecca

#Add your code here

Preparing to Build a Content-Based Recommender

In this section of the homework, you will prepare to build a content-based recommender that can flexibly use either CountVectorizer or TfidfVectorizer. We're including our lemmatization setup code for you. Run the cell below then proceed to part a.

#not included in quiz/solutions
#################################
# This cell does all the set up work - you only need to run this once per notebook
#################################

#Create a helper function to get part of speech
def get_wordnet_pos(word, pretagged = False):
    """Map POS tag to first character lemmatize() accepts"""
    if pretagged:
        tag = word[1].upper()
    else:
        tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}
    return tag_dict.get(tag, wn.NOUN)

#create a tokenizer that uses lemmatization (word shortening)
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, articles):
        #get the sentences
        sents = sent_tokenize(articles)
        #get the parts of speech for sentence tokens
        sent_pos = [nltk.pos_tag(word_tokenize(s)) for s in sents]
        #flatten the list
        pos = [item for sublist in sent_pos for item in sublist]
        #lemmatize based on POS (otherwise, all words are nouns)
        lems = [self.wnl.lemmatize(t[0], get_wordnet_pos(t, True)) for t in pos if t[0] not in string.punctuation]
        #clean up in-word punctuation
        lems_clean = [''.join(c for c in s if c not in string.punctuation) for s in lems]
        return lems_clean

#lemmatize the stop words
lemmatizer = WordNetLemmatizer()
lemmatized_stop_words = [lemmatizer.lemmatize(w) for w in text.ENGLISH_STOP_WORDS]
#extend the stop words with any other words you want to add, these are bits of contractions
lemmatized_stop_words.extend(['ve','nt','ca','wo','ll'])

Question 11 - Create the fetchSimilarityMatrix Function (Manually Graded) 5 points

We know that we have two kinds of vectorization we can do, and each requires a slightly different similarity matrix. Let's create a wrapper function that has the following parameters:

  • df: the dataframe holding our data

  • soupCol: the string name of the column holding our soup (this should already be ready to go - you shouldn't be creating your soup inside this function)

  • vectorizer: an initialized vectorizer. This will either be a TfidfVectorizer or a CountVectorizer

  • vectorType: a string representing either Tfidf or Count to indicate which type of vectorizer we are using

Inside your function, you'll:

  • make sure your soup has no NaN (fill with empty strings)

  • fit_transform your soup into a number matrix

  • if the vector type is 'Tfidf', use the linear_kernel() function to generate a similarity matrix

  • if the vector type is 'Count', use the cosine_similarity() function to generate a similarity matrix

  • return the sparse similarity matrix
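
The reason for the two branches can be seen on a tiny made-up corpus. TfidfVectorizer L2-normalizes each row by default, so the plain dot product from linear_kernel already equals the cosine similarity. Raw counts are not normalized, so for a count matrix the two functions disagree. The variable names below are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up three-document corpus.
docs = ["space hero saves city", "hero saves the world", "quiet family drama"]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)
count_matrix = CountVectorizer().fit_transform(docs)

# Tfidf rows have unit length, so the dot product is the cosine similarity.
tfidf_dot = linear_kernel(tfidf_matrix, tfidf_matrix)
tfidf_cos = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(np.allclose(tfidf_dot, tfidf_cos))  # True

# Count rows are not normalized, so the dot product differs from the cosine.
count_dot = linear_kernel(count_matrix, count_matrix)
count_cos = cosine_similarity(count_matrix, count_matrix)
print(np.allclose(count_dot, count_cos))  # False
```

Using linear_kernel on a Tfidf matrix simply skips a normalization that has already been done.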

def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
    '''
    Parameters
    df: the dataframe containing a soup column to transform
    soupCol: The string title of the soup column (or any column containing strings for which you want similarities)
    vectorizer: an initialized vectorizer, with all pre-processing you desire
    vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.

    Returns
    Sparse Similarity Matrix
    '''
    #Add your code here
    # 1. remove NaN from df for the designated column

    # 2. compute the count or tfidf matrix using the vectorizer you passed in (use fit_transform)

    #apply the appropriate vectorizer
    if(vectorType=='Tfidf'):
        print('Using Linear Kernel (Tfidf)')
        # 3. compute the sim matrix as shown in the lesson for a tfidf matrix, call it sim
    else:
        print('Using Cosine_similarity')
        # 4. compute the sim matrix as shown in the lesson for a count matrix, call it sim
    return(sim)

Hint: Running the code below should return 0.2

#hint code: not included in quiz/solutions
# Read in some ted talk data
ted = pd.read_csv('data/ted-simplified.csv')

#Define a TF-IDF Vectorizer Object. Use the LemmaTokenizer defined above, convert to lowercase, and remove stopwords, and only use the top 100 features.
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, max_features = 100)
sim = fetchSimilarityMatrix(ted, 'description', tfidf, 'Tfidf')
round(sim[1,0], 2)
Using Linear Kernel (Tfidf)
0.2

Question 12 - Test Your fetchSimilarityMatrix Function 2 points

Using the ted data we read in for you above, initialize a CountVectorizer that uses 'english' stop words, lowercase, and all the features. Call the fetchSimilarityMatrix function, using the column 'topics' for your soup.

What is the value at the [0,2] position in your matrix (rounded to 2 digits)?

#Add your code here

Question 13 - Preparing the Movies Metadata Soup (Manually Graded) 5 points

For this problem we'll be using the same data set tmdb-simplified.csv to build a meta-data based recommender by creating a "soup" based on:

  • all genres

  • all keywords

  • all production companies

You will need to sanitize the production companies and the keywords (but not genres). Review the self-assessment solution for code to sanitize.

Make sure that you concatenate the columns in the order listed (genres, then keywords, then production companies).
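
A hedged sketch of the sanitize-then-concatenate idea follows, using made-up values rather than the real data. The helper name `sanitize` is hypothetical here; the self-assessment solution may define it differently. The sketch assumes that sanitizing means lowercasing each entry and removing its internal spaces so that multi-word names become single tokens.

```python
import pandas as pd

def sanitize(items):
    """Hypothetical helper: lowercase each string and remove its spaces."""
    return [item.replace(" ", "").lower() for item in items]

# Made-up single-row dataframe with list-valued columns.
toy = pd.DataFrame({
    "genres": [["comedy", "drama"]],
    "keywords": [["time travel", "road trip"]],
    "production_companies": [["Studio X Films"]],
})

# Sanitize keywords and production companies, but leave genres untouched.
for col in ["keywords", "production_companies"]:
    toy[col] = toy[col].apply(sanitize)

# Concatenate the three lists in the required order, then join with spaces.
toy["soup"] = toy.apply(
    lambda row: " ".join(
        row["genres"] + row["keywords"] + row["production_companies"]
    ),
    axis=1,
)
print(toy["soup"].iloc[0])  # comedy drama timetravel roadtrip studioxfilms
```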

Do not reload the data; just use the dataframe you created and filtered in Question 3.

#Add your code here

Question 14 What is the soup for Spider-Man 3? 2 points

If your soup text has text that looks like this: "d u a l i d e n t i t y" it's probably because you applied .join() to a string and not a list of strings. Make sure .join() is only applied to lists of strings.
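
The difference is easy to reproduce in isolation. In this small illustration (variable names are made up), joining a list puts a space between its elements, while joining a string puts a space between its characters, because a string is itself a sequence of characters.

```python
# Joining a list of strings inserts a space between the elements.
keywords = ["dual identity", "amnesia"]
joined_list = " ".join(keywords)
print(joined_list)    # dual identity amnesia

# Joining a single string inserts a space between every character.
joined_string = " ".join("dualidentity")
print(joined_string)  # d u a l i d e n t i t y
```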

There are lots of different ways to extract text from a Pandas dataframe. You can use whatever way you choose; just make sure that you're able to see the complete text. Spider-Man 3 should be the 6th row in your dataframe (so with zero-based indexes, that would be [5]). We recommend that you confirm that you're reviewing the correct row. Once you're sure you're looking at the correct row, select which of the following is the correct soup for Spider-Man 3.

  • fantasy action adventure dualidentity amnesia sandstorm columbiapictures lauraziskinproductions marvelenterprises

  • fantasy action adventure dualidentityamnesiasandstorm columbiapictures lauraziskinproductions marvelenterprises

  • fantasy action adventure dual identity amnesia sandstorm Columbia Pictures Laura Ziskin Productions Marvel Enterprises

  • fantasy action adventure d u a l i d e n t i t y a m n e s i a s a n d s t o r m columbiapictures lauraziskinproductions marvelenterprises

#Add your code here

Question 15 Create Your Movie Similarity Matrix (Manually Graded) 2 points

Instantiate a CountVectorizer that converts text to lowercase, removes 'english' stop words, and keeps a maximum of 1000 features. Using this instance and your fetchSimilarityMatrix function, fetch the appropriate similarity matrix for the movie df's "soup" column. Do not use LemmaTokenizer this time.

#Add your code here

Question 16 Determine Similarity between two movies 1 point

There are many ways to use the matrix to determine the similarity between any two movies. In the cell below, we determine the similarity between 'Spider-Man 3' and 'The Dark Knight Rises' rounded to 2 decimal places. Do not use LemmaTokenizer this time.

Hint: it should be 0.11

Based on this sample code, determine the similarity between 'Primer' and 'Avatar', rounded to 2 decimal places.

#hint code
simdf = pd.DataFrame(sim, index=movies['title'], columns=movies['title'])
round(simdf['Spider-Man 3']['The Dark Knight Rises'], 2)
0.11
#Add your code here

Question 17 Generating Recommendations from the MetaData Soup 2 points

Finally! We have all our pieces and we can run our meta-data based content recommender. Use the pieces that you've done so far and the content_recommender function from the lesson (copied for you below) to determine the top 5 movies related to the "title" (that's your seed column) of "Spider-Man 3" - based on the similarity matrix you've already generated above.

What is the top movie?

  • The Amazing Spider Man

  • The Monkey King 2

  • Spider-Man 2

  • The Broadway Melody

  • Krull

#not included in quiz/solutions
def content_recommender(df, seed, seedCol, sim_matrix, topN=5):
    #get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

    # Obtain the index of the item that matches our seed
    idx = indices[seed]

    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))

    #delete the item that was passed in
    del sim_scores[idx]

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the top-n most similar items.
    sim_scores = sim_scores[:topN]

    # Get the item indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the topN most similar items
    return df.iloc[movie_indices]
#Add your code here

Question 18 - Using Just the Overview 2 points

Instead of using the soup, generate a similarity matrix using the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.

Hint: You should only need a few lines of code here...

What is the top movie?

  • The Amazing Spider Man

  • The Monkey King 2

  • Spider-Man 2

  • The Broadway Melody

  • Krull

#Add your code here

Question 19 - Using N-Grams of the Overview 2 points

Generate a similarity matrix using just 3 word phrases (n-grams) of the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.

Hint: You should only need a few lines of code here...
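
For reference, scikit-learn's vectorizers accept an `ngram_range=(min_n, max_n)` argument, and setting both ends to 3 keeps only three-word phrases. The sketch below uses two made-up sentences and no lemmatization, so it only illustrates what the resulting features look like.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences.
docs = ["young hero saves the city", "young hero leaves the city"]

# ngram_range=(3, 3) keeps only phrases of exactly three consecutive tokens.
vec = TfidfVectorizer(ngram_range=(3, 3))
vec.fit(docs)

# vocabulary_ maps each learned feature (here, each trigram) to a column index.
trigrams = sorted(vec.vocabulary_)
for phrase in trigrams:
    print(phrase)
```

Each five-word sentence yields three trigrams, and none are shared between the two sentences, so the vocabulary holds six features.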

What is the top movie?

  • The Amazing Spider Man

  • Pirates of the Caribbean: At World's End

  • John Carter

  • Spider-Man

  • Avatar

#Add your code here

Question 20 Soup + Overview 2 points

Now add the overview to your soup. Since we do not want the genres and keywords down-weighted for describing multiple movies, use a CountVectorizer with lemmatization and the lemmatized_stop_words. Once again, limit your features to 1000. (We're limiting features here just to speed up processing time.) Again find recommendations for 'Spider-Man 3.'
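
The down-weighting mentioned above can be observed directly. In this made-up example, the word "action" appears in every document, so TfidfVectorizer assigns it a lower inverse-document-frequency weight than a word that appears only once. CountVectorizer records the same count of 1 for both.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up soups: "action" describes every movie, "hero" only the first.
soups = ["action hero city", "action villain island", "action romance paris"]

tfidf = TfidfVectorizer().fit(soups)
idf_action = tfidf.idf_[tfidf.vocabulary_["action"]]
idf_hero = tfidf.idf_[tfidf.vocabulary_["hero"]]

# A term shared by all documents gets the smallest IDF weight.
print(idf_action < idf_hero)  # True

# Raw counts treat the shared and the rare term the same way.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(soups)
first_row = counts.toarray()[0]
print(first_row[count_vec.vocabulary_["action"]],
      first_row[count_vec.vocabulary_["hero"]])  # 1 1
```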

What is the top movie?

  • Spider-Man

  • The Amazing Spider-Man 2

  • Avatar

  • Escape from Planet Earth

  • Krull

#Add your code here