Week 13 Homework - Recommender Systems 1
When asking questions about homework in Piazza please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be HW1.3 question. Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!
For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.
General Multiple Choice Questions
Question 1 2 points
When would you use the dot-product similarity function?
To calculate the similarity matrix for a Tfidf Vector Matrix
To calculate the similarity matrix for a Count Vector Matrix
To standardize text to root words
To combine columns of text before vectorization
Question 2 2 points
What is lemmatization?
Shortening words by removing suffixes and prefixes
Standardizing text to their root words
Generating a matrix of word counts
Chunking text into multi-word phrases
Build a Knowledge-Based Recommender
You will be using the data set tmdb-simplified.csv to build a simple knowledge-based recommender system. This data set can be found in the data folder in the same folder as this notebook.
You will need to use the option encoding = "ISO-8859-1" in the read_csv function in order to open this file.
Read in the file to a variable called "movies" and review the data.
Apply literal_eval to the genres, keywords, and production_companies columns. (They are already lists, not dictionaries.)
Filter out movies that have nothing or zero in the budget.
Determine how many rows are in this dataframe.
Note: This code is ungraded.
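The preparation steps above can be sketched on a tiny in-memory stand-in for the CSV. The rows below are invented for illustration and are not taken from tmdb-simplified.csv; with the real file you would pass its path and encoding = "ISO-8859-1" to read_csv instead of the StringIO object.

```python
import io
from ast import literal_eval

import pandas as pd

# Toy stand-in for the real CSV (hypothetical rows, not the homework data).
csv_text = '''title,budget,genres,production_companies
Movie A,1000000,"['action', 'crime']","['Studio One']"
Movie B,0,"['comedy']","['Studio Two']"
Movie C,,"['drama']","[]"
'''

# With the real file, assuming it sits in the data folder:
# movies = pd.read_csv("data/tmdb-simplified.csv", encoding="ISO-8859-1")
movies = pd.read_csv(io.StringIO(csv_text))

# The list columns are read as strings such as "['action', 'crime']";
# literal_eval turns each one back into a Python list.
for col in ["genres", "production_companies"]:
    movies[col] = movies[col].apply(literal_eval)

# Keep only rows whose budget is present and greater than zero.
movies = movies[movies["budget"].fillna(0) > 0]

print(len(movies))
```

Treating a missing budget as zero before comparing is one way to handle "nothing or zero" in a single step; dropping NaN rows first and then filtering on zero would be equivalent.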
Question 3 - How many rows of data are there? 1 point
Question 4 - Prep Work & Building a Filter Function (manually graded) 5 points
Before we build the recommender function that allows for user input, we're going to write a filter function that takes in manual (coded) input and filters our dataframe. Your function should take in parameters for the dataframe, two genres, a production company, and max budget. The filter should identify movies that meet the following criteria:
Have either genre
Are NOT made by the production company (the production company is not in the list of production companies)
Have a budget that is less than or equal to the max budget.
The function should return the filtered dataframe.
We've given you the function definition. Fill in the code.
Use the examples given in the lesson and Banik's book as a guide. (Do not explode. Use the lesson approach.)
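As a rough illustration of the three criteria, here is a minimal sketch that tests list membership row by row without exploding the list columns. The function name, parameter names, and toy rows are assumptions for this example and may differ from the function definition provided in the notebook.

```python
import pandas as pd

def filter_movies_sketch(df, genre1, genre2, company, max_budget):
    """Hypothetical filter: either genre, not the company, budget within the cap."""
    # True where the row's genre list contains at least one of the two genres.
    has_genre = df["genres"].apply(lambda g: genre1 in g or genre2 in g)
    # True where the company is absent from the row's production company list.
    not_company = df["production_companies"].apply(lambda c: company not in c)
    # True where the budget is less than or equal to the maximum.
    within_budget = df["budget"] <= max_budget
    return df[has_genre & not_company & within_budget]

# Invented rows for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["action"], ["crime", "drama"], ["comedy"], ["action"]],
    "production_companies": [["Studio One"], ["Columbia Pictures"],
                             ["Studio Two"], ["Studio Two"]],
    "budget": [1_500_000, 1_000_000, 500_000, 5_000_000],
})

result = filter_movies_sketch(toy, "action", "crime", "Columbia Pictures", 2_000_000)
print(result["title"].tolist())
```

Building three boolean Series and combining them with & keeps each criterion readable and easy to test on its own.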
Hint: If you call your function with the following parameters, you should be left with 27 movies:
Question 5 Calling Your Filter Function 2 points
Call your function using the following parameters:
genres of 'action' and 'crime'
the production company 'Columbia Pictures'
max budget of 2 million (2000000).
Report how many movies are left.
Question 6 - Fetch the List of Unique Genres (multiple choice) 2 points
Using the examples from the lesson, generate a string of unique genres. Sort the genres alphabetically. Note: for Questions 5-7 you should be using the dataframe you produced in Question 3.
What is the 3rd word in the sorted string list?
fantasy
animation
comedy
adventure
crime
Question 7 - Count the Number of Unique Production Companies 2 points
Using the examples from the lesson, generate a numpy array of production companies and determine the length of that array. How many unique production companies are there?
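One possible way to collect unique values from a column of lists is shown below on invented data. The lesson's own approach may differ; this sketch only illustrates flattening, de-duplicating, and sorting.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "genres": [["crime", "action"], ["action", "comedy"], ["adventure"]],
    "production_companies": [["Studio One", "Studio Two"], ["Studio One"], []],
})

# Flatten the genre lists, drop duplicates with a set, then sort alphabetically.
unique_genres = sorted({g for genres in toy["genres"] for g in genres})
genre_string = ", ".join(unique_genres)

# Flatten the company lists and let numpy return the sorted unique values.
all_companies = [c for companies in toy["production_companies"] for c in companies]
unique_companies = np.unique(all_companies)

print(genre_string)
print(len(unique_companies))
```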
Question 8 - Creating the User Input Function (Manually graded) 5 points
We finally have all the pieces to create a function that returns the top N movies based on the IMDB score and the filter you wrote. We're going to modify/expand on the build_chart function from the lesson. Once again, we'll give you the function definition in the cell below. We are also giving you the weighted_rating function. Be sure to run that cell.
Your build_chart function should take in:
the dataframe to filter
the filter function (you've already written this)
the rater function (provided below)
a parameter called "filter_location" which should be either the string 'before' or the string 'after' (filter before or after computing m and C and scoring)
a number of movies to return.
use the 80th percentile to compute m
The function should return the top 'n' rows of a dataframe sorted in descending order of the score column. It will return whatever columns you pass in.
There are two approaches to writing the build_chart function presented in the lesson.
The first prompts the user to input the values used for filtering,
the second approach allows the values to be passed as arguments to the build_chart function.
We recommend the second approach as it's much easier for testing and development.
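To make the 'before' versus 'after' distinction concrete, here is a minimal sketch under several assumptions: the rater uses the IMDB-style weighted rating (v/(v+m))·R + (m/(v+m))·C, the columns are named vote_count and vote_average, and the filter is passed as a one-argument callable. It also omits any cut on a minimum vote count, which the lesson version may include. None of these names come from the notebook's provided cells.

```python
import pandas as pd

def weighted_rating_sketch(row, m, C):
    # IMDB-style weighted rating; column names are assumptions.
    v = row["vote_count"]
    R = row["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C

def build_chart_sketch(df, filter_fn, rater, filter_location="before", n=5):
    data = df.copy()
    if filter_location == "before":
        # m and C are computed from the filtered movies only.
        data = filter_fn(data).copy()
    C = data["vote_average"].mean()
    m = data["vote_count"].quantile(0.80)  # 80th percentile of vote counts
    data["score"] = data.apply(rater, axis=1, args=(m, C))
    if filter_location == "after":
        # m and C came from the full dataframe; filter the scored rows now.
        data = filter_fn(data)
    return data.sort_values("score", ascending=False).head(n)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [100, 200, 300, 400, 500],
    "vote_average": [8.0, 6.0, 7.0, 9.0, 5.0],
    "budget": [1_000_000, 1_000_000, 5_000_000, 1_000_000, 5_000_000],
})

def cheap_only(df):
    return df[df["budget"] <= 2_000_000]

before = build_chart_sketch(toy, cheap_only, weighted_rating_sketch, "before", n=3)
after = build_chart_sketch(toy, cheap_only, weighted_rating_sketch, "after", n=3)
print(before[["title", "score"]])
print(after[["title", "score"]])
```

The same movies survive in both runs, but their scores differ because m and C are computed from different sets of rows. On real data this can change the ordering of the chart.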
Hint: if you run the cell below, the first movie returned should be Monty Python and the Holy Grail
Question 9 - Testing Your Function 2 points
Feel free to modify your build_chart function to allow the inputs to be passed to the function as we did in the lesson. It makes for easier testing.
Run your build_chart with the following parameters:
genre 1 = horror
genre 2 = mystery
production company = Paramount Pictures
max budget = 1500000
n = 7
filter before scoring
What is the final movie in your chart?
The Evil Dead
Night of the Living Dead
Saw
Eraserhead
Rebecca
Question 10 - Filter After 2 points
Now use the same parameters, but perform the filter after you apply the scores.
What is the final movie in your chart?
The Evil Dead
Night of the Living Dead
Insidious
Eraserhead
Rebecca
Preparing to Build a Content-Based Recommender
In this section of the homework, you will prepare to build a content-based recommender that can flexibly use either CountVectorizer or TfidfVectorizer. We're including our lemmatization setup code for you. Run the cell below then proceed to part a.
Question 11 - Create the fetchSimilarityMatrix Function (Manually Graded) 5 points
We know that we have two kinds of vectorization we can do, and each requires a slightly different similarity matrix. Let's create a wrapper function that has the following parameters:
df: the dataframe holding our data
soupCol: the string name of the column holding our soup (this should already be ready to go - you shouldn't be creating your soup inside this function)
vectorizer: an initialized vectorizer. This will either be a TfidfVectorizer or a CountVectorizer
vectorType: a string representing either Tfidf or Count to indicate which type of vectorizer we are using
Inside your function, you'll:
make sure your soup has no NaN (fill with empty strings)
fit_transform your soup into a number matrix
if the vector type is 'Tfidf', use the linear_kernel() function to generate a similarity matrix
if the vector type is 'Count', use the cosine_similarity() function to generate a similarity matrix
return the sparse similarity matrix
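The steps in this list could look roughly like the sketch below. The function and parameter names are deliberately different from the required ones, and the toy dataframe is invented, so treat it as an illustration of the control flow and not as the graded solution.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

def fetch_similarity_matrix_sketch(df, soup_col, vectorizer, vector_type):
    # Replace missing soup values with empty strings so the vectorizer accepts them.
    soup = df[soup_col].fillna("")
    # Turn the text into a document-term matrix.
    matrix = vectorizer.fit_transform(soup)
    if vector_type == "Tfidf":
        # TfidfVectorizer L2-normalizes rows by default, so a plain dot product
        # already equals cosine similarity.
        return linear_kernel(matrix, matrix)
    if vector_type == "Count":
        # Raw counts are not normalized, so cosine_similarity does the scaling.
        return cosine_similarity(matrix, matrix)
    raise ValueError("vector_type must be 'Tfidf' or 'Count'")

# Invented rows for illustration only.
toy = pd.DataFrame({"soup": ["space robot war", "space robot love", None]})

count_sim = fetch_similarity_matrix_sketch(toy, "soup", CountVectorizer(), "Count")
tfidf_sim = fetch_similarity_matrix_sketch(toy, "soup", TfidfVectorizer(), "Tfidf")
print(count_sim.round(2))
print(tfidf_sim.round(2))
```

With their default arguments, both linear_kernel and cosine_similarity return a dense NumPy array even when the input matrix is sparse.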
Hint: Running the code below should return 0.2
Question 12 - Test Your fetchSimilarityMatrix Function 2 points
Using the ted data we read in for you above, initialize a CountVectorizer that uses 'english' stop words, lowercase, and all the features. Call the fetchSimilarityMatrix function, using the column 'topics' for your soup.
What is the value at the [0,2] position in your matrix (rounded to 2 digits)?
Question 13 - Preparing the Movies Metadata Soup (Manually Graded) 5 points
For this problem we'll be using the same data set tmdb-simplified.csv to build a meta-data based recommender by creating a "soup" based on:
all genres
all keywords
all production companies
You will need to sanitize the production companies and the keywords (but not genres). Review the self-assessment solution for code to sanitize.
Make sure that you concatenate the columns in the order listed (genres, then keywords, then production companies).
Do not reload the data, just use the dataframe you created and filtered in Question 3.
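A minimal sketch of sanitizing and concatenating list columns into a soup follows. The sanitize helper here is a guess at what the self-assessment code does (lowercase and remove spaces); the rows are invented.

```python
import pandas as pd

def sanitize_sketch(items):
    # Lowercase each entry and remove spaces so a multi-word term becomes one token.
    return [item.replace(" ", "").lower() for item in items]

# Invented row for illustration only.
toy = pd.DataFrame({
    "genres": [["fantasy", "action"]],
    "keywords": [["time travel", "robot"]],
    "production_companies": [["Studio One", "Studio Two"]],
})

# Sanitize keywords and production companies, but leave genres untouched.
toy["keywords"] = toy["keywords"].apply(sanitize_sketch)
toy["production_companies"] = toy["production_companies"].apply(sanitize_sketch)

# Concatenate the three lists in the stated order, then join with single spaces.
toy["soup"] = toy.apply(
    lambda row: " ".join(row["genres"] + row["keywords"] + row["production_companies"]),
    axis=1,
)

print(toy.loc[0, "soup"])
```

Concatenating the lists before calling " ".join() ensures the join runs over individual strings. Joining a plain string instead of a list would insert a space between every character.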
Question 14 What is the soup for Spider-Man 3? 2 points
to a string and not a list of strings. Make sure .join() is only applied to lists of strings.
There are lots of different ways to extract text from a Pandas dataframe. You can use whatever way you choose, just make sure that you're able to see the complete text. Spider-Man 3 should be the 6th row in your dataframe (so with zero-based indexes, that would be [5]). We recommend that you confirm that you're reviewing the correct row. Once you're sure you're looking at the correct row, select which of the following is the correct soup for Spider-Man 3.
fantasy action adventure dualidentity amnesia sandstorm columbiapictures lauraziskinproductions marvelenterprises
fantasy action adventure dualidentityamnesiasandstorm columbiapictures lauraziskinproductions marvelenterprises
fantasy action adventure dual identity amnesia sandstorm Columbia Pictures Laura Ziskin Productions Marvel Enterprises
fantasy action adventure d u a l i d e n t i t y a m n e s i a s a n d s t o r m columbiapictures lauraziskinproductions marvelenterprises
Question 15 Create Your Movie Similarity Matrix (Manually Graded) 2 points
Instantiate a CountVectorizer instance, converting to lowercase and removing 'english' stop words and a maximum of 1000 features. Using this instance and your fetchSimilarityMatrix function, fetch the appropriate similarity matrix for the movie df's "soup" column. Do not use LemmaTokenizer this time.
Question 16 Determine Similarity between two movies 1 point
There are many ways to use the matrix to determine the similarity between any two movies. In the cell below, we determine the similarity between 'Spider-Man 3' and 'The Dark Knight Rises' rounded to 2 decimal places. Do not use LemmaTokenizer this time.
Hint: it should be 0.11
Based on this sample code, determine the similarity between 'Primer' and 'Avatar', rounded to 2 decimal places.
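One hypothetical way to look up the similarity between two titles is to map each title to its row position and index the matrix with the pair. The titles and matrix values below are invented, and the sketch assumes titles are unique.

```python
import numpy as np
import pandas as pd

# Invented titles and a made-up symmetric similarity matrix.
toy = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"]})
sim = np.array([
    [1.0, 0.25, 0.5],
    [0.25, 1.0, 0.75],
    [0.5, 0.75, 1.0],
])

# Map each title to its positional row number in the dataframe.
positions = pd.Series(range(len(toy)), index=toy["title"])

def similarity_between(title_a, title_b):
    # Row and column positions in the matrix follow the dataframe's row order.
    return round(float(sim[positions[title_a], positions[title_b]]), 2)

print(similarity_between("Movie A", "Movie C"))
```

Using positions instead of index labels matters when the dataframe was filtered without resetting its index, because the matrix rows follow row order, not the original labels.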
Question 17 Generating Recommendations from the MetaData Soup 2 points
Finally! We have all our pieces and we can run our meta-data based content recommender. Use the pieces that you've done so far and the content_recommender function from the lesson (copied for you below) to determine the top 5 movies related to the "title" (that's your seed column) of "Spider-Man 3" - based on the similarity matrix you've already generated above.
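For orientation, a recommender of this kind generally ranks every other row by its similarity to the seed row. The sketch below is not the lesson's content_recommender; its name, signature, and toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def recommend_sketch(df, seed, seed_col, sim_matrix, top_n=5):
    # Map seed values (for example, titles) to row positions.
    positions = pd.Series(range(len(df)), index=df[seed_col])
    idx = positions[seed]
    # Pair each row position with its similarity to the seed row.
    scores = list(enumerate(sim_matrix[idx]))
    # Sort from most to least similar.
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Skip the seed itself and keep the top_n remaining positions.
    top = [i for i, _ in scores if i != idx][:top_n]
    return df.iloc[top][seed_col].tolist()

# Invented titles and a made-up symmetric similarity matrix.
toy = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"]})
sim = np.array([
    [1.0, 0.25, 0.5],
    [0.25, 1.0, 0.75],
    [0.5, 0.75, 1.0],
])

print(recommend_sketch(toy, "Movie A", "title", sim, top_n=2))
```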
What is the top movie?
The Amazing Spider Man
The Monkey King 2
Spider-Man 2
The Broadway Melody
Krull
Question 18 - Using Just the Overview 2 points
Instead of using the soup, generate a similarity matrix using the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.
Hint: You should only need a few lines of code here...
What is the top movie?
The Amazing Spider Man
The Monkey King 2
Spider-Man 2
The Broadway Melody
Krull
Question 19 - Using N-Grams of the Overview 2 points
Generate a similarity matrix using just 3 word phrases (n-grams) of the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.
Hint: You should only need a few lines of code here...
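Restricting a vectorizer to 3-word phrases is done with the ngram_range parameter. The sketch below uses invented sentences and leaves out the lemmatizing tokenizer and lemmatized stop words from the setup cell, so it only illustrates the n-gram setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Invented overview texts for illustration only.
overviews = [
    "a young hero fights a masked villain",
    "a young hero fights a giant robot",
    "two friends open a small bakery",
]

# ngram_range=(3, 3) keeps only 3-word phrases; max_features caps the vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(3, 3), max_features=1000)
matrix = vectorizer.fit_transform(overviews)

# TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity.
sim = linear_kernel(matrix, matrix)

print(sorted(vectorizer.vocabulary_))
print(sim.round(2))
```

The default token pattern drops one-character tokens such as "a" before the phrases are formed, so the first two texts share the phrase "young hero fights" and the third shares nothing with either.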
What is the top movie?
The Amazing Spider Man
Pirates of the Caribbean: At World's End
John Carter
Spider-Man
Avatar
Question 20 Soup + Overview 2 points
Now add the overview to your soup. Since we do not want the genres and keywords down-weighted for describing multiple movies, use a CountVectorizer with lemmatization and the lemmatized_stop_words. Once again, limit your features to 1000. (We're limiting features here just to speed up processing time.) Again find recommendations for 'Spider-Man 3.'
What is the top movie?
Spider-Man
The Amazing Spider-Man 2
Avatar
Escape from Planet Earth
Krull