GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Homework/Lesson 13 HW - RecSys1/Homework_13.ipynb

Kernel: Python 3 (system-wide)

In [1]:

#Not included in Quiz/Solutions
# execute to import notebook styling for tables and width etc.
from IPython.display import HTML

# computational imports
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize    
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn
import string

Out[1]:

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

Week 13 Homework - Recommender Systems 1

When asking questions about homework in Piazza please use a tag in the subject line like HW1.3 to refer to Homework 1, Question 3. So the subject line might be HW1.3 question. Note there are no spaces in "HW1.3". This really helps keep Piazza easily searchable for everyone!

For full credit, all code in this notebook must be both executed in this notebook and copied to the Canvas quiz where indicated.

General Multiple Choice Questions

Question 1 2 points

When would you use the dot-product similarity function?

To calculate the similarity matrix for a Tfidf Vector Matrix
To calculate the similarity matrix for a Count Vector Matrix
To standardize text to root words
To combine columns of text before vectorization

Question 2 2 points

What is lemmatization?

Shortening words by removing suffixes and prefixes
Standardizing text to their root words
Generating a matrix of word counts
Chunking text into multi-word phrases

Build a Knowledge-Based Recommender

You will be using the data set tmdb-simplified.csv to build a simple knowledge-based recommender system. This data set can be found in the data folder in the same folder as this notebook.

You will need to use the option encoding = "ISO-8859-1" in the read_csv function in order to open this file.

Read in the file to a variable called "movies" and review the data.
Apply literal_eval to the genres, keywords, and production_companies columns. (They are already lists, not dictionaries.)
Filter out movies that have nothing (NaN) or zero in the budget column.
Determine how many rows are in this dataframe.

Note: This code is ungraded.
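As a rough illustration of the prep steps listed above, here is a minimal sketch on a made-up four-row table rather than tmdb-simplified.csv. The `toy` dataframe and its values are assumptions for demonstration only; reading the real file with read_csv and encoding = "ISO-8859-1" is left out.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: list columns arrive as strings,
# and budgets may be zero or missing.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["['action', 'crime']", "['comedy']", "[]", "['drama']"],
    "budget": [1_000_000, 0, np.nan, 250_000],
})

# Convert the stringified lists into real Python lists.
toy["genres"] = toy["genres"].apply(literal_eval)

# Keep only rows whose budget is present and greater than zero.
toy_filtered = toy[toy["budget"].notna() & (toy["budget"] > 0)]

print(type(toy_filtered["genres"].iloc[0]))
print(len(toy_filtered))
```

After literal_eval, each cell holds a list, so membership tests such as `'action' in row` behave as expected.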

Question 3 - How many rows of data are there? 1 point

You won't get the right answer if you don't filter out rows with NaNs and zero values for budget.

In [1]:

#Add your code here

Question 4 - Prep Work & Building a Filter Function (manually graded) 5 points

Before we build the recommender function that allows for user input, we're going to write a filter function that takes in manual (coded) input and filters our dataframe. Your function should take in parameters for the dataframe, two genres, a production company, and max budget. The filter should identify movies that meet the following criteria:

Have either genre
Are NOT made by the production company (the production company is not in the list of production companies)
Have a budget that is less than or equal to the max budget.

The function should return the filtered dataframe.

We've given you the function definition. Fill in the code.

Use the examples given in the lesson and Banik's book as a guide. (Do not explode. Use the lesson approach.)
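If it helps, here is one common way to test membership in a list-valued column without exploding it. This is a sketch on a hypothetical `toy` dataframe and may differ from the lesson's exact code; it shows only the masking technique, not the finished filter.

```python
import pandas as pd

# Hypothetical toy data: each cell of `genres` holds a Python list.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["action", "crime"], ["comedy"], ["adventure", "comedy"]],
})

# Row-wise membership test; the lambda receives one list at a time.
has_comedy = toy["genres"].apply(lambda g: "comedy" in g)

# Negate a boolean mask with ~ to express "is NOT in the list".
no_comedy = ~has_comedy

comedy_titles = toy.loc[has_comedy, "title"].tolist()
other_titles = toy.loc[no_comedy, "title"].tolist()
print(comedy_titles)
print(other_titles)
```

Boolean masks like these can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses.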

In [1]:

def filterMovies(df, genre1, genre2, company, budgetmax):
    '''
    Parameters:
    df: The pandas dataframe to filter
    genre1: A possible genre
    genre2: Another possible genre
    company: A production company that can not be in the production company column
    budgetmax: The maximum budget allowed

    Returns: a filtered dataframe
    '''
    ##############
    #write your code here
    ##############

Hint: If you call your function with the following parameters, you should be left with 27 movies:

* genres of 'action' and 'adventure' 
* production company: 'Beijing New Picture Film Co. Ltd.'
* max budget: 1000000

Question 5 Calling Your Filter Function 2 points

Call your function using the following parameters:

genres of 'action' and 'crime'
the production company 'Columbia Pictures'
max budget of 2 million (2000000).

Report how many movies are left.

In [2]:

#Add  your code here

Question 6 - Fetch the List of Unique Genres (multiple choice) 2 points

Using the examples from the lesson, generate a string of unique genres. Sort the genres alphabetically. Note: for Questions 5-7 you should be using the dataframe you produced in Question 3.

What is the 3rd word in the sorted string list?

fantasy
animation
comedy
adventure
crime

In [3]:

#Add your code here

Question 7 - Count the Number of Unique Production Companies 2 points

Using the examples from the lesson, generate a numpy array of production companies and determine the length of that array. How many unique production companies are there?
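For reference, a minimal sketch of collecting unique values from a list-valued column, using a hypothetical `toy` dataframe. The lesson's approach may differ; this version flattens the lists with a comprehension and deduplicates with np.unique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one list of companies per row (lists may be empty).
toy = pd.DataFrame({
    "production_companies": [["Studio X"], ["Studio Y", "Studio X"], [], ["Studio Z"]],
})

# Flatten the per-row lists into one flat list.
all_companies = [c for row in toy["production_companies"] for c in row]

# np.unique returns a sorted array of the distinct values.
unique_companies = np.unique(all_companies)

print(unique_companies)
print(len(unique_companies))
```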

In [4]:

#Add your code here

Question 8 - Creating the User Input Function (Manually graded) 5 points

We finally have all the pieces to create a function that returns the top N movies based on the IMDB score and the filter you wrote. We're going to modify and expand on the build_chart function from the lesson. Once again, we'll give you the function definition in the cell below. We are also giving you the weighted_rating function. Be sure to run that cell.

Your build_chart function should take in:

the dataframe to filter
the filter function (you've already written this)
the rater function (provided below)
a parameter called "filter_location" which should be either the string 'before' or the string 'after' (filter before or after computing m and C and scoring)
a number of movies to return.
use the 80th percentile to compute m

The function should return the top 'n' rows of a dataframe sorted in descending order of the score column. It will return whatever columns you pass in.

Unlike the lesson, we do not want you to filter out the movies with low vote counts!

There are two approaches to writing the build_chart function presented in the lesson.

The first prompts the user to input the values used for filtering,
the second approach allows the values to be passed as arguments to the build_chart function.

We recommend the second approach as it's much easier for testing and development.
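To make the scoring step concrete, here is a small sketch on a made-up five-movie table. It assumes m is the 80th percentile of vote_count and C is the mean of vote_average, and it repeats the weighted rating formula so the block runs on its own. The `toy` data is hypothetical, and the filter step is deliberately left out.

```python
import pandas as pd

# Hypothetical toy data, not the homework dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [10, 200, 50, 1000, 5],
    "vote_average": [9.0, 7.0, 8.0, 6.5, 10.0],
})

def weighted_rating(x, m, C):
    """IMDB weighted rating for one row (same formula as the provided cell)."""
    v = x["vote_count"]
    R = x["vote_average"]
    return (v / (v + m) * R) + (m / (m + v) * C)

# m: 80th percentile of vote counts; C: mean rating over the same rows.
m = toy["vote_count"].quantile(0.80)
C = toy["vote_average"].mean()

# Score every row (no low-vote-count cutoff), then sort descending.
scored = toy.copy()
scored["score"] = scored.apply(weighted_rating, axis=1, args=(m, C))
top = scored.sort_values("score", ascending=False).head(3)

print(m, C)
print(top[["title", "score"]])
```

Because m and C are computed from whichever rows are present at that moment, filtering before scoring and filtering after scoring can produce different scores for the same movie.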

In [47]:

#not included in quiz/solutions
#################
# Provided code. Run this cell 
#################

# Function to compute the IMDB weighted rating for each movie
def weighted_rating(x, m, C):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)

In [48]:


def build_chart(df, filter_func, rater, filter_location, n=10):
    '''
    Parameters
    df: the dataframe that will be filtered, scored, sorted (not necessarily in that order)
    filter_func: the function that's being used to filter (in this case, it would be the filterMovies function)
    rater: the function used to rate or score each movie (in this case, it would be the weighted_rating function)
    filter_location: either the string 'before' or the string 'after.' If 'before' is passed, the filter will be applied before scoring. If after, it will be applied after.
    n: the number of rows to return. Defaults to 10

    Returns
    The top n rows of the sorted dataframe
    '''
    #Add your code here

Hint: if you run the cell below, the first movie returned should be Monty Python and the Holy Grail

In [0]:

#not included in quiz/solutions
#Use the following: 'action', 'adventure', 'Beijing New Picture Film Co. Ltd.', 1000000, filtered after
build_chart(movies, filterMovies, weighted_rating, 'after', 5) # modify to pass filtering values ...

Question 9 - Testing Your Function 2 points

Feel free to modify your build_chart function to allow the inputs to be passed to the function as we did in the lesson. It makes for easier testing.

Run your build_chart with the following parameters:

genre 1 = horror
genre 2 = mystery
production company = Paramount Pictures
max budget = 1500000
n = 7
filter before scoring

What is the final movie in your chart?

The Evil Dead
Night of the Living Dead
Saw
Eraserhead
Rebecca

In [5]:

#add your code here

Question 10 - Filter After 2 points

Now use the same parameters, but perform the filter after you apply the scores.

What is the final movie in your chart?

The Evil Dead
Night of the Living Dead
Insidious
Eraserhead
Rebecca

In [6]:

#Add your code here

Preparing to Build a Content-Based Recommender

In this section of the homework, you will prepare to build a content-based recommender that can flexibly use either CountVectorizer or TfidfVectorizer. We're including our lemmatization setup code for you. Run the cell below then proceed to part a.

In [8]:

#not included in quiz/solutions
#################################
# This cell does all the set up work - you only need to run this once per notebook
#################################

#Create a helper function to get part of speech
def get_wordnet_pos(word, pretagged = False):
    """Map POS tag to first character lemmatize() accepts"""
    if pretagged:
        tag = word[1].upper() 
    else:
        tag = nltk.pos_tag([word])[0][1][0].upper()
    
    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}

    return tag_dict.get(tag, wn.NOUN)

#create a tokenizer that uses lemmatization (reducing words to their root forms)
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        
        #get the sentences
        sents = sent_tokenize(articles)
        #get the parts of speech for sentence tokens
        sent_pos = [nltk.pos_tag(word_tokenize(s)) for s in sents]
        #flatten the list
        pos = [item for sublist in sent_pos for item in sublist]
        #lemmatize based on POS (otherwise, all words are nouns)
        lems = [self.wnl.lemmatize(t[0], get_wordnet_pos(t, True)) for t in pos if t[0] not in string.punctuation]
        #clean up in-word punctuation
        lems_clean = [''.join(c for c in s if c not in string.punctuation) for s in lems]
        return lems_clean 


    
#lemmatize the stop words
lemmatizer = WordNetLemmatizer()
lemmatized_stop_words = [lemmatizer.lemmatize(w) for w in text.ENGLISH_STOP_WORDS]
#extend the stop words with any other words you want to add, these are bits of contractions
lemmatized_stop_words.extend(['ve','nt','ca','wo','ll'])

Question 11 - Create the fetchSimilarityMatrix Function (Manually Graded) 5 points

We know that we have two kinds of vectorization we can do, and each requires a slightly different similarity matrix. Let's create a wrapper function that has the following parameters:

df: the dataframe holding our data
soupCol: the string name of the column holding our soup (this should already be ready to go - you shouldn't be creating your soup inside this function)
vectorizer: an initialized vectorizer. This will either be a TfidfVectorizer or a CountVectorizer
vectorType: a string representing either Tfidf or Count to indicate which type of vectorizer we are using

Inside your function, you'll:

make sure your soup has no NaN (fill with empty strings)
fit_transform your soup into a number matrix
if the vector type is 'Tfidf', use the linear_kernel() function to generate a similarity matrix
if the vector type is 'Count', use the cosine_similarity() function to generate a similarity matrix
return the sparse similarity matrix
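To see why the two vector types are paired with different functions, here is a short sketch on three made-up sentences. It relies on TfidfVectorizer's default L2 normalization of each row, which makes the plain dot product equal to the cosine similarity; raw counts are not normalized, so the two results differ there. The `docs` list is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus.
docs = ["space hero saves planet", "hero saves city", "romantic comedy in paris"]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # rows are L2-normalized by default
count_matrix = CountVectorizer().fit_transform(docs)   # raw counts, not normalized

tfidf_dot = linear_kernel(tfidf_matrix, tfidf_matrix)
tfidf_cos = cosine_similarity(tfidf_matrix, tfidf_matrix)
count_dot = linear_kernel(count_matrix, count_matrix)
count_cos = cosine_similarity(count_matrix, count_matrix)

# Same result for TF-IDF, different result for counts.
print(np.allclose(tfidf_dot, tfidf_cos))
print(np.allclose(count_dot, count_cos))
```

The dot product skips the normalization step, which is why it is the cheaper choice when the rows already have unit length.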

In [7]:


def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
    '''
    Parameters
    df: the dataframe containing a soup column to transform
    soupCol: The string title of the soup column (or any column containing strings for which you want similarities)
    vectorizer: an initialized vectorizer, with all pre-processing you desire
    vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.

    Returns
    Sparse Similarity Matrix
    '''

    #Add your code here
    
    # 1. remove NaN from df for the designated column
    # 2. compute the count or tfidf matrix using the vectorizer you passed in (use fit_transform)

    #apply the appropriate vectorizer
    if(vectorType=='Tfidf'):
        print('Using Linear Kernel (Tfidf)')
        # 3.  compute the sim matrix as shown in the lesson for a tfidf matrix, call it sim
    else:
        print('Using Cosine_similarity')
        # 4.  compute the sim matrix as shown in the lesson for a count matrix, call it sim
    return(sim)

Hint: Running the code below should return 0.2

In [11]:

#hint code: not included in quiz/solutions
# Read in some ted talk data
ted = pd.read_csv('data/ted-simplified.csv')

#Define a TF-IDF Vectorizer Object. Use the LemmaTokenizer defined above, convert to lowercase, and remove stopwords, and only use the top 100 features.
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, max_features = 100)

sim = fetchSimilarityMatrix(ted, 'description', tfidf, 'Tfidf')
round(sim[1,0], 2)

Out[11]:

Using Linear Kernel (Tfidf)

0.2

Question 12 - Test Your fetchSimilarityMatrix Function 2 points

Using the ted data we read in for you above, initialize a CountVectorizer that uses 'english' stop words, lowercase, and all the features. Call the fetchSimilarityMatrix function, using the column 'topics' for your soup.

What is the value at the [0,2] position in your matrix (rounded to 2 digits)?

In [8]:

#Add your code here

Question 13 - Preparing the Movies Metadata Soup (Manually Graded) 5 points

For this problem we'll be using the same data set tmdb-simplified.csv to build a meta-data based recommender by creating a "soup" based on:

all genres
all keywords
all production companies

You will need to sanitize the production companies and the keywords (but not genres). Review the self-assessment solution for code to sanitize.

Make sure that you concatenate the columns in the order listed (genres, then keywords, then production companies).

Do not reload the data, just use the dataframe you created and filtered in Question 3.
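As a rough sketch of what sanitizing and concatenating can look like, the example below uses a one-row hypothetical dataframe. The `sanitize` helper here is an assumption (lowercase each string and remove its spaces) and may not match the self-assessment solution exactly.

```python
import pandas as pd

# Hypothetical one-row example; every column holds a list of strings.
toy = pd.DataFrame({
    "genres": [["drama", "thriller"]],
    "keywords": [["time travel", "robot"]],
    "production_companies": [["Studio X", "Big Film Co"]],
})

def sanitize(items):
    """Hypothetical helper: lowercase each string and strip its spaces."""
    return [item.replace(" ", "").lower() for item in items]

# Sanitize keywords and production companies, but leave genres alone.
toy["keywords"] = toy["keywords"].apply(sanitize)
toy["production_companies"] = toy["production_companies"].apply(sanitize)

# Join each list into a string, then concatenate in the required order.
toy["soup"] = (
    toy["genres"].apply(" ".join) + " "
    + toy["keywords"].apply(" ".join) + " "
    + toy["production_companies"].apply(" ".join)
)

print(toy["soup"].iloc[0])
```

Removing the spaces inside multi-word names keeps the vectorizer from treating "Studio" and "X" as two unrelated tokens.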

In [9]:

#Add your code here

Question 14 What is the soup for Spider-Man 3? 2 points

If your soup text has text that looks like this: "d u a l i d e n t i t y" it's probably because you applied .join()
to a string and not a list of strings. Make sure .join() is only applied to lists of strings.
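The difference is easy to reproduce in isolation. In this small sketch (the values are made up), joining a list separates its items, while joining a plain string separates its characters, because a string is itself an iterable of characters.

```python
as_list = ["timetravel", "robot"]   # a list of strings
as_string = "timetravel"            # a single string

# Joining a list puts a space between items.
joined_list = " ".join(as_list)

# Joining a string puts a space between every character.
joined_string = " ".join(as_string)

print(joined_list)
print(joined_string)
```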

There are lots of different ways to extract text from a Pandas dataframe. You can use whatever way you choose, just make sure that you're able to see the complete text. Spider-Man 3 should be the 6th row in your dataframe (so with zero-based indexes, that would be [5]). We recommend that you confirm that you're reviewing the correct row. Once you're sure you're looking at the correct row, select which of the following is the correct soup for Spider-Man 3.

fantasy action adventure dualidentity amnesia sandstorm columbiapictures lauraziskinproductions marvelenterprises
fantasy action adventure dualidentityamnesiasandstorm columbiapictures lauraziskinproductions marvelenterprises
fantasy action adventure dual identity amnesia sandstorm Columbia Pictures Laura Ziskin Productions Marvel Enterprises
fantasy action adventure d u a l i d e n t i t y a m n e s i a s a n d s t o r m columbiapictures lauraziskinproductions marvelenterprises

In [10]:

#Add your code here

Question 15 Create Your Movie Similarity Matrix (Manually Graded) 2 points

Instantiate a CountVectorizer that converts text to lowercase, removes 'english' stop words, and uses a maximum of 1000 features. Using this instance and your fetchSimilarityMatrix function, fetch the appropriate similarity matrix for the movie df's "soup" column. Do not use LemmaTokenizer this time.

In [11]:

#Add your code here

Question 16 Determine Similarity between two movies 1 point

There are many ways to use the matrix to determine the similarity between any two movies. In the cell below, we determine the similarity between 'Spider-Man 3' and 'The Dark Knight Rises' rounded to 2 decimal places. Do not use LemmaTokenizer this time.

Hint: it should be 0.11

Based on this sample code, determine the similarity between 'Primer' and 'Avatar', rounded to 2 decimal places.

In [19]:

#hint code
simdf = pd.DataFrame(sim, index=movies['title'], columns=movies['title'])
round(simdf['Spider-Man 3']['The Dark Knight Rises'], 2)

Out[19]:

0.11

In [12]:

#Add your code here

Question 17 Generating Recommendations from the MetaData Soup 2 points

Finally! We have all our pieces and we can run our meta-data based content recommender. Use the pieces that you've done so far and the content_recommender function from the lesson (copied for you below) to determine the top 5 movies related to the "title" (that's your seed column) of "Spider-Man 3" - based on the similarity matrix you've already generated above.

What is the top movie?

The Amazing Spider Man
The Monkey King 2
Spider-Man 2
The Broadway Melody
Krull

In [21]:

#not included in quiz/solutions
def content_recommender(df, seed, seedCol, sim_matrix,  topN=5): 
    #get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()
    
    # Obtain the index of the item that matches our seed
    idx = indices[seed]
    
    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))
    
    #delete the item that was passed in
    del sim_scores[idx]
    
    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top-n most similar items.
    sim_scores = sim_scores[:topN]
    
    # Get the item indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the topN most similar items
    return df.iloc[movie_indices]

In [13]:

#Add your code here

Question 18 - Using Just the Overview 2 points

Instead of using the soup, generate a similarity matrix using the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.

Hint: You should only need a few lines of code here...

What is the top movie?

The Amazing Spider Man
The Monkey King 2
Spider-Man 2
The Broadway Melody
Krull

In [14]:

#Add your code here

Question 19 - Using N-Grams of the Overview 2 points

Generate a similarity matrix using just 3 word phrases (n-grams) of the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features. Generate the top 5 recommendations for 'Spider-Man 3' again.

Hint: You should only need a few lines of code here...
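For orientation, the sketch below shows how scikit-learn's ngram_range parameter restricts a vectorizer to three-word phrases. It uses two made-up sentences and the default tokenizer, not the LemmaTokenizer or stop words from this notebook, so treat it as an illustration of the parameter only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus.
docs = ["a young hero saves the city", "the young hero saves the day"]

# ngram_range=(3, 3) keeps only trigrams; (1, 3) would keep 1-, 2-, and 3-grams.
trigram_vec = TfidfVectorizer(ngram_range=(3, 3))
matrix = trigram_vec.fit_transform(docs)

# vocabulary_ maps each learned phrase to its column index.
features = sorted(trigram_vec.vocabulary_)

print(features)
print(matrix.shape)
```

Note that the default tokenizer ignores one-character tokens such as "a", so the first sentence contributes trigrams starting from "young".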

What is the top movie?

The Amazing Spider Man
Pirates of the Caribbean: At World's End
John Carter
Spider-Man
Avatar

In [15]:

#Add your code here

Question 20 Soup + Overview 2 points

Now add the overview to your soup. Since we do not want the genres and keywords down-weighted for describing multiple movies, use a CountVectorizer with lemmatization and the lemmatized_stop_words. Once again, limit your features to 1000. (We're limiting features here just to speed up processing time.) Again find recommendations for 'Spider-Man 3.'

What is the top movie?

Spider-Man
The Amazing Spider-Man 2
Avatar
Escape from Planet Earth
Krull

In [16]:

#Add your code here
