Path: blob/main/Lessons/Lesson 13 - RecSys 1/Lesson_13.ipynb
Week 13: Recommender Systems 1
Recommender Systems are a complicated topic. This week's lesson and next week's lesson are intended to just give you a small taste of what recommender systems can do. To truly implement a recommender system, you'd want to dive much deeper. It's a topic worthy of more study. Recommender systems are all around you. Let's explore one recommender system you're probably familiar with, Netflix.
This week we'll look at two types of recommenders: Knowledge-Based Recommenders and Content-Based Recommenders. These systems are both used to address the "cold-start" problem. The cold-start problem is the idea that when you first launch a recommender site, you probably don't have data about what a user likes or how your items have been rated by other users. This lack of data makes it difficult to return really good recommendations, as most of the best recommenders rely on having ratings data. It also makes it impossible to quantify the quality of your recommendations. If you were building recommenders like this in real life, you might want to do some user-testing to validate if your recommender system is making good suggestions. Here, we'll just rely on our own best judgement.
Knowledge-Based Recommender
The knowledge-based recommender is just a simple ranked chart of data built using some input from the user. Banik describes it as a recommender that:
Gets user input on their preferences.
Extracts all the records that match the conditions set by the user.
Calculates a score to sort the records.
Returns the sorted results.
The Data
Like most data science projects, preparing/wrangling/cleaning your data for your recommender system can be the hardest part. In the lessons and homework, we'll be providing you data that has already been simplified to some extent, so we can focus more on the concepts of recommender systems. For our knowledge-based recommender, we're going to be using the movies_metadata_clean.csv file in the data directory of the lesson. If you'd like to see how we got to this file from the movies_metadata.csv file used in the book, please review the "DataCleaning" notebook in the extras directory.
We can see that genres looks like a list. To get Python to treat it like a list, we'll need to apply literal_eval
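Here's a minimal sketch of that step, assuming the file lives in the lesson's data directory and that genres is stored as a stringified list:

```python
import pandas as pd
from ast import literal_eval

# Load the cleaned file (path assumed relative to this notebook).
df = pd.read_csv('data/movies_metadata_clean.csv')

# genres is stored as a string that *looks* like a list ("['Comedy', 'Family']");
# literal_eval safely turns it into an actual Python list.
df['genres'] = df['genres'].apply(literal_eval)
df['genres'].head()
```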
Avoiding "Explode"
Banik would have you "explode" the genres. He does this to make it easier to filter on the genres column, but it has an unfortunate side effect of making duplicate rows per movie. It's easy to avoid that by using a different method of filtering. Let's see how that works.
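A sketch of the idea: instead of exploding, we check list membership row by row, which keeps exactly one row per movie. (The title column name is an assumption based on the movies metadata.)

```python
# Filter to movies whose genre list contains 'Family' -- no duplicate rows created.
family = df[df['genres'].apply(lambda genre_list: 'Family' in genre_list)]
family[['title', 'genres']].head()
```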
Fetching unique values from a column of lists
The other thing we need to do, where exploding would have helped, is to get the list of unique genres. Let's look at how we could do that. We'll do it slightly differently than Banik, and we'll break it down into steps first so you can see what's happening.
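One way this might look, broken into steps (exploding just the genres Series doesn't touch the dataframe itself, so no movie rows get duplicated):

```python
# Step 1: flatten the column of lists into one long Series of single genres.
all_genres = df['genres'].explode()

# Step 2: drop missing values, keep unique genres, and sort for readability.
unique_genres = sorted(all_genres.dropna().unique())
print(unique_genres)
```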
Self-Assessment: Modularize Fetching Unique Items
Getting a unique list of items from a Pandas column that contains lists is exactly the kind of thing you might have to do fairly often. Even though Pandas makes this fairly easy, let's modularize the code. Write a function that takes in a dataframe, the name of the column from which to pull unique lists, a type parameter to determine if you should return a numpy array or a string, and a True/False sort parameter to determine if the returned list should be sorted. Use the provided code below to get you started.
Creating a metric
Knowledge-based recommenders rely on some sort of score for returning their results. There's no right answer for what the score is or how it should be calculated. You'll need to consider the data that you're working with and decide how to calculate a meaningful score. If you are running an e-commerce site, you might use products with the highest percent off sale. If you're recommending library books, you might use a score built off of the number of times the book has been checked out, weighted by the number of weeks since it was first in circulation.
For this example, we're interested in highly-rated movies. But, it makes sense to consider both the average rating for a movie and the number of people that rated the movie. Say you have a rating scale from 1 to 5 stars. Your highest average rating would be 5 - a perfect score. But does a rating of 5 by one user mean the same thing as a rating of 5 by 100 users? Probably not.
Banik solves this problem by using the IMDB weighted rating:

$$\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$
Where:
v is the number of votes garnered by the movie
m is the minimum number of votes required for the movie to be in the chart (the prerequisite)
R is the mean rating of the movie
C is the mean rating of all the movies in the dataset
Banik chose the 80th percentile for the minimum number of votes to be included in the recommender.
Note that Banik chooses m (our minimum number of votes) based on the whole dataset, because IMDB sets this as the minimum threshold for being included in the ratings. So m is both a part of the metric AND a filter.
Let's also base C on the whole dataset to start.
Let's fetch C and m and filter to movies that have vote_counts greater than or equal to the 80th quantile. (This is equivalent to keeping the top 20% of movies by vote count.)
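A sketch of those steps, assuming the vote_average and vote_count columns from the movies metadata:

```python
C = df['vote_average'].mean()        # mean rating across all movies
m = df['vote_count'].quantile(0.80)  # 80th percentile of vote counts

# Keep only movies with at least m votes -- the top 20% by vote count.
q_movies = df[df['vote_count'] >= m].copy()
print(C, m, q_movies.shape)
```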
Let's write our function to do scoring. Note that unlike Banik's version, ours takes in x (the row of data) along with m and C. Passing all the variables you need into the function is a best practice that Banik does not follow, but we do.
Let's apply the score to the unfiltered dataframe.
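Roughly like this (the column names are assumptions based on the movies metadata):

```python
# IMDB weighted rating. x is one row of the dataframe; m and C are passed in
# explicitly instead of being read from the surrounding scope.
def weighted_rating(x, m, C):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Apply it row by row to the unfiltered dataframe.
df['score'] = df.apply(weighted_rating, axis=1, m=m, C=C)
df[['title', 'vote_count', 'vote_average', 'score']].sort_values('score', ascending=False).head(10)
```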
What happens if we use C from just the filtered dataframe?
You can see that it does indeed make a difference in the score. But would it make a difference in our recommendations?
Self-Assessment: Load and Display
Load the data set ted_clean.csv and display the first 5 rows. This data set can be found in the data folder for this lesson. More information about this data set is available here.
Self-Assessment: Pandas
How many talks are in the TED Talks data frame?
Self-Assessment: Prerequisites
Select TED talks with these prerequisites:
talks with duration of at least 5 minutes (i.e. 300 seconds)
talks with only 1 speaker
talks in the top 90% of views (exclude the bottom 10%)
Also inspect the number of talks that made the cut.
Self-Assessment: Compute a Metric, Sort and Print
In the absence of numerical ratings here, use the ratio of the number of comments per 1000 views as a metric to sort the TED talks and print the 10 with the highest ratios.
Display only the description, the main speaker, and the number of views.
Putting it together
Now that we've seen the pieces of the knowledge-based recommender, all we need to do is wrap the pieces in a function and call that function with the appropriate arguments to be used to filter the dataframe.
Our function will take in a cleaned dataframe and a percentile to use for m. By default, the percentile will be .8. The remaining inputs - genre, low_time, etc. - are gathered from the user and used for filtering the movies.
Note that the only change we're making here to Banik's approach is to adjust how we do the genres filter. We'll follow Banik's lead by calculating m after we filter the data to the user's selections.
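Here's a sketch of roughly how that might look. The split into a filter_and_score helper plus a prompting wrapper is our own organization (not Banik's exact code), the runtime and year column names are assumptions about the cleaned file, and weighted_rating is the function we wrote above.

```python
def filter_and_score(gen_df, percentile, genre, low_time, high_time, low_year, high_year):
    # Filter on the list-valued genres column without exploding.
    movies = gen_df[gen_df['genres'].apply(lambda gs: genre in gs)]
    movies = movies[(movies['runtime'] >= low_time) & (movies['runtime'] <= high_time)]
    movies = movies[(movies['year'] >= low_year) & (movies['year'] <= high_year)]

    # Calculate C and m AFTER filtering to the user's selections.
    # (You could also pass in a C computed from the whole dataset instead.)
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
    movies = movies[movies['vote_count'] >= m].copy()

    movies['score'] = movies.apply(weighted_rating, axis=1, m=m, C=C)
    return movies.sort_values('score', ascending=False)

def build_chart(gen_df, percentile=0.8):
    # Gather the user's preferences, Banik-style, with input().
    genre = input('Enter a genre: ')
    low_time = int(input('Shortest runtime (minutes): '))
    high_time = int(input('Longest runtime (minutes): '))
    low_year = int(input('Earliest release year: '))
    high_year = int(input('Latest release year: '))
    return filter_and_score(gen_df, percentile, genre, low_time, high_time, low_year, high_year)
```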
We'll test our code by requesting movies that have the genres 'Family' and a runtime between 80 and 120 minutes and a year between 1980 and 2000. We'll return the output to a variable so we can review it in different ways. (Don't worry if your saved code has extra boxes or looks out of order. That's just a bug in CoCalc's output saving.)
We have two other versions of scores in the dataset. Let's compare the rankings if we sort in different ways.
Did the different methods of computing the score impact the recommendation results in a significant way?
Alternative to prompting for user input
Using input() is fine for an application, but if you're just trying to test your recommender system, it's easier to pass in the arguments. Here we've rewritten the function so that the default inputs are None and the user is only prompted for the values that aren't supplied, while the arguments can be passed to the function to allow for easier testing.
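A sketch of that rewrite, reusing the filter_and_score helper from above (the ask_if_missing helper is our own convenience, not from the book):

```python
def ask_if_missing(value, prompt, cast=str):
    # Prompt the user only when the caller didn't supply a value.
    return cast(input(prompt)) if value is None else value

def build_chart2(gen_df, percentile=0.8, genre=None, low_time=None,
                 high_time=None, low_year=None, high_year=None):
    genre = ask_if_missing(genre, 'Enter a genre: ')
    low_time = ask_if_missing(low_time, 'Shortest runtime (minutes): ', int)
    high_time = ask_if_missing(high_time, 'Longest runtime (minutes): ', int)
    low_year = ask_if_missing(low_year, 'Earliest release year: ', int)
    high_year = ask_if_missing(high_year, 'Latest release year: ', int)
    return filter_and_score(gen_df, percentile, genre, low_time, high_time, low_year, high_year)
```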
We'll test our code by requesting movies that have the genres 'Family' and a runtime between 80 and 120 minutes and a year between 1980 and 2000. We'll return the output to a variable so we can review it in different ways. This time we will simply pass the inputs through the function call:
Self-Assessment: Create the Knowledge-Based Recommender
For this example we will use the TED Talks data set that you have already loaded to build a knowledge-based recommender by soliciting the desired publication year and word rating from the user. We've already extracted the year from the publish date and set up the ratings column as a list of words. (You'll still need to apply literal_eval to it, if you didn't already. Do that outside of your function in a separate cell. If you apply literal_eval twice, it will cause an error.)
Print a list of the descriptive word ratings for the user to choose from. (Hint: follow the directions in this lesson.)
Ask the user to enter answers to the following questions:
Enter a descriptive word for rating.
Enter the earliest year published for the talk (between 2006 and 2017).
Enter the latest year published for the talk (between 2006 and 2017).
Consider only talks with the top 90% of views (after filtering based on user preferences).
Display the top 5 recommended talks according to the "comments per 1000 views" ratio (calculated AFTER doing steps 2 & 3).
Display only the main speaker, the name of the talk, the year published, and the comments per thousand views ratio.
Show the results for the word rating "obnoxious" and published years between 2009 and 2014.
Content-Based Recommender
There are a number of different approaches to doing content-based recommenders. What they all share is that they use some method to:
select and clean text
convert text to a numerical representation
use the numerical representation of the text to assign similarity scores between each pair or set of items.
Once our pairs of items are scored, given one "seed" item, we can then recommend items that are similar to that item.
Let's break this down into steps.
Selecting and Cleaning Text
Banik talks about either using metadata or descriptions as the text that represents your item. This is a bit of a simplification. Really you could use any text, or any combination of text, that you have available to you. You should put time and effort into deciding what text you'll use. Consider which textual elements best represent the essence of the items you're recommending. Here's where that user testing comes into play. Maybe you write two recommenders - one using metadata and one using descriptions - and you get your family and friends to sit down and review the recommendations that they receive using each approach. Is one better than the other? Does one recommend entirely inappropriate things? (We've all heard the stories about recommender systems gone wrong.)
Whatever text you use, you most likely want to do some pre-processing of the text first. This step is referred to as "cleaning" the text. There are many, many possible steps you can take here. We'll demonstrate just a few of them. Banik doesn't really discuss this bit, but we think it's too important to gloss over. Here's a non-exhaustive list of things to consider doing to your text, once you've decided which bits of text you're going to use.
Converting it to lowercase.
Removing punctuation.
Stemming or Lemmatization.
Using N-grams or combining words.
Identifying and using specific parts of speech.
Entire courses have been written about natural language processing. We can't get into all the details, so we'll focus on just a few steps that you can take that integrate well with scikit-learn's CountVectorizer and TfidfVectorizer - converting to lowercase, removing punctuation, lemmatizing (converting words to their "root" word), and n-grams. Remember, a CountVectorizer creates a matrix of all the words available in the text, and counts each occurrence of each word. Banik does a nice job explaining this so we won't delve too deeply into it here. We'll loop back to TfidfVectorizer in a bit.
Converting to lowercase, removing stop words, and removing punctuation.
These are by far the easiest. The vectorizers automatically remove most punctuation. Converting to lowercase is the default setting for both of the vectorizers. And removing a standard set of stop words can be done just by adding one parameter. We'll add both parameters explicitly just so you can see what happens. First we need some data to play with. Let's set that up. Here we're creating a dataframe with three "documents" and setting the index of the documents to the ID column.
When we set up the CountVectorizer, we'll tell it that we want it to convert everything to lowercase and we want to remove stop words. Then we'll fit_transform the text using the Count Vectorizer. That's where the magic happens. Calling fit_transform and passing in our data will create the sparse matrix of words and counts. Finally, we'll just convert it to a dense matrix for easy display.
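Here's a small stand-in version of that setup. These aren't necessarily the lesson's exact sentences, but they behave the same way for the feature counts discussed below. (Older scikit-learn versions use get_feature_names() instead of get_feature_names_out().)

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny "documents" with an ID column used as the index.
corpus = pd.DataFrame({'ID': [1, 2, 3],
                       'text': ['Who is the queen bee?',
                                'There is honey in the hive.',
                                'The bees all have a queen.']}).set_index('ID')

# Lowercase everything and drop standard English stop words, then build the
# sparse word-count matrix and show it as a dense dataframe.
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
count_matrix = vectorizer.fit_transform(corpus['text'])
pd.DataFrame(count_matrix.toarray(),
             columns=vectorizer.get_feature_names_out(),
             index=corpus.index)
```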
You can see that after creating the matrix, we end up with five "features" - five unique words that hopefully provide some meaning. All of the prepositions, pronouns and articles were removed by applying the stop words.
N-Grams
What if we were interested in 2-word phrases? "Queen bee" has a different meaning than either queen or bee alone. Multi-word phrases are called n-grams, and the "n" can be replaced with any number. We could have bigrams (2-word phrases), trigrams (3-word phrases), and so on. To use n-grams with either of our scikit-learn vectorizers, we add the ngram_range parameter, and we give it a tuple indicating the shortest and longest phrase we're interested in considering. Let's try an n-gram of 1 or 2 word phrases.
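With our stand-in corpus, that's just one extra parameter:

```python
# Count both single words and 2-word phrases (built after stop word removal).
vectorizer = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))
count_matrix = vectorizer.fit_transform(corpus['text'])
pd.DataFrame(count_matrix.toarray(),
             columns=vectorizer.get_feature_names_out(),
             index=corpus.index)
```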
By adding in n-grams, we went from 5 features to 8 features. Note that this is a very small corpus of text. If we were analyzing large bodies of text, adding in n-grams can increase your feature space very quickly. Be cautious about using n-grams if you're working with large datasets. But, if you know your text has a lot of meaningful phrases, it might be an approach worth considering, particularly if you're getting poor results from single words.
Lemmatization
Finally, let's look at lemmatization. Lemmatizing a word means reducing it to its root word. For example, the root word for the verb "is" is "be." Roots for nouns are generally the singular form of the word. Root forms of verbs tend to be the present-tense form. Adding lemmatization is a bit more complicated. We need to create a custom tokenizer. The tokenizer is what the vectorizers use to chunk up the text into tokens (typically single words). We'll create a tokenizer that converts the words to their lemmas (root words) before the vectorizer creates the matrix. Don't panic. We're giving you the code here. There are two bits - one cell that you'd only need once and shouldn't have to alter at all, and then the code to actually set up the vectorizer, and call the fit_transform, which would only require passing in different data.
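A sketch along the lines of the LemmaTokenizer example in scikit-learn's documentation; the lesson's version may differ slightly, and scikit-learn may print a harmless warning that the stop word list is inconsistent with a custom tokenizer.

```python
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

# --- The "run once" cell: download NLTK data and define the tokenizer. ---
nltk.download('punkt')    # tokenizer models (newer NLTK may also need 'punkt_tab')
nltk.download('wordnet')  # the lemma dictionary

class LemmaTokenizer:
    """Split a document into tokens and convert each token to its lemma."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # Keep only alphabetic tokens so punctuation doesn't become a feature.
        return [self.wnl.lemmatize(tok) for tok in word_tokenize(doc) if tok.isalpha()]

# --- The vectorizer cell: same as before, plus our custom tokenizer. ---
vectorizer = CountVectorizer(lowercase=True, stop_words='english',
                             tokenizer=LemmaTokenizer())
count_matrix = vectorizer.fit_transform(corpus['text'])
pd.DataFrame(count_matrix.toarray(),
             columns=vectorizer.get_feature_names_out(),
             index=corpus.index)
```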
Now you can see we have just 4 features, with both document 1 and document 3 containing a reference to "bee." Lemmatization might be important if you're including free-form text. You would not need to use lemmatization if you were using standard metadata, such as keywords or genres. When using that kind of standard vocabulary, you should already have a smaller number of features and the words should be consistent across your rows of data.
Like many things in Data Science, there's no one "right" answer about how much to pre-process your data. Sometimes, the best thing you can do is try out different approaches and see which ones seem to work the best.
Term Frequency-Inverse Document Frequency
The Term Frequency Inverse Document Frequency Vectorizer works exactly the same way as the Count Vectorizer. Let's do some data cleaning of the "overview" column of our snip data set and create a Tfidf matrix. Since this is free-form text, we'll lemmatize the words, remove stop words and punctuation and convert to lowercase. Let's also use the max_features parameter (available in both Tfidf and Count vectorizers) to limit the features to the top 100.
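Something along these lines, assuming the snip dataframe has the overview column and the LemmaTokenizer defined above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase=True, stop_words='english',
                        tokenizer=LemmaTokenizer(), max_features=100)

# fillna('') guards against any missing overviews.
tfidf_matrix = tfidf.fit_transform(snip['overview'].fillna(''))
tfidf_matrix.shape
```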
Because we have such a small matrix, we can actually print it and take a peek at what's going on in there. In order to print a tfidf matrix, we need to convert it to something that can be displayed.
Note: You won't need to do this bit in your homework or self-assessment, but it's helpful in understanding what we have.
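One way to do that is to wrap the dense version in a labeled dataframe (assuming a title column in snip):

```python
import pandas as pd

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=tfidf.get_feature_names_out(),
                        index=snip['title'])
tfidf_df.round(2)
```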
When we print out this matrix, we can see that "afraid" is an important word in the description of Toy Story, but not in any of our other movies. But "wedding" is important in both Grumpier Old Men and Father of the Bride.
If we create a heatmap of the features, we can visually scan to see which movies are more similar. Similar movies would have similar patterns of colors. Which movies look most similar? Which movies look the most different?
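A quick sketch with seaborn, using the tfidf_df dataframe we just built:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 4))
sns.heatmap(tfidf_df, cmap='viridis')   # rows are movies, columns are terms
plt.show()
```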
Computing Cosine Similarity
Visually looking for similarities doesn't really work well for very many items. Instead, let's determine mathematically how similar each movie is to each other movie. We'll use cosine similarity to do that. Cosine similarity compares the direction of vectors in multi-dimensional space. The smaller the angle between two vectors, the higher the cosine similarity score. What now? Yep, it's a lot to wrap your head around. Just remember that each row of our matrix can be thought of as a vector that can be plotted in multi-dimensional space. It's easier if we think about a two-dimensional problem. Let's say we had 2 characteristics and 4 movies that we've scored based on those characteristics. If we plot those on a graph and draw a line from the origin (0,0) through each movie's score point, we'd have a vector for each movie pointing in a specific direction. Let's look at an example.
If we imagine here that the X axis represents "Comedy" and the Y axis represents "Action", our green arrow might represent a movie that is action-packed and a little bit funny. Our red arrow might be something with just a little action and a lot of comedy. Our blue movie is somewhere right in between. Our black arrow is the exact opposite of our red arrow, and probably represents a slow-moving, serious movie.
When we quantify this with cosine similarity, we end up with a number for each pair of movies that's somewhere between -1 (total opposites) and 1 (exactly the same). Let's look at the cosine similarity between some of our pairs of movies.
Of note here is that the length of the vector doesn't matter when we're dealing with cosine similarity - only the angle between the two vectors matters. So, our red and blue movies score nearly a perfect match, even though the red movie is arguably a more comedic movie than the blue movie. That's just how these similarity measures work.
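To make that concrete, here's a tiny check with made-up 2D vectors: a vector and a shorter vector pointing the same way are a perfect match, while the opposite direction scores -1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[4.0, 1.0]])    # lots of comedy, a little action
b = np.array([[2.0, 0.5]])    # same direction, half the length
c = np.array([[-4.0, -1.0]])  # exactly the opposite direction

print(cosine_similarity(a, b))  # [[1.]]
print(cosine_similarity(a, c))  # [[-1.]]
```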
Dot Product (linear_kernel) vs. Cosine Similarity
Banik briefly mentions this, but it bears repeating. If you have data that's already normalized so that all the scores share the same magnitude, such as what happens with TfidfVectorizer (where each score is between 0 and 1), you can use the computationally cheaper dot product calculation using linear_kernel(). If you're using the CountVectorizer, you have to use the cosine similarity function - cosine_similarity() - which will normalize your data before calculating your vector distances.
If you want to dive into the math of all this, the book "Practical Recommender Systems" by Kim Falk does a nice job of explaining and diving much deeper into these concepts.
With our snip dataset, we've constructed our vector matrix using Tfidf, so we can use the linear_kernel()/dot product method to get our similarity scores.
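Roughly like this, reusing the tfidf_matrix from above and labeling the result with titles for display:

```python
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel

# TF-IDF rows are already unit-length, so the dot product IS the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim_df = pd.DataFrame(cosine_sim, index=snip['title'], columns=snip['title'])
cosine_sim_df.round(2).style.background_gradient(cmap='viridis')
```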
Note that what we get is a matrix (which we've converted to a colored dataframe for display) that scores each movie in relation to another movie. A movie scored with itself will always be a one. Note that in our dataset here, Waiting to Exhale and Grumpier Old Men are our least similar two movies. Does that track with what you see in the heatmap above? Does that track with what you know of those two movies?
Which two movies are most similar?
Using the Similarity Matrix
Now that we have this matrix, we need to be able to use it to identify similar movies. Let's break down some of what Banik does.
Getting Index from Title
Banik uses a reverse mapping of indexes and titles to fetch data from the cosine similarity matrix. Let's take a look at what that is doing.
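The mapping itself is just a Series whose index is the title and whose values are the row positions. (We use range(len(snip)) so the positions line up with the similarity matrix even if the dataframe's index isn't 0, 1, 2, ...)

```python
import pandas as pd

indices = pd.Series(range(len(snip)), index=snip['title'])
indices
```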
Let's also break down what's going on with converting our cosine similarity to a list of tuples.
The first thing we're doing is getting the row from the matrix that corresponds to the movie we want to review. Let's say we want to review Grumpier Old Men. Let's use the reverse mapping to get the index, and then fetch that row from our matrix.
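Something like:

```python
idx = indices['Grumpier Old Men']  # 2 in our snip dataframe
cosine_sim[idx]
```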
The enumerate function loops over some iterable object and returns a counter and the value for each item in the iterable. We can see that what cosine_sim[2] returns is an array, which is an iterable object. We can't directly print the results from enumerate, so we have to wrap it in a list function.
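Putting that together:

```python
# Pair each column position with its similarity score for Grumpier Old Men.
sim_scores = list(enumerate(cosine_sim[idx]))
sim_scores
```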
What this results in is a list of tuples that correspond to the column number and the cosine similarity score for each movie that we compared to Grumpier Old Men. Which column number would be Grumpier Old Men compared with itself?
We can see that the most similar movie to Grumpier Old Men is... Grumpier Old Men. This makes sense - it's the same movie! We don't want that movie in our results, though. Since we know this is a square matrix (the indexes are the same for the columns and for the rows), we can just delete the item with our Grumpier Old Men index. Remember, that's 2. Let's see how that works.
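Deleting by list position works here because, before we sort, the list positions still line up with the matrix columns:

```python
del sim_scores[2]   # drop the (2, 1.0) entry -- Grumpier Old Men vs. itself
sim_scores
```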
Great. That got rid of the tuple that corresponded to the column Grumpier Old Men.
The next thing we do is to sort this list by the score (the second bit of the tuple). We're using a lambda function to do that. Let's see what we get when we sort.
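The lambda just tells sorted() to use the second element of each tuple (the score) as the sort key:

```python
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sim_scores
```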
We can see that column 4 and column 1 of our matrix contain our 2 most similar movies. But, what movies are those? We need to go back to our dataframe to figure that out. Let's extract just the indices for our top 2 movies. Finally, we'll use iloc to find the corresponding movie titles.
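Along these lines:

```python
# Keep the column positions of the top 2 scores, then look up their titles.
movie_indices = [i for i, score in sim_scores[:2]]
snip['title'].iloc[movie_indices]
```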
Recommender Function
Let's wrap this up in a function (see the sketch after this list). We're going to do this slightly differently than Banik did.
We'll avoid giving it any variable defaults (which is a good practice unless you're hard-coding the defaults).
We'll pass in the string to identify the seed column and get the indices mapping inside the function. (The seed column is whatever column we're using to identify the item we're interested in using to recommend other items. In this case, it's a movie title. But it could be a place name, or a recipe title, or whatever, depending on our recommender.)
We'll also pass in the number of results to return. This is a hard-coded number, so we will set a default for that.
We'll delete the passed-in movie explicitly, instead of assuming it's the first after sorting
We'll return the whole dataframe, not just the titles
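Here's a sketch of how those pieces might fit together. The exact signature and names are our own; the lesson's function may differ in the details.

```python
import pandas as pd

def content_recommender(df, seed_value, seed_col, sim_matrix, n_results=5):
    """Return the n_results rows of df most similar to seed_value."""
    # Build the reverse mapping inside the function (assumes seed values are unique).
    indices = pd.Series(range(len(df)), index=df[seed_col])
    idx = indices[seed_value]

    # Pair each row position with its similarity to the seed item.
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Remove the seed item itself explicitly -- don't assume it sorts first.
    del sim_scores[idx]

    # Sort by score, keep the top n_results, and return those whole rows.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    top_positions = [i for i, score in sim_scores[:n_results]]
    return df.iloc[top_positions]
```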
Let's test our recommender with Grumpier Old Men and our snipped dataset again.
Content-Based Recommender Self Assessment
For this example we will use the TED Talks data set that you have already loaded to build a content-based recommender based on the descriptions of the talks. This will correspond to the plot description-based recommender.
Self-Assessment: TF-IDF Vectors
From the original TED Talks data frame that we used in this lesson, create the TF-IDF (term frequency - inverse document frequency) matrix from the descriptions of the talks. The TF-IDF is high where a rare term is present or frequent in a document, and near zero where a term is absent from a document or abundant across all documents.
The feature name in the data frame is description.
Preprocess the description column by using lower case letters, removing punctuation, removing the default English stop words, and using only bigrams.
Output the shape of the TF-IDF matrix you create. The number of rows corresponds to the number of TED talks in the data frame and the number of columns represents the number of unique terms.
Self-Assessment: Create the Content-Based Recommender Based on Dot Product
Compute the dot product score for all of the TED talks in the data frame. Next build the recommender to request the name of a TED talk in the data frame and provide the top 5 recommended talks based on the similarity of the descriptions with the name of the talk supplied.
Show that it works by getting the top 5 recommended talks that are similar to the talk named "Tyler Cowen: Be suspicious of simple stories" (from the name column of the data frame).
Content-Based Recommender using MetaData
We mentioned at the beginning that you could use any text you have available. But so far, we've only demonstrated using free-form text - like a description, synopsis, or overview. You can also use metadata that comes from more defined and finite lists. Banik demonstrates this with keywords and credits and we have that code for you in the Content Based Recommenders file in this directory.
There are some things to consider when using the metadata to generate your list of features. First, if you have multiple metadata fields, you'll probably want to combine them. Banik calls this combined list of words the "soup." I think he's thinking of alphabet soup here, since we're throwing all the words together and stirring them up. That might be fine if each of your categorical lists of words contains only single words. But if you have categories that include phrases, you should consider pre-processing the phrases so that the spaces between the words in each phrase are removed. For instance, remembering what you've learned so far, think about what would happen if you had the following records:
| Business | Keywords |
|---|---|
| McDonald's | Fast Food, High Volume |
| MJ's Finest | High Prices, Upscale |
| Panera | Fast Casual |
If we were to vectorize the keywords as-is, we'd end up with the following words: fast, food, high, volume, prices, upscale, casual. But we'd lose the differentiation between high-volume and high-prices and fast-food and fast-casual. Banik uses "sanitizing" to prevent this kind of ambiguity. The sanitizing function is available in the book and in the Content Recommender chapter notebook.
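The gist of it looks something like this (see the book for Banik's actual version):

```python
# Lowercase each phrase and strip internal spaces, so "Fast Food" becomes
# "fastfood" and stays distinct from "Fast Casual" -> "fastcasual".
def sanitize(x):
    if isinstance(x, list):
        return [str.lower(item.replace(' ', '')) for item in x]
    if isinstance(x, str):
        return str.lower(x.replace(' ', ''))
    return ''
```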
We can also demonstrate the basic principles with our snip dataset. We already have our genres in a list, none of our snip movies has more than 3 genres, and each genre is only one word. So we don't have to generate lists or sanitize anything. We simply need to create a soup of our overview and our genres.
Note: genres is a list, so we'll need to use ' '.join() to turn it into a string. Overview is a string, so we just need to add that string onto the end of the string created after ' '.join()ing the genres. Be sure to add a space in between.
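For example (assuming genres has already been through literal_eval):

```python
# Joined genres, a space, then the overview text.
snip['soup'] = snip['genres'].apply(' '.join) + ' ' + snip['overview'].fillna('')
snip['soup'].head()
```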
Banik used a count vectorizer instead of a tf-idf vectorizer for his metadata recommender. He does this because using a tf-idf would downweight actors that appear in more than one movie. The same thing would happen with genres. So we'll follow suit and use the count vectorizer here. We'll remove stopwords and convert to lowercase and remove punctuation, but we won't use n-grams or lemmatization here.
You can read the scikit-learn documentation for CountVectorizer to learn more about it.
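A sketch of that, reusing the content_recommender function from earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Raw counts aren't normalized, so use cosine_similarity rather than linear_kernel.
count_vec = CountVectorizer(lowercase=True, stop_words='english')
count_matrix = count_vec.fit_transform(snip['soup'])
cosine_sim_soup = cosine_similarity(count_matrix, count_matrix)

content_recommender(snip, 'Grumpier Old Men', 'title', cosine_sim_soup, 2)
```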
Even with this tiny dataset, switching from TF-IDF with the linear kernel to the CountVectorizer with cosine similarity changed our top two results.
It's worth trying different approaches with your data to determine the right fit.
Self-Assessment: Metadata Recommender
Using the Ted Talks data, create a content recommender using the "ratings" and "tags" columns as your features. These columns include multi-word phrases, so use Banik's sanitize function on those two columns. Use a count vectorizer, removing English stop words and converting to lowercase, to create your vectorized matrix. Use cosine_similarity to compute your similarity matrix. Return the top 5 talks most closely related to the title "Humble plants that hide surprising secrets."