GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 13 - RecSys 1/Self_Assess_Solns_13.ipynb
Kernel: Python 3 (system-wide)
# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import string

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
from scipy.spatial import distance

# for reading files from urls
import urllib.request

# display imports
from IPython.display import display, IFrame
from IPython.core.display import HTML
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!

Lesson 13 - Self-Assessment Solutions

Self-Assessment: Modularize Fetching Unique Items

# this is a test dataframe to use
sa1_df = pd.DataFrame({
    'Food': ['Cake', 'Pie', 'Ice Cream'],
    'Flavors': [['Chocolate', 'Vanilla', 'Marble'],
                ['Apple', 'Chocolate', 'Cherry'],
                ['Vanilla', 'Cherry', 'Mint']]
})
display(sa1_df)

def getUniqueListFromColumn(df, col, returntype='string', sort=True):
    # stack everything and get the unique values
    stacked = df.apply(lambda x: pd.Series(x[col], dtype='object'), axis=1).stack().unique()
    # if the user wants it sorted, sort it
    if sort:
        stacked = np.sort(stacked)
    # if the user wants a string back, join to give a string
    if returntype == 'string':
        stacked = ', '.join(stacked)
    return stacked

print(f"The sorted string list is: {getUniqueListFromColumn(sa1_df, 'Flavors', 'string', True)}")
print(f"The unsorted array is: {getUniqueListFromColumn(sa1_df, 'Flavors', 'array', False)}")
The sorted string list is: Apple, Cherry, Chocolate, Marble, Mint, Vanilla
The unsorted array is: ['Chocolate' 'Vanilla' 'Marble' 'Apple' 'Cherry' 'Mint']
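As a side note, pandas' built-in explode offers a shorter route to the same unique values. The sketch below is purely illustrative and not part of the graded solution; it assumes sa1_df from above.

# a minimal sketch: explode flattens the list-valued column,
# so unique() on the result gives the same set of flavors
unique_flavors = sa1_df['Flavors'].explode().unique()
print(np.sort(unique_flavors))             # sorted array of unique flavors
print(', '.join(np.sort(unique_flavors)))  # joined into a single string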

Self-Assessment: Load and Display - Solution

There's nothing too new here. You've done this kind of work before. What's more important than the code is taking a minute or two to understand the data you're pulling in. What columns do you have available to you? Which columns contain simple values and which contain lists? Think about how you could or couldn't use this data to make recommendations.

import pandas as pd
import numpy as np
from ast import literal_eval

ted = pd.read_csv('./data/ted_clean.csv')

# we need ratings to be literally evaluated before using it
ted['ratings'] = ted['ratings'].apply(literal_eval)
ted.head()
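If you want to answer those questions programmatically rather than by eyeballing head(), a quick sketch like this works once ted is loaded; it simply reports each column's dtype and which columns hold Python lists after literal_eval.

# quick inspection sketch (assumes ted has already been loaded as above)
print(ted.dtypes)  # column names and their dtypes

# columns whose first value is a Python list
list_cols = [c for c in ted.columns if isinstance(ted[c].iloc[0], list)]
print("List-valued columns:", list_cols)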

Self-Assessment: Pandas - Solution

Remember that shape gives you the number of rows first, followed by the number of columns.

ted.shape
(2550, 18)

There are 2550 TED talks in this data frame.

Self-Assessment: Prerequisites - Solution

Remember that when you're calculating the quantile for some piece of data, you'll get different results if you calculate it before or after you do your other subsetting. First, let's calculate the views quantile before we apply the rest of our prerequisites.

# Calculate the number of views for the 10th percentile - calculated from the whole dataframe
m = ted['views'].quantile(0.10)

# Only consider talks of at least 5 minutes
q_talks = ted[(ted['duration'] >= 300)]

# Only consider talks with one speaker
q_talks = q_talks[q_talks['num_speaker'] == 1]

# Only consider talks in the top 90%
q_talks = q_talks[q_talks['views'] >= m]

# Inspect the number of talks that made the cut
q_talks.shape[0]
2107

Let's compare that with calculating the quantile after we subset.

# Only consider talks of at least 5 minutes
q_talks2 = ted[(ted['duration'] >= 300)]

# Only consider talks with one speaker
q_talks2 = q_talks2[q_talks2['num_speaker'] == 1]

# Calculate the number of views for the 10th percentile - calculated from the subsetted dataframe
m2 = q_talks2['views'].quantile(0.10)

# Only consider talks in the top 90%
q_talks2 = q_talks2[q_talks2['views'] >= m2]

# Inspect the number of talks that made the cut
q_talks2.shape[0]
2093

There is no universally "right" answer as to whether you should calculate the quantile before or after you've narrowed the initial dataset; it depends on what you're trying to accomplish. If you want the most-viewed talks among those that meet your criteria, calculate it after you've subsetted. If you want talks that rank among the most-viewed overall, calculate it before you've subsetted.
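If the distinction feels abstract, a tiny made-up example makes it concrete. The numbers below are invented purely for illustration.

# toy illustration with made-up view counts
views = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
subset = views[views >= 40]   # pretend this is our duration/speaker filter

print(views.quantile(0.10))   # 10th percentile of ALL talks: 19.0
print(subset.quantile(0.10))  # 10th percentile of the SUBSET only: 46.0

Filtering with the first cutoff keeps every subset talk above the overall 10th percentile; filtering with the second drops the bottom 10% of the subset itself, which is why the two row counts above differ.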

For our homework, we'll either tell you when to subset a dataframe or ask you to make the decision and give a justification for your decision.

Self-Assessment: Compute a Metric, Sort and Print - Solution

Note that here we are computing our metric on our narrowed dataset. We could have created the metric on the entire dataset, but if we know that we're only interested in a portion of the talks, we should narrow our dataset before computing the metric.

# create the metric of the comments to views ratio
q_talks['comments_per_1000views'] = 1000 * q_talks['comments'] / q_talks['views']

# Sort talks in descending order of the ratio of comments to views
q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)

# Print the top 10 talks
q_talks[['description', 'main_speaker', 'comments_per_1000views']].head(10)
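Equivalently, pandas' nlargest can replace the sort-then-head pattern when you only need the top few rows; this is just an optional alternative, not a required part of the solution.

# optional alternative: nlargest avoids keeping the full sorted frame around
q_talks.nlargest(10, 'comments_per_1000views')[
    ['description', 'main_speaker', 'comments_per_1000views']
]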

Self-Assessment: Create the Knowledge-Based Recommender - Solution

We're creating this as a function that takes in the dataframe and the percentile of views that we want to keep. We'll first generate our list of unique rating words to present to users, and then filter by checking whether the chosen word appears in each talk's list of ratings.

def build_chart(gen_ted, percentile=0.1):
    # Ask for preferred word rating
    print("Select a descriptive word from the list above for the 'word rating'")
    rating = input()

    # Ask for lower limit of published year
    print("Input earliest year published (2006 to 2017)")
    low_year = int(input())

    # Ask for upper limit of published year
    print("Input latest year published (2006 to 2017)")
    high_year = int(input())

    # Define a new talks variable to store the preferred talks.
    # Copy the contents of gen_ted to talks
    talks = gen_ted.copy()

    # Filter based on the conditions
    talks = talks[(talks['ratings'].apply(lambda x: rating in x)) &
                  (talks['published_year'] >= low_year) &
                  (talks['published_year'] <= high_year)]

    # Calculate the number of views for the percentile
    m = talks['views'].quantile(percentile)

    # Only consider talks that have at least m views. Save this in a new dataframe q_talks
    # (note using .loc here prevents a warning)
    q_talks = talks.copy().loc[talks['views'] >= m]

    # create the metric of the comments to views ratio
    q_talks['comments_per_1000views'] = 1000 * q_talks['comments'] / q_talks['views']

    # Sort talks in descending order of comments per 1000 views
    q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)

    return q_talks
# First we'll print a list of possible ratings
r = getUniqueListFromColumn(ted, 'ratings', 'string', True)
print(f'Please select a rating from the following: {r}')

# Generate the chart for top talks for these user preferences and display the top 5.
# Show the results for the word rating "obnoxious" and published years between 2009 and 2014.
gen_ted_final = build_chart(ted).head(5)
gen_ted_final[['main_speaker', 'name', 'published_year', 'comments_per_1000views']]
Please select a rating from the following: beautiful, confusing, courageous, fascinating, funny, informative, ingenious, inspiring, jaw-dropping, longwinded, obnoxious, ok, persuasive, unconvincing
Select a descriptive word from the list above for the 'word rating'
Input latest year published (2006 to 2017)
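Because input() makes the function awkward to test, one variation worth knowing about is to pass the preferences in as arguments instead. The sketch below is purely illustrative (the name build_chart_args is our own, not part of the assigned solution) and reuses the same filtering logic as build_chart.

# illustrative sketch: same logic as build_chart, but preferences are function
# arguments instead of interactive input(), which makes the function easy to test
def build_chart_args(gen_ted, rating, low_year, high_year, percentile=0.1):
    talks = gen_ted.copy()
    talks = talks[(talks['ratings'].apply(lambda x: rating in x)) &
                  (talks['published_year'] >= low_year) &
                  (talks['published_year'] <= high_year)]
    m = talks['views'].quantile(percentile)
    q = talks.copy().loc[talks['views'] >= m]
    q['comments_per_1000views'] = 1000 * q['comments'] / q['views']
    return q.sort_values('comments_per_1000views', ascending=False)

# example call with the same preferences used above
# build_chart_args(ted, 'obnoxious', 2009, 2014).head(5)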

Self-Assessment: TF-IDF Vectors - Solution

This is all straight from the book. More information about the TfidfVectorizer is available online here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Define a TF-IDF Vectorizer object: remove all English stopwords,
# lowercase the text, and use bigrams (ngram_range=(2, 2))
tfidf = TfidfVectorizer(stop_words='english', lowercase=True, ngram_range=(2, 2))

# Replace NaN with an empty string
ted['description'] = ted['description'].fillna('')

# Construct the required TF-IDF matrix by applying the fit_transform method on the description feature
tfidf_matrix = tfidf.fit_transform(ted['description'])

# Output the shape of tfidf_matrix (rows first, then columns)
tfidf_matrix.shape
(2550, 63416)
# bonus - take a look at some of the individual bigrams from the descriptions
feature_names = tfidf.get_feature_names_out()
feature_names[500:510]
array(['40 video', '40 years', '400 metric', '400 pounds', '400 years', '4000 year', '404 page', '404 pages', '413 billion', '45 story'], dtype=object)
# bonus - this is saying that for the first document, none of the features at
# positions 500-509 shown above appear in that document
tfidf_list = tfidf_matrix.toarray()
tfidf_list[0, 500:510]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
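If you want a bonus look at which bigrams do matter for a given talk, a sketch like the one below pulls the highest-weighted features for the first description. It reuses tfidf_matrix and feature_names from above; the choice of five is arbitrary.

# sketch: show the five bigrams with the largest TF-IDF weights in the first description
row = tfidf_matrix[0].toarray().ravel()   # dense weights for document 0
top_idx = row.argsort()[::-1][:5]         # indices of the five largest weights
for i in top_idx:
    print(f"{feature_names[i]}: {row[i]:.3f}")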

Self-Assessment: Create the Content-Based Recommender Based on Dot Product - Solution

This is also straight from the book. We don't expect you to understand everything to do with linear kernels. But if you're interested, the documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html

# Compute the dot product similarity matrix
sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
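One way to convince yourself that the plain dot product is a sensible similarity here: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so linear_kernel on this matrix produces the same numbers as cosine_similarity. The optional check below just verifies that claim on our matrices.

# optional check: with the default norm='l2', dot product equals cosine similarity
cos_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(np.allclose(sim_matrix, cos_matrix))  # expected: True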
def content_recommender(df, seed, seedCol, sim_matrix, topN=2):
    # get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

    # Obtain the index of the item that matches our seed
    idx = indices[seed]

    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))

    # delete the item that was passed in
    del sim_scores[idx]

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the topN most similar items
    sim_scores = sim_scores[:topN]

    # Get the item indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the topN most similar items
    return df.iloc[movie_indices]
# Get recommendations for Tyler Cowen: Be suspicious of simple stories
content_recommender(ted, 'Tyler Cowen: Be suspicious of simple stories', 'name', sim_matrix, 10)

Self-Assessment: Metadata Recommender

Reminder: You are using the ratings and the tags. Sanitize both first. Use all the words from each to make the soup.

# Function to sanitize data to prevent ambiguity. It removes spaces and converts to lowercase
def sanitize(x):
    if isinstance(x, list):
        # Strip spaces and convert to lowercase
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if the value is a string. If not, return an empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

ted = pd.read_csv('./data/ted_clean.csv')

# literal_eval and sanitize both columns
ted['ratings'] = ted['ratings'].apply(literal_eval).apply(sanitize)
ted['tags'] = ted['tags'].apply(literal_eval).apply(sanitize)

# Function that creates a soup out of the desired metadata
def create_soup(x):
    return ' '.join(x['ratings']) + ' ' + ' '.join(x['tags'])

# create a column with the soup in it
ted['soup'] = ted.apply(create_soup, axis=1)

print(f'The soup for {ted["title"][0]} is: \n{ted["soup"][0]}')
The soup for Do schools kill creativity? is: funny beautiful ingenious courageous longwinded confusing informative fascinating unconvincing persuasive jaw-dropping ok obnoxious inspiring children creativity culture dance education parenting teaching
count = CountVectorizer(stop_words='english', lowercase=True)
count_matrix = count.fit_transform(ted['soup'])

# Compute the cosine similarity score
cosine_sim = cosine_similarity(count_matrix, count_matrix)

# call our same content_recommender function, this time seeding it with a talk title
content_recommender(ted, 'Humble plants that hide surprising secrets', 'title', cosine_sim, topN=5)
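If you'd like a quick sanity check that the soup tokenized the way you intended, an optional sketch is to peek at the CountVectorizer vocabulary and at the count vector for the first talk; it reuses count and count_matrix from above.

# optional sanity check on the soup vocabulary
vocab = count.get_feature_names_out()
print(vocab[:10])                        # first few vocabulary terms
print(count_matrix[0].toarray().sum())   # total tokens kept from the first talk's soup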