GitHub Repository: DataScienceUWL/DS775
Path: blob/main/Lessons/Lesson 13 - RecSys 1/Self_Assess_Solns_13.ipynb
Kernel: Python 3 (system-wide)
# EXECUTE FIRST

# computational imports
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import string

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
from scipy.spatial import distance

# for reading files from urls
import urllib.request

# display imports
from IPython.display import display, IFrame
from IPython.core.display import HTML
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!

Lesson 13 - Self-Assessment Solutions

Self-Assessment: Modularize Fetching Unique Items

# this is a test dataframe to use
sa1_df = pd.DataFrame({
    'Food': ['Cake', 'Pie', 'Ice Cream'],
    'Flavors': [['Chocolate', 'Vanilla', 'Marble'],
                ['Apple', 'Chocolate', 'Cherry'],
                ['Vanilla', 'Cherry', 'Mint']]
})
display(sa1_df)

def getUniqueListFromColumn(df, col, returntype='string', sort=True):
    # stack everything and get the unique values
    stacked = df.apply(lambda x: pd.Series(x[col], dtype='object'), axis=1).stack().unique()
    # if the user wants it sorted, sort it
    if sort:
        stacked = np.sort(stacked)
    # if the user wants a string back, join to give a string
    if returntype == 'string':
        stacked = ', '.join(stacked)
    return stacked

print(f"The sorted string list is: {getUniqueListFromColumn(sa1_df, 'Flavors', 'string', True)}")
print(f"The unsorted array is: {getUniqueListFromColumn(sa1_df, 'Flavors', 'array', False)}")
The sorted string list is: Apple, Cherry, Chocolate, Marble, Mint, Vanilla
The unsorted array is: ['Chocolate' 'Vanilla' 'Marble' 'Apple' 'Cherry' 'Mint']
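As a side note, pandas' built-in explode offers a shorter route to the same unique values. The sketch below is purely illustrative and not part of the graded solution; it assumes sa1_df from above.

# a minimal sketch: explode flattens the list-valued column,
# so unique() on the result gives the same set of flavors
unique_flavors = sa1_df['Flavors'].explode().unique()
print(np.sort(unique_flavors))             # sorted array of unique flavors
print(', '.join(np.sort(unique_flavors)))  # joined into a single string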

Self-Assessment: Load and Display - Solution

There's nothing too new here. You've done this kind of work before. What's more important than the code is taking a minute or two to understand the data you're pulling in. What columns do you have available to you? Which columns contain simple values and which contain lists? Think about how you could or couldn't use this data to make recommendations.

import pandas as pd
import numpy as np
from ast import literal_eval

ted = pd.read_csv('./data/ted_clean.csv')

# we need ratings to be literally evaluated before using it
ted['ratings'] = ted['ratings'].apply(literal_eval)
ted.head()
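If you want to answer those questions programmatically rather than by eyeballing head(), a quick sketch like this works once ted is loaded; it simply reports each column's dtype and which columns hold Python lists after literal_eval.

# quick inspection sketch (assumes ted has already been loaded as above)
print(ted.dtypes)  # column names and their dtypes

# columns whose first value is a Python list
list_cols = [c for c in ted.columns if isinstance(ted[c].iloc[0], list)]
print("List-valued columns:", list_cols)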

Self-Assessment: Pandas - Solution

Remember that shape gives you the number of rows first, followed by the number of columns.

ted.shape
(2550, 18)

There are 2550 TED talks in this data frame.

Self-Assessment: Prerequisites - Solution

Remember that when you're calculating the quantile for some piece of data, you'll get different results if you calculate it before or after you do your other subsetting. First, let's calculate the views quantile before we apply the rest of our prerequisites.

# Calculate the number of views for the 10th percentile - calculated from the whole dataframe
m = ted['views'].quantile(0.10)

# Only consider talks of at least 5 minutes
q_talks = ted[(ted['duration'] >= 300)]

# Only consider talks with one speaker
q_talks = q_talks[q_talks['num_speaker'] == 1]

# Only consider talks in the top 90%
q_talks = q_talks[q_talks['views'] >= m]

# Inspect the number of talks that made the cut
q_talks.shape[0]
2107

Let's compare that with calculating the quantile after we subset.

# Only consider talks of at least 5 minutes
q_talks2 = ted[(ted['duration'] >= 300)]

# Only consider talks with one speaker
q_talks2 = q_talks2[q_talks2['num_speaker'] == 1]

# Calculate the number of views for the 10th percentile - calculated from the subsetted dataframe
m2 = q_talks2['views'].quantile(0.10)

# Only consider talks in the top 90%
q_talks2 = q_talks2[q_talks2['views'] >= m2]

# Inspect the number of talks that made the cut
q_talks2.shape[0]
2093

There is no universally "right" answer as to whether you should calculate the quantile before or after you've narrowed the initial dataset; it depends on what you're trying to accomplish. If you want the most-viewed talks among those that meet your criteria, calculate it after you've subsetted. If you want talks that rank among the most-viewed overall, calculate it before you've subsetted.
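If the distinction feels abstract, a tiny made-up example makes it concrete. The numbers below are invented purely for illustration.

# toy illustration with made-up view counts
views = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
subset = views[views >= 40]   # pretend this is our duration/speaker filter

print(views.quantile(0.10))   # 10th percentile of ALL talks: 19.0
print(subset.quantile(0.10))  # 10th percentile of the SUBSET only: 46.0

Filtering with the first cutoff keeps every subset talk above the overall 10th percentile; filtering with the second drops the bottom 10% of the subset itself, which is why the two row counts above differ.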

For our homework, we'll either tell you when to subset a dataframe or ask you to make the decision and give a justification for your decision.

Self-Assessment: Compute a Metric, Sort and Print - Solution

Note that here we are computing our metric on our narrowed dataset. We could have created the metric on the entire dataset, but if we know that we're only interested in a portion of the talks, we should narrow our dataset before computing the metric.

# create the metric of the comments to views ratio
q_talks['comments_per_1000views'] = 1000 * q_talks['comments'] / q_talks['views']

# Sort talks in descending order of the ratio of comments to views
q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)

# Print the top 10 talks
q_talks[['description', 'main_speaker', 'comments_per_1000views']].head(10)
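Equivalently, pandas' nlargest can replace the sort-then-head pattern when you only need the top few rows; this is just an optional alternative, not a required part of the solution.

# optional alternative: nlargest avoids keeping the full sorted frame around
q_talks.nlargest(10, 'comments_per_1000views')[
    ['description', 'main_speaker', 'comments_per_1000views']
]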

Self-Assessment: Create the Knowledge-Based Recommender - Solution

We're creating this as a function that takes in the dataframe and the percentile of views that we want to keep. We'll first generate our list of unique rating words to present to users, and then filter by checking whether the chosen word appears in each talk's list of ratings.

def build_chart(gen_ted, percentile=0.1):
    # Ask for preferred word rating
    print("Select a descriptive word from the list above for the 'word rating'")
    rating = input()

    # Ask for lower limit of published year
    print("Input earliest year published (2006 to 2017)")
    low_year = int(input())

    # Ask for upper limit of published year
    print("Input latest year published (2006 to 2017)")
    high_year = int(input())

    # Define a new talks variable to store the preferred talks.
    # Copy the contents of gen_ted to talks
    talks = gen_ted.copy()

    # Filter based on the conditions
    talks = talks[(talks['ratings'].apply(lambda x: rating in x)) &
                  (talks['published_year'] >= low_year) &
                  (talks['published_year'] <= high_year)]

    # Calculate the number of views for the percentile
    m = talks['views'].quantile(percentile)

    # Only consider talks that have at least m views. Save this in a new dataframe q_talks
    # (note using .loc here prevents a warning)
    q_talks = talks.copy().loc[talks['views'] >= m]

    # create the metric of the comments to views ratio
    q_talks['comments_per_1000views'] = 1000 * q_talks['comments'] / q_talks['views']

    # Sort talks in descending order of comments per 1000 views
    q_talks = q_talks.sort_values('comments_per_1000views', ascending=False)

    return q_talks
# First we'll print a list of possible ratings
r = getUniqueListFromColumn(ted, 'ratings', 'string', True)
print(f'Please select a rating from the following: {r}')

# Generate the chart for top talks for these user preferences and display the top 5.
# Show the results for the word rating "obnoxious" and published years between 2009 and 2014.
gen_ted_final = build_chart(ted).head(5)
gen_ted_final[['main_speaker', 'name', 'published_year', 'comments_per_1000views']]
Please select a rating from the following: beautiful, confusing, courageous, fascinating, funny, informative, ingenious, inspiring, jaw-dropping, longwinded, obnoxious, ok, persuasive, unconvincing
Select a descriptive word from the list above for the 'word rating'
Input latest year published (2006 to 2017)
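Because input() makes the function awkward to test, one variation worth knowing about is to pass the preferences in as arguments instead. The sketch below is purely illustrative (the name build_chart_args is our own, not part of the assigned solution) and reuses the same filtering logic as build_chart.

# illustrative sketch: same logic as build_chart, but preferences are function
# arguments instead of interactive input(), which makes the function easy to test
def build_chart_args(gen_ted, rating, low_year, high_year, percentile=0.1):
    talks = gen_ted.copy()
    talks = talks[(talks['ratings'].apply(lambda x: rating in x)) &
                  (talks['published_year'] >= low_year) &
                  (talks['published_year'] <= high_year)]
    m = talks['views'].quantile(percentile)
    q = talks.copy().loc[talks['views'] >= m]
    q['comments_per_1000views'] = 1000 * q['comments'] / q['views']
    return q.sort_values('comments_per_1000views', ascending=False)

# example call with the same preferences used above
# build_chart_args(ted, 'obnoxious', 2009, 2014).head(5)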

Self-Assessment: TF-IDF Vectors - Solution

This is all straight from the book. More information about the TfidfVectorizer is available online here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Define a TF-IDF Vectorizer object: remove all English stopwords,
# lowercase the text, and use bigrams (ngram_range=(2, 2))
tfidf = TfidfVectorizer(stop_words='english', lowercase=True, ngram_range=(2, 2))

# Replace NaN with an empty string
ted['description'] = ted['description'].fillna('')

# Construct the required TF-IDF matrix by applying the fit_transform method on the description feature
tfidf_matrix = tfidf.fit_transform(ted['description'])

# Output the shape of tfidf_matrix (rows first, then columns)
tfidf_matrix.shape
(2550, 63416)
# bonus - take a look at some of the individual bigrams from the descriptions
feature_names = tfidf.get_feature_names_out()
feature_names[500:510]
array(['40 video', '40 years', '400 metric', '400 pounds', '400 years', '4000 year', '404 page', '404 pages', '413 billion', '45 story'], dtype=object)
# bonus - this is saying that for the first document, none of the features at
# positions 500-509 shown above appear in that document
tfidf_list = tfidf_matrix.toarray()
tfidf_list[0, 500:510]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
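If you want a bonus look at which bigrams do matter for a given talk, a sketch like the one below pulls the highest-weighted features for the first description. It reuses tfidf_matrix and feature_names from above; the choice of five is arbitrary.

# sketch: show the five bigrams with the largest TF-IDF weights in the first description
row = tfidf_matrix[0].toarray().ravel()   # dense weights for document 0
top_idx = row.argsort()[::-1][:5]         # indices of the five largest weights
for i in top_idx:
    print(f"{feature_names[i]}: {row[i]:.3f}")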

Self-Assessment: Create the Content-Based Recommender Based on Dot Product - Solution

This is also straight from the book. We don't expect you to understand everything to do with linear kernels. But if you're interested, the documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html

# Compute the dot product similarity matrix
sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
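One way to convince yourself that the plain dot product is a sensible similarity here: TfidfVectorizer L2-normalizes each row by default (norm='l2'), so linear_kernel on this matrix produces the same numbers as cosine_similarity. The optional check below just verifies that claim on our matrices.

# optional check: with the default norm='l2', dot product equals cosine similarity
cos_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(np.allclose(sim_matrix, cos_matrix))  # expected: True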
def content_recommender(df, seed, seedCol, sim_matrix, topN=2):
    # get the indices based off the seedCol
    indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

    # Obtain the index of the item that matches our seed
    idx = indices[seed]

    # Get the pairwise similarity scores of all items and convert to tuples
    sim_scores = list(enumerate(sim_matrix[idx]))

    # delete the item that was passed in
    del sim_scores[idx]

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the topN most similar items
    sim_scores = sim_scores[:topN]

    # Get the item indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the topN most similar items
    return df.iloc[movie_indices]
# Get recommendations for Tyler Cowen: Be suspicious of simple stories
content_recommender(ted, 'Tyler Cowen: Be suspicious of simple stories', 'name', sim_matrix, 10)

Self-Assessment: Metadata Recommender

Reminder: You are using the ratings and the tags. Sanitize both first. Use all the words from each to make the soup.

# Function to sanitize data to prevent ambiguity. It removes spaces and converts to lowercase
def sanitize(x):
    if isinstance(x, list):
        # Strip spaces and convert to lowercase
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if the value is a string. If not, return an empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

ted = pd.read_csv('./data/ted_clean.csv')

# literal_eval and sanitize both columns
ted['ratings'] = ted['ratings'].apply(literal_eval).apply(sanitize)
ted['tags'] = ted['tags'].apply(literal_eval).apply(sanitize)

# Function that creates a soup out of the desired metadata
def create_soup(x):
    return ' '.join(x['ratings']) + ' ' + ' '.join(x['tags'])

# create a column with the soup in it
ted['soup'] = ted.apply(create_soup, axis=1)

print(f'The soup for {ted["title"][0]} is: \n{ted["soup"][0]}')
The soup for Do schools kill creativity? is: funny beautiful ingenious courageous longwinded confusing informative fascinating unconvincing persuasive jaw-dropping ok obnoxious inspiring children creativity culture dance education parenting teaching
count = CountVectorizer(stop_words='english', lowercase=True)
count_matrix = count.fit_transform(ted['soup'])

# Compute the cosine similarity score
cosine_sim = cosine_similarity(count_matrix, count_matrix)

# call our same content_recommender function, this time seeding it with a talk title
content_recommender(ted, 'Humble plants that hide surprising secrets', 'title', cosine_sim, topN=5)
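If you'd like a quick sanity check that the soup tokenized the way you intended, an optional sketch is to peek at the CountVectorizer vocabulary and at the count vector for the first talk; it reuses count and count_matrix from above.

# optional sanity check on the soup vocabulary
vocab = count.get_feature_names_out()
print(vocab[:10])                        # first few vocabulary terms
print(count_matrix[0].toarray().sum())   # total tokens kept from the first talk's soup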