GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-13/code/starter-code/starter-code-13.ipynb
Kernel: Python 2

spaCy Demo

If you haven't installed spaCy yet, use:

conda install spacy
python -m spacy.en.download

This downloads about 500 MB of data.

Another popular package, nltk, can be installed as follows (you can skip this for now):

conda install nltk
python -m nltk.downloader all

This also downloads a lot of data.

Load StumbleUpon dataset

# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t', encoding="utf-8")

# The boilerplate column holds JSON with the page's title and body text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()
## Load spacy
from spacy.en import English

nlp_toolkit = English()
nlp_toolkit

Another way to load spaCy:

import spacy
nlp_toolkit = spacy.load("en")
title = u"IBM sees holographic calls, air breathing batteries" parsed = nlp_toolkit(title) for (i, word) in enumerate(parsed): print "Word: {}".format(word) print "\t Phrase type: {}".format(word.dep_) print "\t Is the word a known entity type? {}".format( word.ent_type_ if word.ent_type_ else "No") print "\t Lemma: {}".format(word.lemma_) print "\t Parent of this word: {}".format(word.head.lemma_)

Investigate Page Titles

Let's see if we can find organizations in our page titles.

def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# Take a look
data[data['references_organization']][['title']].head()

Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

. . . . . . . .

## Exercise solution
def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG' for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data['references_org_person']][['title']].head()

Predicting "Greenness" Of Content

This dataset comes from StumbleUpon, a web page recommender.

A description of the columns is below:

| FieldName | Type | Description |
|---|---|---|
| url | string | Url of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of article |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with an url with domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`'s text contains more than 30 alphanumerics |
| linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in url |
| parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |

Let's try extracting some of the text content.

Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have recipe in the title?

# Option 1: Create a function to check for this
def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except:
        return 0

data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions
#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)

# Option 3: string functions
# Note: this version yields booleans, and NaN where the title is missing
data['recipe'] = data['title'].str.contains('recipe')
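To answer the question above, one quick check (a sketch, assuming the recipe flag from the cell above) is to compare the share of evergreen pages within each group:

# Share of evergreen pages (label == 1) with and without 'recipe' in the title.
# Rows whose title was missing (NaN flag) are dropped by groupby.
data.groupby('recipe')['label'].mean()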

Demo: Use of the Count Vectorizer

titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000,
                             ngram_range=(1, 2),
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles)
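To sanity-check what the vectorizer learned, you can inspect the matrix shape and a few of the learned features (a quick peek, assuming the cell above has run; `get_feature_names` is the accessor in scikit-learn versions contemporary with this notebook):

# One row per title, one column per word/bigram kept by max_features
print(X.shape)
# The first few vocabulary entries, in alphabetical order
print(vectorizer.get_feature_names()[:10])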

Demo: Build a random forest model to predict the evergreenness of a website using the title features

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=20)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

Exercise: Build a random forest model to predict the evergreenness of a website using the title features and quantitative features

## TODO
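One possible solution (a minimal sketch: the choice of quantitative columns here is an assumption for illustration; swap in any numeric fields from the table above):

import numpy as np

# Title features from the fitted vectorizer, as a dense array
X_titles = vectorizer.transform(titles).toarray()

# A few numeric columns from the dataset (an arbitrary illustrative choice)
quantitative_columns = ['html_ratio', 'image_ratio', 'spelling_errors_ratio']
X_quant = data[quantitative_columns].fillna(0).values

# Stack the two feature blocks side by side
X_combined = np.hstack([X_titles, X_quant])
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_combined, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))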

Exercise: Build a random forest model to predict the evergreenness of a website using the body features

## TODO
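A minimal sketch of one approach: vectorize the body text the same way the titles were vectorized, then cross-validate a random forest on those features:

# Vectorize the page bodies with the same settings used for the titles
body_text = data['body'].fillna('')

body_vectorizer = CountVectorizer(max_features=1000,
                                  ngram_range=(1, 2),
                                  stop_words='english',
                                  binary=True)
body_vectorizer.fit(body_text)
X_body = body_vectorizer.transform(body_text).toarray()
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_body, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))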

Exercise: Use TfidfVectorizer instead of CountVectorizer - is this an improvement?

## TODO
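A minimal sketch: TfidfVectorizer down-weights terms that appear in many documents instead of using binary counts; comparing its cross-validated AUC with the CountVectorizer result above tells you whether it helps on this data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same vocabulary settings as before, but tf-idf weighted instead of binary
tfidf = TfidfVectorizer(max_features=1000,
                        ngram_range=(1, 2),
                        stop_words='english')
X_tfidf = tfidf.fit_transform(titles).toarray()
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_tfidf, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))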