GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-13/code/starter-code/starter-code-13.ipynb
Kernel: Python 2

spaCy Demo

If you haven't installed spaCy yet, use:

conda install spacy
python -m spacy.en.download

This downloads about 500 MB of data.

Another popular package, nltk, can be installed as follows (you can skip this for now):

conda install nltk
python -m nltk.downloader all

This also downloads a lot of data.

Load StumbleUpon dataset

# Unicode Handling
from __future__ import unicode_literals

import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t', encoding="utf-8")

# The boilerplate column holds JSON with the page's title and body text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()
## Load spacy
from spacy.en import English

nlp_toolkit = English()
nlp_toolkit

Another way to load spaCy:

import spacy
nlp_toolkit = spacy.load("en")
title = u"IBM sees holographic calls, air breathing batteries" parsed = nlp_toolkit(title) for (i, word) in enumerate(parsed): print "Word: {}".format(word) print "\t Phrase type: {}".format(word.dep_) print "\t Is the word a known entity type? {}".format( word.ent_type_ if word.ent_type_ else "No") print "\t Lemma: {}".format(word.lemma_) print "\t Parent of this word: {}".format(word.head.lemma_)

Investigate Page Titles

Let's see if we can find organizations in our page titles.

def references_organization(title):
    parsed = nlp_toolkit(title)
    return any([word.ent_type_ == 'ORG' for word in parsed])

data['references_organization'] = data['title'].fillna(u'').map(references_organization)

# Take a look
data[data['references_organization']][['title']].head()

Exercise:

Write a function to identify titles that mention an organization (ORG) and a person (PERSON).

. . . . . . . .

## Exercise solution
def references_org_person(title):
    parsed = nlp_toolkit(title)
    contains_org = any([word.ent_type_ == 'ORG' for word in parsed])
    contains_person = any([word.ent_type_ == 'PERSON' for word in parsed])
    return contains_org and contains_person

data['references_org_person'] = data['title'].fillna(u'').map(references_org_person)

# Take a look
data[data['references_org_person']][['title']].head()

Predicting "Greenness" Of Content

This dataset comes from StumbleUpon, a web page recommender.

A description of the columns is below:

| FieldName | Type | Description |
|---|---|---|
| url | string | Url of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of article |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with an url with domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`'s text contains more than 30 alphanumerics |
| linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in url |
| parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |

Let's try extracting some of the text content.

Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have recipe in the title?

# Option 1: Create a function to check for this
def has_recipe(text_in):
    try:
        if 'recipe' in str(text_in).lower():
            return 1
        else:
            return 0
    except:
        return 0

data['recipe'] = data['title'].map(has_recipe)

# Option 2: lambda functions
#data['recipe'] = data['title'].map(lambda t: 1 if 'recipe' in str(t).lower() else 0)

# Option 3: string functions
# Note: this version yields booleans, and NaN where the title is missing
data['recipe'] = data['title'].str.contains('recipe')
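To answer the question above, one quick check (a sketch, assuming the recipe flag from the cell above) is to compare the share of evergreen pages within each group:

# Share of evergreen pages (label == 1) with and without 'recipe' in the title.
# Rows whose title was missing (NaN flag) are dropped by groupby.
data.groupby('recipe')['label'].mean()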

Demo: Use of the Count Vectorizer

titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000,
                             ngram_range=(1, 2),
                             stop_words='english',
                             binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles)
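To sanity-check what the vectorizer learned, you can inspect the matrix shape and a few of the learned features (a quick peek, assuming the cell above has run; `get_feature_names` is the accessor in scikit-learn versions contemporary with this notebook):

# One row per title, one column per word/bigram kept by max_features
print(X.shape)
# The first few vocabulary entries, in alphabetical order
print(vectorizer.get_feature_names()[:10])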

Demo: Build a random forest model to predict the evergreenness of a website using the title features

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=20)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles).toarray()
y = data['label']

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))

Exercise: Build a random forest model to predict the evergreenness of a website using the title features and quantitative features

## TODO
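One possible solution (a minimal sketch: the choice of quantitative columns here is an assumption for illustration; swap in any numeric fields from the table above):

import numpy as np

# Title features from the fitted vectorizer, as a dense array
X_titles = vectorizer.transform(titles).toarray()

# A few numeric columns from the dataset (an arbitrary illustrative choice)
quantitative_columns = ['html_ratio', 'image_ratio', 'spelling_errors_ratio']
X_quant = data[quantitative_columns].fillna(0).values

# Stack the two feature blocks side by side
X_combined = np.hstack([X_titles, X_quant])
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_combined, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))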

Exercise: Build a random forest model to predict the evergreenness of a website using the body features

## TODO
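A minimal sketch of one approach: vectorize the body text the same way the titles were vectorized, then cross-validate a random forest on those features:

# Vectorize the page bodies with the same settings used for the titles
body_text = data['body'].fillna('')

body_vectorizer = CountVectorizer(max_features=1000,
                                  ngram_range=(1, 2),
                                  stop_words='english',
                                  binary=True)
body_vectorizer.fit(body_text)
X_body = body_vectorizer.transform(body_text).toarray()
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_body, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))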

Exercise: Use TfidfVectorizer instead of CountVectorizer - is this an improvement?

## TODO
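A minimal sketch: TfidfVectorizer down-weights terms that appear in many documents instead of using binary counts; comparing its cross-validated AUC with the CountVectorizer result above tells you whether it helps on this data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same vocabulary settings as before, but tf-idf weighted instead of binary
tfidf = TfidfVectorizer(max_features=1000,
                        ngram_range=(1, 2),
                        stop_words='english')
X_tfidf = tfidf.fit_transform(titles).toarray()
y = data['label']

model = RandomForestClassifier(n_estimators=20)
scores = cross_val_score(model, X_tfidf, y, scoring='roc_auc')
print('Average AUC {}'.format(scores.mean()))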