Kernel: Python 2
spaCy Demo
If you haven't installed spaCy yet, install it first. This downloads about 500 MB of data.
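A typical install looks like the following (a sketch; the model-download command differs across spaCy versions, and older 1.x releases used `python -m spacy.en.download all`):

```shell
# Install spaCy, then fetch the English model data (the ~500 MB download).
pip install spacy
python -m spacy download en
```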
Another popular package, nltk, can be installed as follows (you can skip this for now). It also downloads a lot of data.
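A sketch of the nltk install (the `all` collection is several gigabytes; you can download individual corpora instead):

```shell
# Install nltk, then fetch its corpora via the command-line downloader.
pip install nltk
python -m nltk.downloader all
```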
Load StumbleUpon dataset
In [ ]:
In [ ]:
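A sketch of what the loading cell likely did. The real notebook reads `train.tsv`; here a two-row inline sample (hypothetical rows) keeps the cell self-contained, and the `boilerplate` JSON is unpacked into `title` and `body` columns:

```python
import csv
import json
from io import StringIO

import pandas as pd

# In the notebook you would load the real file, e.g.:
#   data = pd.read_csv('train.tsv', sep='\t', na_values=['?'])
# A tiny inline sample (hypothetical rows) keeps this cell self-contained.
sample = StringIO(
    "url\tboilerplate\tlabel\n"
    'http://example.com/cake\t{"title": "Best chocolate cake", "body": "Mix flour and sugar."}\t1\n'
    'http://example.com/news\t{"title": "Election results", "body": "Votes were counted."}\t0\n'
)
data = pd.read_csv(sample, sep='\t', quoting=csv.QUOTE_NONE)

# The boilerplate column holds JSON; unpack title and body into columns.
data['boilerplate'] = data['boilerplate'].map(json.loads)
data['title'] = data['boilerplate'].map(lambda d: d.get('title', ''))
data['body'] = data['boilerplate'].map(lambda d: d.get('body', ''))
print(data[['title', 'label']])
```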
Another way to load spaCy:
In [ ]:
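One possibility for what this cell showed; the exact call depends on the spaCy version (1.x also offered `from spacy.en import English`), so this is a hedged sketch, guarded so it runs even without the model installed:

```python
# Load an English pipeline. On modern spaCy the model name is
# 'en_core_web_sm'; older releases used plain 'en'. Guard the load so the
# cell still runs if the package or model data is missing.
nlp = None
try:
    import spacy
    nlp = spacy.load('en')
except Exception:
    print("spaCy English model not available; install it first")
```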
Investigate Page Titles
Let's see if we can find organizations in our page titles.
In [ ]:
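A sketch of the kind of cell that fits here, assuming an `nlp` pipeline from `spacy.load('en')`: run the named-entity recognizer over a few hypothetical titles and keep the ORG entities. Guarded so the cell runs even without the model:

```python
titles = [
    "Google announces new search features",
    "10 easy weeknight dinner recipes",
]
try:
    import spacy
    nlp = spacy.load('en')
    for title in titles:
        doc = nlp(title)
        # Keep only entities spaCy labels as organizations
        orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
        print(title, '->', orgs)
except Exception:
    print("spaCy model not available; skipping the NER demo")
```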
Exercise:
Write a function to identify titles that mention an organization (ORG) and a person (PERSON).
In [ ]:
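One possible solution sketch: the label check is kept as a pure function (testable without spaCy), and the spaCy wiring is shown as a comment, assuming an `nlp` pipeline is loaded:

```python
def has_org_and_person(labels):
    """True if the iterable of entity labels contains both ORG and PERSON."""
    labels = set(labels)
    return 'ORG' in labels and 'PERSON' in labels

# Hypothetical wiring with spaCy (assumes nlp = spacy.load('en')):
# def title_mentions_org_and_person(title):
#     return has_org_and_person(ent.label_ for ent in nlp(title).ents)

print(has_org_and_person(['ORG', 'PERSON', 'GPE']))  # True
print(has_org_and_person(['ORG']))                   # False
```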
Predicting "Greenness" Of Content
This dataset comes from StumbleUpon, a web page recommender.
A description of the columns is below:
| FieldName | Type | Description |
|---|---|---|
| url | string | URL of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of the article |
| urlid | integer | StumbleUpon's unique identifier for each URL |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with a URL containing the page's domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`s' text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in hyperlink text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in the URL |
| parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label, either evergreen (1) or non-evergreen (0); available for train.tsv only |
Let's try extracting some of the text content.
Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?
In [ ]:
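A sketch with hypothetical rows; in the notebook, `data` would come from train.tsv with the title extracted from the boilerplate JSON:

```python
import pandas as pd

# Hypothetical sample; replace with the real DataFrame loaded from train.tsv.
data = pd.DataFrame({
    'title': ['Best chocolate cake recipe', 'Election results',
              'Easy soup recipe', 'Stock market update'],
    'label': [1, 0, 1, 0],  # 1 = evergreen
})
data['recipe'] = data['title'].str.lower().str.contains('recipe').astype(int)

# Mean label per group = % of evergreen pages with/without 'recipe' in the title
print(data.groupby('recipe')['label'].mean())
```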
Demo: Use of the CountVectorizer
In [ ]:
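A minimal sketch of the demo on a few hypothetical titles: `CountVectorizer` builds a vocabulary and returns a sparse document-term count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    'Best chocolate cake recipe',
    'Easy chocolate dessert recipe',
    'Election results update',
]
vectorizer = CountVectorizer()          # lowercases and tokenizes by default
X = vectorizer.fit_transform(titles)    # sparse (n_docs, n_terms) count matrix

print(X.shape)
print(sorted(vectorizer.vocabulary_))   # the learned vocabulary
```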
Demo: Build a random forest model to predict the evergreenness of a website using the title features
In [ ]:
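A sketch of the demo on hypothetical stand-in data; the real cell would use the `title` and `label` columns from train.tsv. (Note this notebook's era imported `cross_val_score` from `sklearn.cross_validation`; modern scikit-learn uses `sklearn.model_selection`.)

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for data['title'] and data['label']
titles = ['chocolate cake recipe', 'easy soup recipe', 'grilled cheese recipe',
          'election results today', 'stock market news', 'sports scores update']
y = [1, 1, 1, 0, 0, 0]  # 1 = evergreen

X = CountVectorizer().fit_transform(titles)
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
print('AUC: %.3f' % scores.mean())
```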
Exercise: Build a random forest model to predict the evergreenness of a website using the title features and quantitative features
In [ ]:
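One possible sketch: combine the sparse title counts with numeric columns (e.g. `html_ratio` and `image_ratio`; the values below are hypothetical) using `scipy.sparse.hstack`:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

titles = ['chocolate cake recipe', 'easy soup recipe', 'grilled cheese recipe',
          'election results today', 'stock market news', 'sports scores update']
# Hypothetical html_ratio and image_ratio columns, one row per page
quantitative = np.array([[0.21, 0.05], [0.18, 0.02], [0.25, 0.08],
                         [0.40, 0.01], [0.35, 0.03], [0.38, 0.02]])
y = [1, 1, 1, 0, 0, 0]

X_title = CountVectorizer().fit_transform(titles)
X = hstack([X_title, quantitative]).tocsr()   # text counts + numeric features

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
print('train accuracy:', model.score(X, y))
```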
Exercise: Build a random forest model to predict the evergreenness of a website using the body features
In [ ]:
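The same pattern applies to the body text; a sketch with hypothetical snippets (real bodies are long, so capping the vocabulary with `max_features` keeps the matrix manageable):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for data['body'] and data['label']
bodies = ['mix flour sugar and butter then bake the cake',
          'simmer the vegetables in broth for the soup',
          'butter the bread and grill until the cheese melts',
          'the votes were counted late into the night',
          'shares fell sharply after the earnings report',
          'the home team won the final game of the season']
y = [1, 1, 1, 0, 0, 0]

X = CountVectorizer(max_features=1000).fit_transform(bodies)
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=3)
print('accuracy: %.3f' % scores.mean())
```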
Exercise: Use TfidfVectorizer instead of CountVectorizer. Is this an improvement?
In [ ]:
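One way to sketch the comparison on hypothetical titles: score both vectorizers with the same model and cross-validation. Putting the vectorizer inside a `Pipeline` means it is fit only on the training folds, avoiding vocabulary leakage into the validation folds:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

titles = ['chocolate cake recipe', 'easy soup recipe', 'grilled cheese recipe',
          'election results today', 'stock market news', 'sports scores update']
y = [1, 1, 1, 0, 0, 0]

results = {}
for name, vec in [('count', CountVectorizer()), ('tfidf', TfidfVectorizer())]:
    pipe = make_pipeline(vec, RandomForestClassifier(n_estimators=50,
                                                     random_state=42))
    results[name] = cross_val_score(pipe, titles, y, cv=3,
                                    scoring='roc_auc').mean()
print(results)   # compare the two mean AUCs
```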
In [ ]: