Path: blob/master/april_18/lessons/lesson-13/code/solution-code/solution-code-13.ipynb
Kernel: Python 2
In [1]:
Out[1]:
Predicting "Greenness" Of Content
This dataset comes from StumbleUpon, a web page recommender.
A description of the columns is below:
| FieldName | Type | Description |
|---|---|---|
| url | string | URL of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of article |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with a URL with domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`'s text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in url |
| parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
Let's try extracting some of the text content.
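The table lists `boilerplate` as a JSON column, so one way to get at the text is to parse each row and pull out its fields. The sketch below assumes the JSON object carries `title` and `body` keys (matching the table rows above) and uses a tiny made-up frame in place of the real file. The helper name `extract_field` is hypothetical.

```python
import json

import pandas as pd

# Tiny stand-in for the StumbleUpon data (made-up rows, not real records).
data = pd.DataFrame({
    "boilerplate": [
        json.dumps({"title": "Easy Chicken Recipe", "body": "Mix and bake."}),
        json.dumps({"title": None, "body": "Breaking news today."}),
        json.dumps({"body": "No title here."}),
    ],
    "label": [1, 0, 0],
})


def extract_field(raw, field):
    """Parse one boilerplate JSON string; return `field`, or '' if absent or null."""
    value = json.loads(raw).get(field)
    return value if value is not None else ""


data["title"] = data["boilerplate"].map(lambda raw: extract_field(raw, "title"))
data["body"] = data["boilerplate"].map(lambda raw: extract_field(raw, "body"))
print(data[["title", "body"]])
```

Returning an empty string for missing or null fields keeps the later string operations and vectorizers from failing on `None`.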
Create a feature for the title containing 'recipe'. Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?
In [2]:
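The cell's code is not shown here. As a minimal sketch of the exercise, assuming a `title` column that may contain missing values and a binary `label` column, a flag can be built with a case-insensitive substring match and then compared by group. The rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data.
data = pd.DataFrame({
    "title": ["Easy Chicken Recipe", "Election results", None,
              "Best recipes for fall", "Stock market update"],
    "label": [1, 0, 0, 1, 1],
})

# 1 if the lower-cased title contains 'recipe', else 0; missing titles count as 0.
data["recipe"] = (
    data["title"].fillna("").str.lower().str.contains("recipe").astype(int)
)

# Share of evergreen pages (label == 1) with and without 'recipe' in the title.
evergreen_rate = data.groupby("recipe")["label"].mean()
print(evergreen_rate)
```

Because `label` is 0 or 1, its group mean is the fraction of evergreen pages in each group.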
Demo: Use of the Count Vectorizer
In [3]:
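The demo code is not shown here. A small sketch of what `CountVectorizer` does follows, with made-up titles. The parameter values (`max_features`, `ngram_range`, `stop_words`, `binary`) are illustrative assumptions, not necessarily the settings used in the lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles for illustration.
titles = [
    "Easy chicken recipe",
    "Chicken soup recipe for winter",
    "Election results announced",
]

vectorizer = CountVectorizer(
    max_features=1000,     # keep at most the 1000 most frequent terms
    ngram_range=(1, 2),    # single words and two-word phrases
    stop_words="english",  # drop common English words such as "for"
    binary=True,           # record presence (1) or absence (0), not counts
)

# fit_transform learns the vocabulary and returns a sparse document-term matrix.
X = vectorizer.fit_transform(titles)

# vocabulary_ maps each term to its column index in X.
recipe_column = X[:, vectorizer.vocabulary_["recipe"]].toarray().ravel()
print(X.shape)
print(recipe_column)
```

Each row of `X` is one title and each column is one term, so `recipe_column` marks which titles mention "recipe".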
Demo: Build a random forest model to predict the evergreenness of a website using the title features
In [4]:
Out[4]:
CV AUC [ 0.78695201 0.80649177 0.80522998], Average AUC 0.799557921281
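The code that produced this output is not shown. A sketch of how a 3-fold cross-validated AUC of this kind might be computed follows. It assumes a current scikit-learn, where `cross_val_score` lives in `sklearn.model_selection`. The toy titles, `n_estimators=20`, and `random_state=0` are assumptions, so the scores will not match the output above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical toy titles standing in for the real `title` column.
evergreen = ["easy %s recipe" % food for food in
             ["chicken", "pasta", "bread", "soup", "salad", "cake"]]
topical = ["%s news update today" % topic for topic in
           ["election", "stock", "weather", "sports", "celebrity", "traffic"]]
titles = evergreen + topical
y = [1] * len(evergreen) + [0] * len(topical)

# Turn titles into a sparse binary document-term matrix.
X = CountVectorizer(binary=True, stop_words="english").fit_transform(titles)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# With an integer cv and a classifier, the folds are stratified by label.
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=3)
print("CV AUC {}, Average AUC {}".format(scores, scores.mean()))
```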
Exercise: Build a random forest model to predict the evergreenness of a website using the title features and quantitative features
In [5]:
Out[5]:
CV AUC [ 0.78500263 0.79911166 0.79822481], Average AUC 0.794113032767
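The solution code is not shown. One way to combine text and numeric features is to stack the sparse title matrix next to the numeric columns with `scipy.sparse.hstack`, sketched below. The choice of `html_ratio` and `image_ratio` as the quantitative columns is an assumption, and all values are made up.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical toy frame: column names follow the table above, values are made up.
rng = np.random.RandomState(0)
foods = ["chicken", "pasta", "bread", "soup", "salad", "cake"]
topics = ["election", "stock", "weather", "sports", "celebrity", "traffic"]
data = pd.DataFrame({
    "title": ["easy %s recipe" % f for f in foods]
             + ["%s news update today" % t for t in topics],
    "html_ratio": rng.uniform(0.1, 0.4, size=12),
    "image_ratio": rng.uniform(0.0, 1.0, size=12),
    "label": [1] * 6 + [0] * 6,
})

# Sparse matrix of title terms.
X_text = CountVectorizer(binary=True, stop_words="english").fit_transform(data["title"])

# Numeric columns, with missing values filled, converted to sparse for stacking.
X_quant = csr_matrix(data[["html_ratio", "image_ratio"]].fillna(0).values)

# Place the two feature blocks side by side: same rows, more columns.
X_combined = hstack([X_text, X_quant]).tocsr()

model = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(model, X_combined, data["label"], scoring="roc_auc", cv=3)
print("CV AUC {}, Average AUC {}".format(scores, scores.mean()))
```

Random forests do not require feature scaling, so the raw ratios can sit next to the 0/1 term indicators.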
Exercise: Build a random forest model to predict the evergreenness of a website using the body features
In [6]:
Out[6]:
CV AUC [ 0.83658603 0.84479776 0.83873979], Average AUC 0.840041195125
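The solution code is not shown. For body text, one option is to wrap the vectorizer and the forest in a `Pipeline`, so the vocabulary is learned only from the training portion of each fold. This is a design choice made for this sketch, not necessarily how the lesson does it, and the body strings are made up.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy bodies standing in for the real `body` column.
foods = ["chicken", "pasta", "bread", "soup", "salad", "cake"]
topics = ["election", "stock", "weather", "sports", "celebrity", "traffic"]
bodies = (["mix the %s and bake until golden" % f for f in foods]
          + ["%s report released this morning" % t for t in topics])
y = [1] * len(foods) + [0] * len(topics)

# The pipeline refits the vectorizer inside every fold, so held-out
# documents never influence the vocabulary.
pipeline = make_pipeline(
    CountVectorizer(binary=True, stop_words="english"),
    RandomForestClassifier(n_estimators=20, random_state=0),
)

scores = cross_val_score(pipeline, bodies, y, scoring="roc_auc", cv=3)
print("CV AUC {}, Average AUC {}".format(scores, scores.mean()))
```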
Exercise: Use TfidfVectorizer instead of CountVectorizer. Is this an improvement?
In [7]:
Out[7]:
CV AUC [ 0.84268957 0.85274835 0.8393846 ], Average AUC 0.844940841904
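To see why the two vectorizers can behave differently, the sketch below fits `TfidfVectorizer` on three made-up documents and inspects its learned inverse document frequencies. A term that appears in every document gets the lowest weight, while rarer terms are weighted up. The documents are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "recipe" appears in all three, the other words in one each.
docs = ["recipe recipe chicken", "recipe pasta", "recipe news"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Pair each term with its idf weight; vocabulary_ gives the column index per term.
idf = {term: tfidf.idf_[column] for term, column in tfidf.vocabulary_.items()}
for term in sorted(idf):
    print(term, round(idf[term], 3))
```

In a pipeline, switching from `CountVectorizer` to `TfidfVectorizer` is a one-line change, so the two can be compared under the same cross-validation setup.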