Path: blob/master/april_18/lessons/lesson-18/code/starter-code/starter-code-18.ipynb
Predicting "Greenness" Of Content
This dataset comes from StumbleUpon, a web page recommender.
A description of the columns is below.
FieldName | Type | Description |
---|---|---|
url | string | Url of the webpage to be classified |
title | string | Title of the article |
body | string | Body text of article |
urlid | integer | StumbleUpon's unique identifier for each url |
boilerplate | json | Boilerplate text |
alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
avglinksize | double | Average number of words in each link |
commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
embed_ratio | double | Count of number of `<embed>` usage |
frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
frameTagRatio | double | Ratio of iframe markups over total number of markups |
hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with a url containing the domain |
html_ratio | double | Ratio of tags vs text in the page |
image_ratio | double | Ratio of `<img>` tags vs text in the page |
is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`'s text contains more than 30 alphanumeric characters |
linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
numberOfLinks | integer | Number of `<a>` markups |
numwords_in_url | double | Number of words in url |
parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
Review: Use of the Count Vectorizer
We previously used the Count Vectorizer to extract text features for this classification task.
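As a quick reminder, here is a minimal sketch of that step. The file path, the handling of the boilerplate JSON, and the vectorizer settings are assumptions rather than the exact starter code:

```python
import json
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the training data; the path and missing-value marker are assumptions about the local setup
data = pd.read_csv('train.tsv', sep='\t', na_values=['?'])

# Pull the article titles out of the boilerplate JSON (missing titles become empty strings)
titles = data['boilerplate'].map(lambda x: json.loads(x).get('title', '') or '')
y = data['label']

# Learn a vocabulary from the titles and build the document-term matrix
vectorizer = CountVectorizer(max_features=1000, stop_words='english', binary=True)
X = vectorizer.fit_transform(titles)
```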
Review: Build a model to predict evergreenness of a website
Then we used those features to build a classification model.
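Continuing from the `X` and `y` built in the sketch above, a simple classifier with a cross-validated AUC check might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fit a logistic regression on the bag-of-words features and score it with cross-validation
model = LogisticRegression()
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print('CV AUC: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
```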
Demo: Pipelines
Often we will want to combine these steps so we can evaluate some future dataset. For that incoming data, we need to make sure we perform exactly the same transformations: if `has_brownies_in_text` is column 19 at training time, it must also be column 19 at evaluation time.
Pipelines combine all of the pre-processing steps and model building into a single object.
Rather than manually applying each transformer and then feeding the results into the model, pipelines tie these steps together. Like models and vectorizers in scikit-learn, they expose `fit` and `predict` (or `predict_proba`) methods just as any model would, but they also ensure the proper data transformations are performed first.
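For example, a pipeline that chains the vectorizer and classifier from the sketches above (again illustrative, reusing the assumed `titles` and `y`):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vec', CountVectorizer(max_features=1000, stop_words='english', binary=True)),
    ('model', LogisticRegression()),
])

# Split on the raw text; the pipeline re-applies the fitted vectorizer at prediction time,
# so the columns of the feature matrix line up automatically
train_text, test_text, y_train, y_test = train_test_split(titles, y, random_state=42)
pipeline.fit(train_text, y_train)
probabilities = pipeline.predict_proba(test_text)[:, 1]
```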
Exercise: Add a MaxAbsScaler scaling step to the pipeline as well; it should occur after the vectorization step.
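One possible arrangement for this exercise (a sketch under the same assumptions as above, not the checked solution):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# MaxAbsScaler handles sparse matrices, so it can sit directly after the vectorizer
scaled_pipeline = Pipeline([
    ('vec', CountVectorizer(max_features=1000, stop_words='english')),
    ('scale', MaxAbsScaler()),
    ('model', LogisticRegression()),
])
```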
Additionally, if we want to merge many different feature sets automatically, we can use FeatureUnion.
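A brief sketch of the idea, combining two text-feature extractors over the same input (the particular transformers chosen here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# FeatureUnion fits each transformer on the same input and concatenates their outputs side by side
features = FeatureUnion([
    ('counts', CountVectorizer(max_features=1000, stop_words='english')),
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
])

union_pipeline = Pipeline([
    ('features', features),
    ('model', LogisticRegression()),
])
```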