Predicting Evergreeness of Content with Decision Trees and Random Forests
Predicting "Greenness" Of Content
This dataset comes from StumbleUpon, a web page recommender. The columns are described below:
FieldName | Type | Description |
---|---|---|
url | string | Url of the webpage to be classified |
title | string | Title of the article |
body | string | Body text of article |
urlid | integer | StumbleUpon's unique identifier for each url |
boilerplate | json | Boilerplate text |
alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
avglinksize | double | Average number of words in each link |
commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
embed_ratio | double | Count of number of `<embed>` usage |
frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
frameTagRatio | double | Ratio of iframe markups over total number of markups |
hasDomainLink | integer (0 or 1) | True (1) if it contains an |
html_ratio | double | Ratio of tags vs text in the page |
image_ratio | double | Ratio of `<img>` tags vs text in the page |
is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 |
linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
numberOfLinks | integer | Number of `<a>` markups |
numwords_in_url | double | Number of words in url |
parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or it has an attached onClick event |
spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
What are 'evergreen' sites?
Evergreen sites are those that are always relevant. As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season.
A sample of URLs is below, where label = 1 are 'evergreen' websites
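A minimal sketch of how such a sample could be pulled up with pandas. The rows here are invented stand-ins (the `example.com` URLs are hypothetical), and the file path in the comment is an assumption; only the `url` and `label` column names come from the table above.

```python
import pandas as pd

# Toy stand-in for the StumbleUpon data: these rows are invented for
# illustration. With the real file you would load it instead, e.g.
# data = pd.read_csv("train.tsv", sep="\t")  # path is an assumption
data = pd.DataFrame({
    "url": [
        "http://example.com/recipes/banana-bread",
        "http://example.com/news/election-results",
        "http://example.com/how-to/tie-a-tie",
        "http://example.com/weather/weekend-forecast",
    ],
    "label": [1, 0, 1, 0],
})

# Split the URLs by class so the two groups can be compared by eye.
evergreen = data.loc[data["label"] == 1, "url"]
non_evergreen = data.loc[data["label"] == 0, "url"]

print("Evergreen:", evergreen.tolist())
print("Non-evergreen:", non_evergreen.tolist())
```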
Exercises to Get Started
Exercise: 1. In a group: Brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.
Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?
E.g., if you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?
E.g., if you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?
Split up and develop 1-3 of those features independently.
Exercise: 3. Does being a news site affect evergreeness?
Compute or plot the percentage of evergreen sites among news-related and non-news sites.
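One possible way to compute this, sketched on invented toy rows: because `label` is 0 or 1, its mean within each `is_news` group is the share of evergreen pages in that group. The sketch assumes `is_news` is already coded as 0/1 as in the column table; the real file may need cleaning first.

```python
import pandas as pd

# Invented toy rows; only the column names come from the dataset description.
data = pd.DataFrame({
    "is_news": [1, 1, 0, 0, 1, 0],
    "label":   [1, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 label is the fraction of evergreen pages in each group.
evergreen_pct = data.groupby("is_news")["label"].mean() * 100
print(evergreen_pct.round(1))
```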
Exercise: 4. Does category in general affect evergreeness?
Plot the rate of evergreen sites for all Alchemy categories.
Exercise: 5. How many articles are there per category?
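Exercises 4 and 5 can be approached with a single grouped summary. The following is a sketch on invented toy rows, not the lesson's reference solution; the category values are made up for illustration.

```python
import pandas as pd

# Invented toy rows standing in for the real data.
data = pd.DataFrame({
    "alchemy_category": ["recreation", "recreation", "business",
                         "business", "business", "sports"],
    "label": [1, 1, 1, 0, 0, 0],
})

# Per category: share of evergreen pages and number of articles.
summary = (
    data.groupby("alchemy_category")["label"]
    .agg(evergreen_rate="mean", n_articles="count")
    .sort_values("evergreen_rate", ascending=False)
)
print(summary)

# If matplotlib is installed, the rates could be charted with:
# summary["evergreen_rate"].plot(kind="bar")
```

Looking at the counts next to the rates helps flag categories where the rate rests on only a handful of articles.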
Let's try extracting some of the text content.
Exercise: 6. Create a feature for the title containing 'recipe'.
Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?
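A minimal sketch of one such feature, assuming the title is available as a string column named `title` and may contain missing values. The titles below are invented, and the new column name `recipe` is a choice made for this example.

```python
import pandas as pd

# Invented toy titles; None stands in for a page with no title.
data = pd.DataFrame({
    "title": [
        "Easy Banana Bread Recipe",
        "Election results tonight",
        None,
        "10 recipes for summer",
        "Stock market update",
    ],
    "label": [1, 0, 0, 1, 0],
})

# 1 if the title mentions "recipe" (case-insensitive), else 0.
# Missing titles are treated as empty strings so they map to 0.
data["recipe"] = (
    data["title"].fillna("").str.contains("recipe", case=False).astype(int)
)

# Share of evergreen pages with and without the keyword.
rate_by_recipe = data.groupby("recipe")["label"].mean()
print(rate_by_recipe)
```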
Let's Explore Some Decision Trees
Demo: Build a decision tree model to predict the "evergreeness" of a given website.
Decision Trees in scikit-learn
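A minimal sketch of fitting a decision tree with scikit-learn. Synthetic data from `make_classification` stands in for the real feature matrix; with the StumbleUpon data, `X` would presumably hold columns such as `image_ratio`, `html_ratio`, and the `recipe` flag, and `y` would be `label`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels (an assumption
# made so the example runs without the dataset).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

# Fit an unrestricted tree; random_state makes tie-breaking repeatable.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predicted probability of the positive ("evergreen") class.
proba = model.predict_proba(X[:5])[:, 1]
print(proba)
print("Tree depth:", model.get_depth())
```

Accuracy measured on the same rows used for fitting is optimistic for an unrestricted tree, which is why the next step uses cross-validation.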
Exercise: Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.
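One way this evaluation might look, again on synthetic stand-in data. `cross_val_score` with `scoring="roc_auc"` returns one AUC per fold; the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

model = DecisionTreeClassifier(random_state=0)

# One AUC score per held-out fold.
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)
print("Fold AUCs:", scores.round(3))
print("Mean AUC: ", scores.mean().round(3))
```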
Adjusting Decision Trees to Avoid Overfitting
Demo: Control for overfitting in the decision tree model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf).
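A sketch comparing an unrestricted tree with two constrained ones on synthetic stand-in data. The specific values (`max_depth=3`, `min_samples_leaf=10`) are arbitrary picks for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0,
)

candidates = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "min_samples_leaf=10": DecisionTreeClassifier(
        min_samples_leaf=10, random_state=0
    ),
}

results = {}
for name, tree in candidates.items():
    # Cross-validated AUC estimates out-of-sample performance.
    cv_auc = cross_val_score(tree, X, y, scoring="roc_auc", cv=5).mean()
    # Fit on all rows only to report how deep the tree grows.
    depth = tree.fit(X, y).get_depth()
    results[name] = {"cv_auc": cv_auc, "depth": depth}
    print(f"{name:>20}: depth={depth}, CV AUC={cv_auc:.3f}")
```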
Demo: Build a random forest model to predict the evergreeness of a website.
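A minimal random forest sketch on synthetic stand-in data. The value `n_estimators=20` is an arbitrary starting point chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)

# A forest averages many trees, each grown on a bootstrap sample of rows
# with a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)

# Predicted probability of the positive ("evergreen") class.
forest_proba = forest.predict_proba(X[:5])[:, 1]
print(forest_proba)
```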
Demo: Extracting importance of features
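A fitted forest exposes `feature_importances_`, one value per input column. In this sketch the column names are borrowed from the dataset description but attached to synthetic data, so the printed ranking says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Real column names attached to synthetic values (illustration only).
feature_names = ["image_ratio", "html_ratio", "numberOfLinks", "recipe"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
features = pd.DataFrame(X, columns=feature_names)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(features, y)

# Impurity-based importances, normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```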
Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.
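One possible shape for this exercise, on synthetic stand-in data: loop over a few forest sizes and record the mean cross-validated AUC for each. The sizes tried here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0,
)

auc_by_n = {}
for n_trees in [1, 10, 50]:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X, y, scoring="roc_auc", cv=5)
    auc_by_n[n_trees] = scores.mean()
    print(f"n_estimators={n_trees:>3}: mean CV AUC = {scores.mean():.3f}")
```

Gains typically flatten as trees are added, so the useful question is where extra estimators stop paying for their training time.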
Independent Practice: Evaluate Random Forest Using Cross-Validation
Continue adding input variables to the model that you think may be relevant
For each feature:
Evaluate the model for improved predictive performance using cross-validation
Evaluate the importance of the feature
Bonus: Just like the 'recipe' feature, add in similar text features and evaluate their performance.
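The practice loop above (add a feature, re-run cross-validation, check its importance) could be organized roughly as follows. This is a sketch under stated assumptions: the column names come from the dataset description, but the values are synthetic, the baseline feature set is an arbitrary choice, and `cv_auc` is a helper defined only for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Real column names attached to synthetic values (illustration only).
columns = ["image_ratio", "html_ratio", "recipe", "numberOfLinks"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    random_state=0,
)
data = pd.DataFrame(X, columns=columns)


def cv_auc(feature_list):
    """Mean 5-fold cross-validated AUC using only the given columns."""
    forest = RandomForestClassifier(n_estimators=30, random_state=0)
    scores = cross_val_score(
        forest, data[feature_list], y, scoring="roc_auc", cv=5
    )
    return scores.mean()


baseline = ["image_ratio", "html_ratio"]
baseline_auc = cv_auc(baseline)

# Change in AUC when each candidate is added to the baseline on its own.
gains = {
    candidate: cv_auc(baseline + [candidate]) - baseline_auc
    for candidate in ["recipe", "numberOfLinks"]
}

# Importance of every feature in a forest fitted on all columns.
full_forest = RandomForestClassifier(n_estimators=30, random_state=0)
full_forest.fit(data[columns], y)
importance = dict(zip(columns, full_forest.feature_importances_))

print(f"Baseline AUC: {baseline_auc:.3f}")
for candidate, gain in gains.items():
    print(f"{candidate}: AUC change {gain:+.3f}, "
          f"importance {importance[candidate]:.3f}")
```

New keyword flags built like the 'recipe' feature would slot into the candidate list in the same way.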