
Predicting "Greenness" Of Content
Authors: Joseph Nelson (DC), Kiefer Katovich (SF)
This dataset comes from StumbleUpon, a web page recommender, and was made available here.
A description of the columns is below:
| FieldName | Type | Description |
|---|---|---|
| url | string | Url of the webpage to be classified |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonLinkRatio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 |
| linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in url |
| parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
1. Load the data
- Note it is a `.tsv` file and has a tab separator instead of a comma.
- Clean the `is_news` column.
- Make two new columns, `title` and `body`, from the `boilerplate` column.

Note: The `boilerplate` column is in JSON dictionary format. You can use the `json.loads()` function from the `json` module to convert this into a Python dictionary.
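
A minimal sketch of the loading steps above, assuming pandas is available. The two sample rows are made up so the example runs without `train.tsv`, and the `load_evergreen` helper is a hypothetical name. It also assumes that missing `is_news` values appear as `?` and can be treated as 0 ("not news"); check this against the real file before relying on it.

```python
import io
import json

import pandas as pd

# Two made-up rows standing in for train.tsv (tab-separated).
# The boilerplate field is a quoted JSON string with doubled inner quotes.
SAMPLE_TSV = (
    "url\tboilerplate\tis_news\tlabel\n"
    'http://example.com/a\t"{""title"": ""Easy bread"", ""body"": ""Mix flour and water""}"\t1\t1\n'
    'http://example.com/b\t"{""title"": ""Match report"", ""body"": null}"\t?\t0\n'
)


def load_evergreen(source):
    """Read the tab-separated data and derive title/body columns."""
    # Assumption: "?" marks a missing value, so read it as NaN.
    df = pd.read_csv(source, sep="\t", na_values="?")
    # Assumption: a missing is_news flag means "not news", so fill with 0.
    df["is_news"] = df["is_news"].fillna(0).astype(int)
    # Each boilerplate cell is a JSON string; parse it into a dict once.
    parsed = df["boilerplate"].apply(json.loads)
    # .get() returns None when a key is absent instead of raising KeyError.
    df["title"] = parsed.apply(lambda d: d.get("title"))
    df["body"] = parsed.apply(lambda d: d.get("body"))
    return df


df = load_evergreen(io.StringIO(SAMPLE_TSV))
print(df[["is_news", "title", "body"]])
```

For the real data, pass the path to the `.tsv` file instead of the `io.StringIO` object.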
2. What are 'evergreen' sites?
- These are websites that are always relevant, like recipes or reviews (as opposed to current events).
- Stored as a binary indicator in the `label` column.
- Look at some examples.
3. Does being a news site affect green-ness?
3.A Investigate with plots/EDA.
3.B Test the hypothesis with a logistic regression using statsmodels.
Hint: The `sm.logit` function from `statsmodels.formula.api` will perform a logistic regression using a formula string.
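
As an illustration of the formula interface, here is a small sketch on made-up counts (not the real data). The `toy` DataFrame and its values are assumptions chosen only so the model fits cleanly.

```python
import pandas as pd
import statsmodels.formula.api as sm

# Made-up data: 10 news pages (6 evergreen) and 10 non-news pages (4 evergreen).
toy = pd.DataFrame({
    "is_news": [1] * 10 + [0] * 10,
    "label": [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6,
})

# "label ~ is_news" models label as a function of is_news;
# the formula interface adds an intercept automatically.
model = sm.logit("label ~ is_news", data=toy).fit(disp=False)
print(model.summary())
```

The summary table lists one row for `Intercept` and one for `is_news`, each with a coefficient on the log-odds scale, a standard error, and a p-value.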
3.C Interpret the results of your model.
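
When interpreting, it can help to move from log-odds to odds ratios and probabilities. The sketch below uses hypothetical counts (not from the real data). With a single binary predictor the fitted logistic regression has a closed form, so no fitting library is needed to see how the pieces relate.

```python
import math

# Hypothetical 2x2 counts: evergreen vs. not, split by is_news.
news_evergreen, news_not = 60, 40
other_evergreen, other_not = 40, 60

# With one binary predictor, the maximum-likelihood fit reproduces the group rates:
# intercept = log-odds of evergreen when is_news == 0,
# slope     = log of the odds ratio between the two groups.
intercept = math.log(other_evergreen / other_not)
slope = math.log((news_evergreen / news_not) / (other_evergreen / other_not))

# Exponentiating the slope gives the odds ratio.
odds_ratio = math.exp(slope)


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


p_other = sigmoid(intercept)         # P(evergreen | is_news == 0)
p_news = sigmoid(intercept + slope)  # P(evergreen | is_news == 1)

print(f"odds ratio: {odds_ratio:.2f}, p_other: {p_other:.2f}, p_news: {p_news:.2f}")
```

An odds ratio above 1 means the odds of being evergreen are higher when `is_news` is 1; below 1 means they are lower.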
4. Does the website category affect green-ness?
4.A Investigate with plots/EDA.
4.B Test the hypothesis with a logistic regression.
4.C Interpret the model results.
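
One possible way to put a categorical column such as `alchemy_category` into the formula is to wrap it in `C(...)`, which dummy-codes it against a reference level. The sketch below uses made-up categories and counts, not the real data.

```python
import pandas as pd
import statsmodels.formula.api as sm

# Made-up counts per category: (evergreen, not evergreen).
counts = {"business": (5, 5), "recreation": (7, 3), "sports": (2, 8)}
rows = []
for category, (n_evergreen, n_not) in counts.items():
    rows += [(category, 1)] * n_evergreen + [(category, 0)] * n_not
toy = pd.DataFrame(rows, columns=["alchemy_category", "label"])

# C(...) marks the column as categorical; one level becomes the reference
# and every other level gets its own coefficient.
model = sm.logit("label ~ C(alchemy_category)", data=toy).fit(disp=False)

print(model.params)
# Likelihood-ratio test of the whole model against an intercept-only model.
print("LLR p-value:", model.llr_pvalue)
```

Each `[T.level]` coefficient compares that level with the reference level (here the alphabetically first one, `business`), so the individual p-values answer "does this category differ from the reference?", while the likelihood-ratio p-value addresses the category variable as a whole.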
5. Does the image ratio affect green-ness?
5.A Investigate with plots/EDA.
5.B Test the hypothesis using a logistic regression.
Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.
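
Two possible re-representations are sketched below, assuming the variable is heavily right-skewed. The values are made up for illustration, and neither transformation is claimed to be the intended answer.

```python
import pandas as pd

# Made-up, heavily right-skewed values standing in for image_ratio.
image_ratio = pd.Series([0.01, 0.02, 0.03, 0.05, 0.10, 0.25, 1.5, 40.0])

# Option 1: percentile rank. Each value maps into (0, 1],
# so a few very large ratios no longer dominate the fit.
image_ratio_pctl = image_ratio.rank(pct=True)

# Option 2: quartile bins, to be used as a categorical predictor.
image_ratio_bin = pd.qcut(image_ratio, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({
    "image_ratio": image_ratio,
    "pctl": image_ratio_pctl,
    "bin": image_ratio_bin,
}))
```

With the percentile version, the coefficient describes the change in log-odds from the lowest to the highest percentile, which is usually easier to interpret than a change of one raw unit.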
5.C Interpret the model.
6. Fit a logistic regression with multiple predictors.
The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.
Do any EDA you may need.
Interpret the coefficients of the model.
Tip: This pdf is very useful for an overview of interpreting logistic regression coefficients.
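
A sketch of a multi-predictor fit on simulated data follows. The predictors are picked arbitrarily from the column list, and the "true" coefficients used to simulate `label` are invented, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(0)
n = 400

# Simulated predictors that borrow column names from the dataset description.
toy = pd.DataFrame({
    "is_news": rng.integers(0, 2, size=n),
    "numwords_in_url": rng.integers(0, 12, size=n),
    "html_ratio": rng.uniform(0.05, 0.5, size=n),
})

# Simulate label from an arbitrary model so there is a signal to recover.
log_odds = (
    -0.5
    + 0.8 * toy["is_news"]
    + 0.1 * toy["numwords_in_url"]
    - 2.0 * toy["html_ratio"]
)
toy["label"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Several predictors are joined with "+" in the formula string.
model = sm.logit(
    "label ~ is_news + numwords_in_url + html_ratio", data=toy
).fit(disp=False)

# exp(coefficient) is the multiplicative change in odds per one-unit increase,
# holding the other predictors fixed.
odds_ratios = np.exp(model.params)
print(pd.DataFrame({
    "coef": model.params,
    "odds_ratio": odds_ratios,
    "p_value": model.pvalues,
}))
```

Because the predictors sit on different scales, compare odds ratios with each variable's unit in mind, or standardize continuous predictors before fitting.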