Predicting "Greenness" Of Content
Authors: Joseph Nelson (DC), Kiefer Katovich (SF)
This dataset comes from StumbleUpon, a web page recommender.
A description of the columns is below.
FieldName | Type | Description |
---|---|---|
url | string | Url of the webpage to be classified |
urlid | integer | StumbleUpon's unique identifier for each url |
boilerplate | json | Boilerplate text |
alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
avglinksize | double | Average number of words in each link |
commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
commonLinkRatio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
embed_ratio | double | Count of number of usage |
frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
frameTagRatio | double | Ratio of iframe markups over total number of markups |
hasDomainLink | integer (0 or 1) | True (1) if it contains an |
html_ratio | double | Ratio of tags vs text in the page |
image_ratio | double | Ratio of |
is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 |
linkwordscore | double | Percentage of words on the page that are in hyperlink's text |
news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
non_markup_alphanum_characters | integer | Page's text's number of alphanumeric characters |
numberOfLinks | integer | Number of markups |
numwords_in_url | double | Number of words in url |
parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
1. Load the data
Note it is a `.tsv` file and has a tab separator instead of a comma.
Clean the `is_news` column.
Make two new columns, `title` and `body`, from the `boilerplate` column.
Note: The `boilerplate` column is in JSON dictionary format. You can use the `json.loads()` function from the `json` module to convert this into a Python dictionary.
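The loading steps above can be sketched roughly as follows. This is a minimal sketch on three made-up rows rather than the real file; the row values, and the assumption that a missing `is_news` value is written as `?` and should count as 0, are illustrative and should be checked against the actual data.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for train.tsv: three made-up rows, written out as
# tab-separated text so that the loading step has something to read.
toy = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "boilerplate": [
        json.dumps({"title": "Best pancakes", "body": "Mix flour and milk"}),
        json.dumps({"title": "Election results", "body": "Votes counted"}),
        json.dumps({"body": "No title here"}),
    ],
    "is_news": ["1", "?", "1"],
    "label": [1, 0, 0],
})
tsv_text = toy.to_csv(sep="\t", index=False)

# Load: sep="\t" is the only change from reading a comma-separated file.
# With the real file, pass its path instead of the StringIO object.
data = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Clean is_news. Assumption: anything that is not a number (here "?")
# is treated as "not news" and set to 0.
data["is_news"] = pd.to_numeric(data["is_news"], errors="coerce").fillna(0).astype(int)

# Parse each boilerplate string once, then pull out the two keys.
# dict.get returns None when a key is absent instead of raising KeyError.
parsed = [json.loads(raw) for raw in data["boilerplate"]]
data["title"] = [record.get("title") for record in parsed]
data["body"] = [record.get("body") for record in parsed]

print(data[["is_news", "title", "body"]])
```

Using `.get` rather than indexing is a defensive choice here, since nothing guarantees that every page has both a title and a body.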
2. What are 'evergreen' sites?
These are websites that are always relevant, like recipes or reviews (as opposed to current events).
Stored as a binary indicator in the `label` column.
Look at some examples.
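One way to look at examples is to compare the class balance and then read a few titles from each class. The sketch below uses four made-up rows; the titles and labels are hypothetical, and a `title` column is assumed to have been extracted already.

```python
import pandas as pd

# Made-up rows standing in for the loaded data (titles and labels are hypothetical).
data = pd.DataFrame({
    "title": [
        "Classic banana bread",
        "Stock market closes lower",
        "How to tie a tie",
        "Storm expected this weekend",
    ],
    "label": [1, 0, 1, 0],
})

# Share of evergreen (1) versus non-evergreen (0) pages.
label_share = data["label"].value_counts(normalize=True)
print(label_share)

# A few titles from each class, to get a feel for what the label means.
evergreen_titles = data.loc[data["label"] == 1, "title"].head(3).tolist()
other_titles = data.loc[data["label"] == 0, "title"].head(3).tolist()
print("evergreen:", evergreen_titles)
print("non-evergreen:", other_titles)
```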
3. Does being a news site affect green-ness?
3.A Investigate with plots/EDA.
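For two binary columns, a cross-tabulation and a per-group mean are often enough for a first look. The sketch below uses ten made-up rows, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Ten made-up rows; the values are hypothetical.
data = pd.DataFrame({
    "is_news": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "label":   [1, 0, 0, 1, 1, 1, 0, 1, 0, 1],
})

# Counts of each (is_news, label) combination.
counts = pd.crosstab(data["is_news"], data["label"])
print(counts)

# Because label is 0/1, its mean within a group is the evergreen rate.
evergreen_rate = data.groupby("is_news")["label"].mean()
print(evergreen_rate)
```

If matplotlib is installed, `evergreen_rate.plot(kind="bar")` turns the second table into a quick bar chart.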
3.B Test the hypothesis with a logistic regression using statsmodels.
Hint: The `sm.logit` function from `statsmodels.formula.api` will perform a logistic regression using a formula string.
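A minimal sketch of the formula interface follows. It fits on simulated data, not the StumbleUpon file: the sample size and the "true" coefficients used to generate the data are arbitrary assumptions, chosen only so the fit has something to recover. The `sm` alias follows the hint.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated toy data. The generating coefficients (-0.3 and 0.8) are
# hypothetical and unrelated to the real dataset.
rng = np.random.default_rng(0)
n = 2000
is_news = rng.integers(0, 2, size=n)
true_logit = -0.3 + 0.8 * is_news
label = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
data = pd.DataFrame({"is_news": is_news, "label": label})

# "label ~ is_news" reads as: model label using is_news plus an intercept.
model = sm.logit("label ~ is_news", data=data).fit(disp=0)  # disp=0 hides optimizer output

print(model.summary())
print(model.params)   # coefficients on the log-odds scale
print(model.pvalues)  # p-values for each coefficient
```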
3.C Interpret the results of your model.
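Logistic regression coefficients are on the log-odds scale, so interpretation usually means converting them. The arithmetic below uses hypothetical coefficient values, picked only for illustration and not taken from any fitted model.

```python
import math

# Hypothetical fitted values, chosen only to illustrate the arithmetic.
intercept = -0.30
coef_is_news = 0.80


def logistic(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# exp(coefficient) is the odds ratio: the factor by which the odds of
# being evergreen are multiplied when is_news goes from 0 to 1.
odds_ratio = math.exp(coef_is_news)

# Predicted probabilities for the two groups.
p_not_news = logistic(intercept)
p_news = logistic(intercept + coef_is_news)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"P(evergreen | is_news=0): {p_not_news:.3f}")
print(f"P(evergreen | is_news=1): {p_news:.3f}")
```

Note that the odds ratio is constant, while the change in probability depends on the baseline set by the intercept.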
4. Does the website category affect green-ness?
4.A Investigate with plots/EDA.
4.B Test the hypothesis with a logistic regression.
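For a categorical predictor, wrapping the column in `C()` inside the formula asks for dummy coding with one reference level. The sketch below uses simulated data; the three category names and their evergreen rates are made up and are not claims about the real `alchemy_category` values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated toy data. Category names and rates are hypothetical.
rng = np.random.default_rng(1)
n = 1500
alchemy_category = rng.choice(["business", "recreation", "sports"], size=n)
rate_by_category = {"business": 0.3, "recreation": 0.6, "sports": 0.4}
p = pd.Series(alchemy_category).map(rate_by_category).to_numpy()
data = pd.DataFrame({
    "alchemy_category": alchemy_category,
    "label": rng.binomial(1, p),
})

# C(...) dummy-codes the column. The alphabetically first level
# ("business" here) becomes the reference absorbed into the intercept.
model = sm.logit("label ~ C(alchemy_category)", data=data).fit(disp=0)
print(model.params)
```

Each non-intercept coefficient is then a log-odds difference relative to the reference category, so the choice of reference changes the numbers but not the fitted probabilities.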
4.C Interpret the model results.
5. Does the image ratio affect green-ness?
5.A Investigate with plots/EDA.
5.B Test the hypothesis using a logistic regression.
Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.
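The note does not say why the raw value may be a poor input. One plausible reason, assumed here, is a heavily skewed distribution with a few extreme values. Under that assumption, two common re-expressions are a percentile rank and quantile bins. The sketch uses a simulated skewed column, not the real `image_ratio`.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed values standing in for image_ratio (hypothetical).
rng = np.random.default_rng(2)
data = pd.DataFrame({"image_ratio": rng.lognormal(mean=-2.0, sigma=1.5, size=1000)})

# Option 1: percentile rank in (0, 1]. Keeps the ordering but removes the
# influence of how extreme the largest values are.
data["image_ratio_pctl"] = data["image_ratio"].rank(pct=True)

# Option 2: quartile bins coded 0 to 3, usable as a categorical predictor.
data["image_ratio_bin"] = pd.qcut(data["image_ratio"], q=4, labels=False)

print(data.describe())
print(data["image_ratio_bin"].value_counts().sort_index())
```

Either new column can then go into the formula, for example `label ~ image_ratio_pctl` or `label ~ C(image_ratio_bin)`. Before transforming the real column, it is worth checking whether it contains sentinel values that need handling first.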
5.C Interpret the model.
6. Fit a logistic regression with multiple predictors.
The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.
Do any EDA you may need.
Interpret the coefficients of the model.
Tip: This pdf is very useful for an overview of interpreting logistic regression coefficients.
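A multi-predictor fit only needs more terms in the formula string. The sketch below uses simulated data with two predictors; the generating coefficients and the `image_ratio_pctl` column (a hypothetical percentile-transformed image ratio) are assumptions for illustration, and the choice of predictors on the real data remains yours.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated toy data. The generating coefficients are hypothetical.
rng = np.random.default_rng(3)
n = 3000
data = pd.DataFrame({
    "is_news": rng.integers(0, 2, size=n),
    "image_ratio_pctl": rng.uniform(0, 1, size=n),
})
true_logit = 0.2 + 0.5 * data["is_news"] - 1.0 * data["image_ratio_pctl"]
data["label"] = rng.binomial(1, (1 / (1 + np.exp(-true_logit))).to_numpy())

# Predictors are joined with "+" in the formula.
model = sm.logit("label ~ is_news + image_ratio_pctl", data=data).fit(disp=0)

# Exponentiate coefficients and confidence bounds to get odds ratios.
ci = model.conf_int()
table = pd.DataFrame({
    "odds_ratio": np.exp(model.params),
    "ci_low": np.exp(ci[0]),
    "ci_high": np.exp(ci[1]),
})
print(table)
```

Each odds ratio is read holding the other predictors fixed. A confidence interval that excludes 1 corresponds to a coefficient that differs from 0 at the matching significance level.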