GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/solution-code/predicting_evergreen_sites-lab-solutions.ipynb
Kernel: Python 2

Predicting "Greenness" Of Content

Authors: Joseph Nelson (DC), Kiefer Katovich (SF)


This dataset comes from StumbleUpon, a web page recommender, and was made publicly available.

A description of the columns is below

| FieldName | Type | Description |
|-----------|------|-------------|
| url | string | Url of the webpage to be classified |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonLinkRatio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with a url containing the domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`s' text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in a hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in the url |
| parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json

%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300
```

1. Load the data

  • Note it is a .tsv file and uses a tab separator instead of a comma.

  • Clean the is_news column.

  • Make two new columns, title and body, from the boilerplate column.

Note: The boilerplate column is in JSON dictionary format. You can use the json.loads() function from the json module to convert it into a Python dictionary.
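As a minimal illustration of that conversion (the string below is a made-up stand-in for one boilerplate entry; `title` and `body` are the keys used in the solution code):

```python
import json

# a boilerplate-style JSON string with title and body keys
raw = '{"title": "Easy Pancakes", "body": "Mix flour, eggs and milk..."}'

record = json.loads(raw)          # str -> dict
title = record.get('title', '')  # .get() falls back to '' if the key is missing
body = record.get('body', '')

print(title)  # Easy Pancakes
```

Using `.get()` with a default avoids a `KeyError` on entries whose JSON is missing one of the keys.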

```python
evergreen_tsv = '../../data/evergreen_sites.tsv'
data = pd.read_csv(evergreen_tsv, sep='\t', na_values={'is_news': '?'}).fillna(0)

# Extract the title and body from the boilerplate JSON text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
```

2. What are 'evergreen' sites?

  • These are websites whose content stays relevant over time, like recipes or reviews (as opposed to current events).

  • Stored as a binary indicator in the label column.

  • Look at some examples.

```python
data[['title', 'label']].head()
```

3. Does being a news site affect green-ness?

3.A Investigate with plots/EDA.

```python
print(data.groupby('is_news')[['label']].mean())
sns.factorplot(x='is_news', y='label', data=data, kind='bar')
```
```
            label
is_news
0.0      0.507562
1.0      0.516916
```
Image in a Jupyter notebook

3.B Test the hypothesis with a logistic regression using statsmodels.

Hint: The sm.logit function from statsmodels.formula.api will perform a logistic regression using a formula string.

```python
import statsmodels.formula.api as sm

news_data = data[['label', 'is_news']]
news_model = sm.logit("label ~ is_news", data=news_data).fit()
news_model.summary()
```
Optimization terminated successfully. Current function value: 0.692751 Iterations 3

3.C Interpret the results of your model.

```python
# The effect of being a news site on evergreen status is insignificant.
# More formally, we fail to reject the null hypothesis that news sites and
# non-news sites have equal probability of being evergreen.
```

4. Does the website category affect green-ness?

4.A Investigate with plots/EDA.

```python
# ? and unknown should be the same category:
data['alchemy_category'] = data.alchemy_category.map(lambda x: 'unknown' if x == '?' else x)
```
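An equivalent, arguably more idiomatic way to do this cleanup is pandas' `Series.replace`, sketched here on a toy Series:

```python
import pandas as pd

cats = pd.Series(['health', '?', 'sports', '?'])

# replace the '?' placeholder with 'unknown'; same result as the map/lambda above
cleaned = cats.replace('?', 'unknown')
print(cleaned.tolist())  # ['health', 'unknown', 'sports', 'unknown']
```

`replace` also accepts a dict, which scales better if several placeholder values need recoding.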
```python
print(data.groupby('alchemy_category')[['label']].mean())
sns.factorplot(x='alchemy_category', y='label', data=data, kind='bar',
               aspect=3).set_xticklabels(rotation=45, horizontalalignment='right')
```
```
                       label
alchemy_category
arts_entertainment  0.371945
business            0.711364
computer_internet   0.246622
culture_politics    0.457726
gaming              0.368421
health              0.573123
law_crime           0.419355
recreation          0.684296
religion            0.416667
science_technology  0.456747
sports              0.205263
unknown             0.501704
weather             0.000000
```
Image in a Jupyter notebook

4.B Test the hypothesis with a logistic regression.

```python
cat_model = sm.logit("label ~ C(alchemy_category, Treatment(reference='unknown'))",
                     data=data).fit()
cat_model.summary()
```
Warning: Maximum number of iterations has been exceeded. Current function value: 0.649499 Iterations: 35
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)

4.C Interpret the model results.

```python
# Many of the categories appear to have a significant effect on the likelihood
# of evergreen status. Note that I have set the reference category to be unknown,
# which is wrapped into the intercept term. The categories must be interpreted as
# significantly different from unknown or not.
# Positive predictors of evergreen vs. unknown:
#  1. Business
#  2. Health
#  3. Recreation
# Negative predictors of evergreen vs. unknown:
#  1. Arts and entertainment
#  2. Computer and internet
#  3. Gaming
#  4. Sports
# The rest of the categories are not significantly different from the unknown
# category in their probability of being evergreen or not.
```

5. Does the image ratio affect green-ness?

5.A Investigate with plots/EDA.

```python
sns.distplot(data.image_ratio, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# qcut can divide things up by quantile - in this case into 5 bins
data['image_ratio_qbinned'] = pd.qcut(data['image_ratio'], 5)
sns.factorplot('image_ratio_qbinned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
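For contrast with `qcut`: `pd.cut` makes equal-width bins while `pd.qcut` makes equal-frequency (quantile) bins. A sketch on ten toy values with one outlier makes the difference visible:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

# qcut: 5 quantile bins -> every bin holds the same number of observations
qbins = pd.qcut(s, 5)
print(qbins.value_counts(sort=False).tolist())  # [2, 2, 2, 2, 2]

# cut: 5 equal-width bins -> the outlier stretches the range, so counts are lopsided
bins = pd.cut(s, 5)
print(bins.value_counts(sort=False).tolist())   # [9, 0, 0, 0, 1]
```

With a skewed variable like image_ratio, quantile bins keep every bar in the factorplot backed by the same amount of data.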

5.B Test the hypothesis using a logistic regression.

Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.

```python
# a model using image ratio alone (ignoring the apparent nonlinear
# effect and skewed distribution):
image_model = sm.logit("label ~ image_ratio", data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.692631 Iterations 5
```python
# convert the image ratio to percentiles (this is what qcut is representing in bins):
# you can use scipy.stats.percentileofscore for this:
from scipy import stats

data['image_ratio_pctl'] = data.image_ratio.map(
    lambda x: stats.percentileofscore(data.image_ratio.values, x))
```
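As an aside, a vectorized alternative to the `map` above (which recomputes against the whole array for every row) is pandas' `rank(pct=True)`; it should agree with `percentileofscore`'s default average-rank convention, up to scaling. A sketch on toy data:

```python
import pandas as pd

vals = pd.Series([10, 20, 20, 30])

# average rank as a percentage of n; ties share the average of their ranks
pctl = vals.rank(pct=True) * 100
print(pctl.tolist())  # [25.0, 62.5, 62.5, 100.0]
```

On a dataset of several thousand rows this is noticeably faster than the per-element `percentileofscore` call.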
```python
sns.distplot(data.image_ratio_pctl, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# use image_ratio_pctl instead
# this is still ignoring the nonlinearity we see in the plot above!
image_model = sm.logit("label ~ image_ratio_pctl", data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.692458 Iterations 3
```python
# Fit a model with the percentile and the percentile squared (quadratic effect).
# This will let us model that inverse parabola.
# Note: statsmodels formulas can take numpy functions!
image_model = sm.logit("label ~ image_ratio_pctl + np.power(image_ratio_pctl, 2)",
                       data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.686094 Iterations 4

5.C Interpret the model.

```python
# Once it's modeled well (convert the image ratio to percentiles and include
# a quadratic term) we can see these significant effects:
# 1. There is a positive effect of the image ratio percentile score (its rank
#    across image_ratios).
# 2. There is a negative quadratic effect of image ratio. That is to say, at
#    a certain point the squared term of image_ratio_pctl overtakes the linear
#    term. The highest probability of evergreen sites have image ratios in the
#    median range.
```

6. Fit a logistic regression with multiple predictors.

  • The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.

  • Do any EDA you may need.

  • Interpret the coefficients of the model.

Tip: This pdf is very useful for an overview of interpreting logistic regression coefficients.
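The core identity behind those interpretations: in a logistic model the odds equal exp(b0 + b1*x), so a one-unit increase in x multiplies the odds by exp(b1). A numeric check with made-up coefficients (b0 and b1 below are purely illustrative):

```python
import math

# hypothetical coefficients for illustration
b0, b1 = -0.5, 0.3

def odds(x):
    # odds = p / (1 - p) = exp(b0 + b1 * x) for a logistic model
    return math.exp(b0 + b1 * x)

ratio = odds(4) / odds(3)   # odds ratio for a one-unit increase in x
print(round(ratio, 6))      # equals exp(b1), regardless of the starting x
print(round(math.exp(b1), 6))
```

This is why exponentiated coefficients (as computed at the end of this notebook) read directly as multiplicative changes in the odds.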

```python
# look at the distribution of html_ratio
sns.distplot(data.html_ratio, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# cut divides things up into equal-width bins - in this case into 5 bins
data['html_ratio_binned'] = pd.cut(data['html_ratio'], 5)
sns.factorplot('html_ratio_binned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
# qcut divides things up into quantile bins - in this case into 5 bins
data['html_ratio_qbinned'] = pd.qcut(data['html_ratio'], 5)
sns.factorplot('html_ratio_qbinned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
data['html_ratio_pctl'] = data.html_ratio.map(
    lambda x: stats.percentileofscore(data.html_ratio.values, x))
```
```python
# You can see scipy puts percentiles from 0-100: important for interpreting coefs
data.html_ratio_pctl.head()
```
```
0    63.029074
1    26.747803
2    46.085193
3    78.417850
4    48.275862
Name: html_ratio_pctl, dtype: float64
```
```python
def title_len(x):
    try:
        return len(x.split())
    except AttributeError:  # non-string titles (the filled-in zeros)
        return 0.

# calculate the number of words in the title and plot the distribution
data['title_words'] = data.title.map(title_len)
sns.distplot(data.title_words, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
data['title_words_binned'] = pd.cut(data['title_words'], 10)
sns.factorplot('title_words_binned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
# Build a model with the image ratio percentile, html ratio, and title length
f = '''
label ~ image_ratio_pctl +
        np.power(image_ratio_pctl, 2) +
        html_ratio_pctl +
        title_words
'''
model = sm.logit(f, data=data).fit()
model.summary()
```
Optimization terminated successfully. Current function value: 0.667797 Iterations 5
```python
# exponentiate the coefficients to get the odds ratios:
np.exp(model.params)
```
```
Intercept                        1.470349
image_ratio_pctl                 1.038888
np.power(image_ratio_pctl, 2)    0.999585
html_ratio_pctl                  0.992222
title_words                      0.956410
dtype: float64
```
```python
# All of the predictors have significant effects here.
# Interpret the exponentiated coefficients as odds ratios:
# 1. For a 1-percentile increase in image_ratio, the odds of evergreen are
#    multiplied by ~1.039 (an increase).
# 2. For a 1-unit increase in image_ratio_pctl**2, the odds are multiplied
#    by ~0.9996 (a decrease).
# 3. For a 1-percentile increase in html_ratio, the odds are multiplied by
#    ~0.992 (a decrease).
# 4. For each additional word in the title, the odds are multiplied by
#    ~0.956 (a decrease).
```
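The quadratic term also tells us where the fitted evergreen probability peaks. Recovering the raw coefficients from the printed odds ratios above (so the numbers are approximate), the log-odds contribution b1*x + b2*x² is maximized at x = -b1 / (2*b2):

```python
import numpy as np

# coefficients recovered from the printed odds ratios above (approximate)
b1 = np.log(1.038888)   # image_ratio_pctl
b2 = np.log(0.999585)   # np.power(image_ratio_pctl, 2)

# vertex of the parabola b1*x + b2*x**2 (b2 < 0, so it is a maximum)
peak = -b1 / (2 * b2)
print(round(peak))  # 46 -> highest odds near the 46th percentile
```

This agrees with the earlier interpretation that the highest-probability evergreen sites have image ratios in the median range.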