GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/solution-code/predicting_evergreen_sites-lab-solutions.ipynb
Kernel: Python 2

Predicting "Greenness" Of Content

Authors: Joseph Nelson (DC), Kiefer Katovich (SF)


This dataset comes from StumbleUpon, a web page recommender, and was made publicly available.

A description of the columns is below

| FieldName | Type | Description |
|-----------|------|-------------|
| url | string | Url of the webpage to be classified |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonLinkRatio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` with a url containing the domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>`s' text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in a hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in the url |
| parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json

%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300
```

1. Load the data

  • Note it is a .tsv file and uses a tab separator instead of a comma.

  • Clean the is_news column.

  • Make two new columns, title and body, from the boilerplate column.

Note: The boilerplate column is in JSON dictionary format. You can use the json.loads() function from the json module to convert it into a Python dictionary.
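As a minimal illustration of that conversion (the string below is a made-up stand-in for one boilerplate entry; `title` and `body` are the keys used in the solution code):

```python
import json

# a boilerplate-style JSON string with title and body keys
raw = '{"title": "Easy Pancakes", "body": "Mix flour, eggs and milk..."}'

record = json.loads(raw)          # str -> dict
title = record.get('title', '')  # .get() falls back to '' if the key is missing
body = record.get('body', '')

print(title)  # Easy Pancakes
```

Using `.get()` with a default avoids a `KeyError` on entries whose JSON is missing one of the keys.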

```python
evergreen_tsv = '../../data/evergreen_sites.tsv'
data = pd.read_csv(evergreen_tsv, sep='\t', na_values={'is_news': '?'}).fillna(0)

# Extract the title and body from the boilerplate JSON text
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
```

2. What are 'evergreen' sites?

  • These are websites whose content stays relevant over time, like recipes or reviews (as opposed to current events).

  • Stored as a binary indicator in the label column.

  • Look at some examples.

```python
data[['title', 'label']].head()
```

3. Does being a news site affect green-ness?

3.A Investigate with plots/EDA.

```python
print(data.groupby('is_news')[['label']].mean())
sns.factorplot(x='is_news', y='label', data=data, kind='bar')
```
```
            label
is_news
0.0      0.507562
1.0      0.516916
```
Image in a Jupyter notebook

3.B Test the hypothesis with a logistic regression using statsmodels.

Hint: The sm.logit function from statsmodels.formula.api will perform a logistic regression using a formula string.

```python
import statsmodels.formula.api as sm

news_data = data[['label', 'is_news']]
news_model = sm.logit("label ~ is_news", data=news_data).fit()
news_model.summary()
```
Optimization terminated successfully. Current function value: 0.692751 Iterations 3

3.C Interpret the results of your model.

```python
# The effect of being a news site on evergreen status is insignificant.
# More formally, we fail to reject the null hypothesis that news sites and
# non-news sites have equal probability of being evergreen.
```

4. Does the website category affect green-ness?

4.A Investigate with plots/EDA.

```python
# ? and unknown should be the same category:
data['alchemy_category'] = data.alchemy_category.map(lambda x: 'unknown' if x == '?' else x)
```
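An equivalent, arguably more idiomatic way to do this cleanup is pandas' `Series.replace`, sketched here on a toy Series:

```python
import pandas as pd

cats = pd.Series(['health', '?', 'sports', '?'])

# replace the '?' placeholder with 'unknown'; same result as the map/lambda above
cleaned = cats.replace('?', 'unknown')
print(cleaned.tolist())  # ['health', 'unknown', 'sports', 'unknown']
```

`replace` also accepts a dict, which scales better if several placeholder values need recoding.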
```python
print(data.groupby('alchemy_category')[['label']].mean())
sns.factorplot(x='alchemy_category', y='label', data=data, kind='bar',
               aspect=3).set_xticklabels(rotation=45, horizontalalignment='right')
```
```
                       label
alchemy_category
arts_entertainment  0.371945
business            0.711364
computer_internet   0.246622
culture_politics    0.457726
gaming              0.368421
health              0.573123
law_crime           0.419355
recreation          0.684296
religion            0.416667
science_technology  0.456747
sports              0.205263
unknown             0.501704
weather             0.000000
```
Image in a Jupyter notebook

4.B Test the hypothesis with a logistic regression.

```python
cat_model = sm.logit("label ~ C(alchemy_category, Treatment(reference='unknown'))",
                     data=data).fit()
cat_model.summary()
```
Warning: Maximum number of iterations has been exceeded. Current function value: 0.649499 Iterations: 35
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)

4.C Interpret the model results.

```python
# Many of the categories appear to have a significant effect on the likelihood
# of evergreen status. Note that I have set the reference category to be unknown,
# which is wrapped into the intercept term. The categories must be interpreted as
# significantly different from unknown or not.
# Positive predictors of evergreen vs. unknown:
#  1. Business
#  2. Health
#  3. Recreation
# Negative predictors of evergreen vs. unknown:
#  1. Arts and entertainment
#  2. Computer and internet
#  3. Gaming
#  4. Sports
# The rest of the categories are not significantly different from the unknown
# category in their probability of being evergreen or not.
```

5. Does the image ratio affect green-ness?

5.A Investigate with plots/EDA.

```python
sns.distplot(data.image_ratio, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# qcut can divide things up by quantile - in this case into 5 bins
data['image_ratio_qbinned'] = pd.qcut(data['image_ratio'], 5)
sns.factorplot('image_ratio_qbinned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
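For contrast with `qcut`: `pd.cut` makes equal-width bins while `pd.qcut` makes equal-frequency (quantile) bins. A sketch on ten toy values with one outlier makes the difference visible:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

# qcut: 5 quantile bins -> every bin holds the same number of observations
qbins = pd.qcut(s, 5)
print(qbins.value_counts(sort=False).tolist())  # [2, 2, 2, 2, 2]

# cut: 5 equal-width bins -> the outlier stretches the range, so counts are lopsided
bins = pd.cut(s, 5)
print(bins.value_counts(sort=False).tolist())   # [9, 0, 0, 0, 1]
```

With a skewed variable like image_ratio, quantile bins keep every bar in the factorplot backed by the same amount of data.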

5.B Test the hypothesis using a logistic regression.

Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.

```python
# a model using image ratio alone (ignoring the apparent nonlinear
# effect and skewed distribution):
image_model = sm.logit("label ~ image_ratio", data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.692631 Iterations 5
```python
# convert the image ratio to percentiles (this is what qcut is representing in bins):
# you can use scipy.stats.percentileofscore for this:
from scipy import stats

data['image_ratio_pctl'] = data.image_ratio.map(
    lambda x: stats.percentileofscore(data.image_ratio.values, x))
```
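As an aside, a vectorized alternative to the `map` above (which recomputes against the whole array for every row) is pandas' `rank(pct=True)`; it should agree with `percentileofscore`'s default average-rank convention, up to scaling. A sketch on toy data:

```python
import pandas as pd

vals = pd.Series([10, 20, 20, 30])

# average rank as a percentage of n; ties share the average of their ranks
pctl = vals.rank(pct=True) * 100
print(pctl.tolist())  # [25.0, 62.5, 62.5, 100.0]
```

On a dataset of several thousand rows this is noticeably faster than the per-element `percentileofscore` call.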
```python
sns.distplot(data.image_ratio_pctl, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# use image_ratio_pctl instead
# this is still ignoring the nonlinearity we see in the plot above!
image_model = sm.logit("label ~ image_ratio_pctl", data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.692458 Iterations 3
```python
# Fit a model with the percentile and the percentile squared (quadratic effect).
# This will let us model that inverse parabola.
# Note: statsmodels formulas can take numpy functions!
image_model = sm.logit("label ~ image_ratio_pctl + np.power(image_ratio_pctl, 2)",
                       data=data).fit()
image_model.summary()
```
Optimization terminated successfully. Current function value: 0.686094 Iterations 4

5.C Interpret the model.

```python
# Once it's modeled well (convert the image ratio to percentiles and include
# a quadratic term) we can see these significant effects:
# 1. There is a positive effect of the image ratio percentile score (its rank
#    across image_ratios).
# 2. There is a negative quadratic effect of image ratio. That is to say, at
#    a certain point the squared term of image_ratio_pctl overtakes the linear
#    term. The highest probability of evergreen sites have image ratios in the
#    median range.
```

6. Fit a logistic regression with multiple predictors.

  • The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.

  • Do any EDA you may need.

  • Interpret the coefficients of the model.

Tip: This pdf is very useful for an overview of interpreting logistic regression coefficients.
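The core identity behind those interpretations: in a logistic model the odds equal exp(b0 + b1*x), so a one-unit increase in x multiplies the odds by exp(b1). A numeric check with made-up coefficients (b0 and b1 below are purely illustrative):

```python
import math

# hypothetical coefficients for illustration
b0, b1 = -0.5, 0.3

def odds(x):
    # odds = p / (1 - p) = exp(b0 + b1 * x) for a logistic model
    return math.exp(b0 + b1 * x)

ratio = odds(4) / odds(3)   # odds ratio for a one-unit increase in x
print(round(ratio, 6))      # equals exp(b1), regardless of the starting x
print(round(math.exp(b1), 6))
```

This is why exponentiated coefficients (as computed at the end of this notebook) read directly as multiplicative changes in the odds.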

```python
# look at the distribution of html_ratio
sns.distplot(data.html_ratio, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
# cut divides things up into equal-width bins - in this case into 5 bins
data['html_ratio_binned'] = pd.cut(data['html_ratio'], 5)
sns.factorplot('html_ratio_binned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
# qcut divides things up into quantile bins - in this case into 5 bins
data['html_ratio_qbinned'] = pd.qcut(data['html_ratio'], 5)
sns.factorplot('html_ratio_qbinned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
data['html_ratio_pctl'] = data.html_ratio.map(
    lambda x: stats.percentileofscore(data.html_ratio.values, x))
```
```python
# You can see scipy puts percentiles from 0-100: important for interpreting coefs
data.html_ratio_pctl.head()
```
```
0    63.029074
1    26.747803
2    46.085193
3    78.417850
4    48.275862
Name: html_ratio_pctl, dtype: float64
```
```python
def title_len(x):
    try:
        return len(x.split())
    except AttributeError:  # non-string titles (the filled-in zeros)
        return 0.

# calculate the number of words in the title and plot the distribution
data['title_words'] = data.title.map(title_len)
sns.distplot(data.title_words, bins=30, kde=False)
```
Image in a Jupyter notebook
```python
data['title_words_binned'] = pd.cut(data['title_words'], 10)
sns.factorplot('title_words_binned', 'label', data=data,
               aspect=2).set_xticklabels(rotation=45, horizontalalignment='right')
```
Image in a Jupyter notebook
```python
# Build a model with the image ratio percentile, html ratio, and title length
f = '''
label ~ image_ratio_pctl +
        np.power(image_ratio_pctl, 2) +
        html_ratio_pctl +
        title_words
'''
model = sm.logit(f, data=data).fit()
model.summary()
```
Optimization terminated successfully. Current function value: 0.667797 Iterations 5
```python
# exponentiate the coefficients to get the odds ratios:
np.exp(model.params)
```
```
Intercept                        1.470349
image_ratio_pctl                 1.038888
np.power(image_ratio_pctl, 2)    0.999585
html_ratio_pctl                  0.992222
title_words                      0.956410
dtype: float64
```
```python
# All of the predictors have significant effects here.
# Interpret the exponentiated coefficients as odds ratios:
# 1. For a 1-percentile increase in image_ratio, the odds of evergreen are
#    multiplied by ~1.039 (an increase).
# 2. For a 1-unit increase in image_ratio_pctl**2, the odds are multiplied
#    by ~0.9996 (a decrease).
# 3. For a 1-percentile increase in html_ratio, the odds are multiplied by
#    ~0.992 (a decrease).
# 4. For each additional word in the title, the odds are multiplied by
#    ~0.956 (a decrease).
```
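The quadratic term also tells us where the fitted evergreen probability peaks. Recovering the raw coefficients from the printed odds ratios above (so the numbers are approximate), the log-odds contribution b1*x + b2*x² is maximized at x = -b1 / (2*b2):

```python
import numpy as np

# coefficients recovered from the printed odds ratios above (approximate)
b1 = np.log(1.038888)   # image_ratio_pctl
b2 = np.log(0.999585)   # np.power(image_ratio_pctl, 2)

# vertex of the parabola b1*x + b2*x**2 (b2 < 0, so it is a maximum)
peak = -b1 / (2 * b2)
print(round(peak))  # 46 -> highest odds near the 46th percentile
```

This agrees with the earlier interpretation that the highest-probability evergreen sites have image ratios in the median range.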