GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/predicting_evergreen_sites-lab.ipynb

Kernel: Python 3

Predicting "Greenness" Of Content

Authors: Joseph Nelson (DC), Kiefer Katovich (SF)

This dataset comes from StumbleUpon, a web page recommender, and was made available here.

A description of the columns is below:

FieldName	Type	Description
url	string	Url of the webpage to be classified
urlid	integer	StumbleUpon's unique identifier for each url
boilerplate	json	Boilerplate text
alchemy_category	string	Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com)
alchemy_category_score	double	Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com)
avglinksize	double	Average number of words in each link
commonLinkRatio_1	double	# of links sharing at least 1 word with 1 other link / # of links
commonLinkRatio_2	double	# of links sharing at least 1 word with 2 other links / # of links
commonLinkRatio_3	double	# of links sharing at least 1 word with 3 other links / # of links
commonLinkRatio_4	double	# of links sharing at least 1 word with 4 other links / # of links
compression_ratio	double	Compression achieved on this page via gzip (measure of redundancy)
embed_ratio	double	Count of number of <embed> usage
frameBased	integer (0 or 1)	A page is frame-based (1) if it has no body markup but has a frameset markup
frameTagRatio	double	Ratio of iframe markups over total number of markups
hasDomainLink	integer (0 or 1)	True (1) if it contains an
html_ratio	double	Ratio of tags vs text in the page
image_ratio	double	Ratio of <img> tags vs text in the page
is_news	integer (0 or 1)	True (1) if StumbleUpon's news classifier determines that this webpage is news
lengthyLinkDomain	integer (0 or 1)	True (1) if at least 3
linkwordscore	double	Percentage of words on the page that are in hyperlink's text
news_front_page	integer (0 or 1)	True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
non_markup_alphanum_characters	integer	Page's text's number of alphanumeric characters
numberOfLinks	integer	Number of <a> markups
numwords_in_url	double	Number of words in url
parametrizedLinkRatio	double	A link is parametrized if its URL contains parameters or it has an attached onClick event
spelling_errors_ratio	double	Ratio of words not found in wiki (considered to be a spelling mistake)
label	integer (0 or 1)	User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300

1. Load the data

Note that it is a .tsv file, so fields are separated by tabs rather than commas.
Clean the is_news column.
Make two new columns, title and body, from the boilerplate column.

Note: The boilerplate column is in JSON dictionary format. You can use the json.loads() function from the json module to convert each entry into a Python dictionary.
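The parsing step described in the note can be sketched on a tiny hand-made frame rather than the real file. The sample rows and the helper name get_field are hypothetical; only the title and body keys come from the task above. The sketch uses dict.get so that a row without the key yields a missing value instead of raising a KeyError.

```python
import json
import pandas as pd

# Hypothetical stand-in for the boilerplate column (not real rows from the dataset)
toy = pd.DataFrame({
    'boilerplate': [
        '{"title": "Easy banana bread", "body": "Mash the bananas", "url": "example com"}',
        '{"title": "Match report", "body": null}',
        '{"body": "No title here"}',
    ]
})

def get_field(raw, key):
    """Parse one JSON string and return the value stored under key, or None if absent."""
    return json.loads(raw).get(key)

# Apply the parser row by row to build the two new columns
toy['title'] = toy['boilerplate'].apply(lambda raw: get_field(raw, 'title'))
toy['body'] = toy['boilerplate'].apply(lambda raw: get_field(raw, 'body'))

print(toy[['title', 'body']])
```

On the real data the same pattern would be applied to df['boilerplate']; parsing each string twice is wasteful on large frames, so parsing once into a temporary column of dictionaries is a reasonable alternative.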

In [2]:

df = pd.read_table('../data/evergreen_sites.tsv')

In [3]:

# A: 
df['is_news']

Out[3]:

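0       1
1       1
2       1
3       1
4       1
5       ?
6       1
7       ?
8       1
9       ?
10      1
11      ?
12      1
13      ?
14      ?
15      ?
16      1
17      1
18      1
19      1
20      1
21      ?
22      ?
23      1
24      1
25      1
26      1
27      ?
28      1
29      ?
       ..
7365    ?
7366    ?
7367    ?
7368    1
7369    ?
7370    ?
7371    ?
7372    1
7373    1
7374    1
7375    1
7376    ?
7377    1
7378    ?
7379    1
7380    ?
7381    ?
7382    1
7383    1
7384    ?
7385    ?
7386    ?
7387    1
7388    1
7389    ?
7390    1
7391    1
7392    ?
7393    1
7394    ?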
Name: is_news, Length: 7395, dtype: object

In [4]:

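# '?' marks a missing value; regex=False treats it as a literal character, not a regex quantifier
df['is_news'] = df['is_news'].str.replace('?', '0', regex=False).astype(int)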

2. What are 'evergreen' sites?

These are websites that are always relevant, like recipes or reviews (as opposed to current events).
Stored as a binary indicator in the label column.
Look at some examples.

In [6]:

# A:
df['label'].head(10)

Out[6]:

0    0
1    1
2    1
3    1
4    0
5    0
6    1
7    0
8    1
9    1
Name: label, dtype: int64

3. Does being a news site affect green-ness?

3.A Investigate with plots/EDA.
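One simple EDA view for two binary columns is a row-normalized cross-tabulation, which shows the share of each label within each is_news group. The sketch below uses a hypothetical toy frame, not the lab data.

```python
import pandas as pd

# Hypothetical toy data; the real lab uses df['is_news'] and df['label']
toy = pd.DataFrame({
    'is_news': [1, 1, 1, 1, 0, 0, 0, 0],
    'label':   [1, 0, 1, 0, 1, 1, 0, 1],
})

# normalize='index' divides each row by its row total,
# so each row shows the share of label 0 and label 1 within that is_news group
shares = pd.crosstab(toy['is_news'], toy['label'], normalize='index')
print(shares)
```

If the rows of the normalized table look nearly identical on the real data, that would suggest is_news carries little information about the label.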

In [7]:

ndf = df[['is_news', 'label']]

In [8]:

# A:
ndf.corr()

Out[8]:

In [9]:

pd.crosstab(df['is_news'], df['label'], margins=True)

Out[9]:

3.B Test the hypothesis with a logistic regression using statsmodels.

Hint: The sm.logit function from statsmodels.formula.api will perform a logistic regression using a formula string.

In [10]:

import statsmodels.formula.api as sm

In [11]:

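from scipy import stats
# Compatibility shim: older statsmodels releases call scipy.stats.chisqprob,
# which newer SciPy versions no longer provide.
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

import statsmodels.formula.api as smf

# Build the model from the formula, fit it, and display the summary table
model = smf.logit('label ~ is_news', data=df)
result = model.fit()
result.summary()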

Out[11]:

Optimization terminated successfully.
         Current function value: 0.692751
         Iterations 3

In [12]:

# A:
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()  # create the model object

# X must be 2-D: double brackets select a one-column DataFrame rather than a Series
X = df[['is_news']]
y = df['label']

logreg.fit(X, y)          # fit the model
pred = logreg.predict(X)  # class predictions on the training data

logreg.score(X, y)  # mean accuracy on the training data

Out[12]:

0.5133198106828939

3.C Interpret the results of your model.
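When interpreting a logistic regression, it helps to remember that coefficients live on the log-odds scale. The sketch below converts a coefficient into an odds ratio and into predicted probabilities. The intercept and coefficient values are hypothetical placeholders, not results from this dataset; substitute the values from your own fitted model.

```python
import math

# Hypothetical fitted values for label ~ is_news (NOT results from this dataset)
intercept = 0.10       # log-odds of evergreen when is_news == 0
coef_is_news = -0.20   # change in log-odds when is_news goes from 0 to 1

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# exp(coefficient) is the multiplicative change in the odds of label == 1
odds_ratio = math.exp(coef_is_news)

# Predicted probabilities for the two possible values of is_news
p_not_news = sigmoid(intercept)
p_news = sigmoid(intercept + coef_is_news)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"P(evergreen | not news): {p_not_news:.3f}")
print(f"P(evergreen | news):     {p_news:.3f}")
```

An odds ratio below 1 means the odds of being evergreen are lower for news pages; the p-value in the statsmodels summary indicates whether that difference is distinguishable from zero. An accuracy such as the 0.513 above is easiest to judge against the share of the majority class, for example via df['label'].value_counts(normalize=True).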

In [ ]:

# A:

4. Does the website category affect green-ness?

4.A Investigate with plots/EDA.
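A possible starting point for a categorical predictor is the evergreen rate and page count per category. The sketch below uses a hypothetical toy frame; the category names are illustrative only, and on the real data the grouping column would be df['alchemy_category'].

```python
import pandas as pd

# Hypothetical toy rows; category names here are illustrative only
toy = pd.DataFrame({
    'alchemy_category': ['recreation', 'recreation', 'sports', 'sports', 'sports', 'business'],
    'label':            [1, 1, 0, 0, 1, 0],
})

# Share of evergreen pages and number of pages per category
summary = (
    toy.groupby('alchemy_category')['label']
       .agg(evergreen_rate='mean', n_pages='count')
       .sort_values('evergreen_rate', ascending=False)
)
print(summary)
```

Keeping the count next to the rate guards against over-reading a category that has only a handful of pages.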

In [ ]:

# A:

4.B Test the hypothesis with a logistic regression.
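A categorical column has to be encoded numerically before it can enter a logistic regression. One common approach, sketched here with scikit-learn on hypothetical toy data, is one-hot encoding with one category dropped as the baseline. The category names and labels are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: three categories with different evergreen rates
toy = pd.DataFrame({
    'alchemy_category': ['recreation'] * 4 + ['sports'] * 4 + ['business'] * 4,
    'label': [1, 1, 1, 0,  0, 0, 0, 1,  1, 0, 1, 0],
})

# One-hot encode the category; drop_first=True leaves the first category
# (alphabetically, 'business') out as the baseline
X = pd.get_dummies(toy['alchemy_category'], drop_first=True, dtype=float)
y = toy['label']

logreg = LogisticRegression()
logreg.fit(X, y)

# Each coefficient is a change in log-odds relative to the dropped baseline category
coefs = pd.Series(logreg.coef_[0], index=X.columns)
print(coefs)
```

With statsmodels, a formula such as 'label ~ C(alchemy_category)' performs an equivalent encoding and also reports standard errors and p-values.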

In [ ]:

# A:

4.C Interpret the model results.

In [ ]:

# A:

5. Does the image ratio affect green-ness?

5.A Investigate with plots/EDA.

In [ ]:

# A:

5.B Test the hypothesis using a logistic regression.

Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.
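A few candidate representations for a skewed ratio can be sketched as follows. The values are hypothetical and chosen only to be heavily right-skewed; inspect df['image_ratio'].describe() before picking a transform, since a log transform is not appropriate if negative or sentinel values appear.

```python
import numpy as np
import pandas as pd

# Hypothetical, heavily right-skewed ratios (NOT values from the dataset)
toy = pd.DataFrame({
    'image_ratio': [0.0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 10.0, 100.0]
})

# Option 1: compress the long right tail; log1p(x) = log(1 + x) is defined at 0
toy['image_ratio_log'] = np.log1p(toy['image_ratio'])

# Option 2: replace raw values by their rank-based percentile in (0, 1]
toy['image_ratio_pctl'] = toy['image_ratio'].rank(pct=True)

# Option 3: cut into equal-sized bins and use the integer bin code
toy['image_ratio_bin'] = pd.qcut(toy['image_ratio'], q=5, labels=False)

print(toy)
```

Each option limits the influence of a few extreme pages on the fitted coefficient; the percentile and bin versions also change how the coefficient is read (per percentile or per bin rather than per unit of ratio).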

In [ ]:

# A:

5.C Interpret the model.

In [ ]:

# A:

6. Fit a logistic regression with multiple predictors.

The choice of predictors is up to you. Test features you think may be valuable to predict evergreen status.
Do any EDA you may need.
Interpret the coefficients of the model.

Tip: This PDF is very useful for an overview of interpreting logistic regression coefficients.
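A multi-predictor fit might look like the sketch below. It runs on synthetic data: the column names are borrowed from the table at the top, but the values and their relationship to the label are made up, so the printed numbers say nothing about the real dataset. Features are standardized first so that each coefficient is a change in log-odds per standard deviation, which makes the coefficients roughly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (assumption: values and effects are invented)
toy = pd.DataFrame({
    'is_news': rng.integers(0, 2, size=n),
    'numberOfLinks': rng.poisson(100, size=n),
    'html_ratio': rng.uniform(0.1, 0.5, size=n),
})
true_log_odds = 0.5 - 1.0 * toy['is_news'] + 0.02 * (toy['numberOfLinks'] - 100)
prob = 1 / (1 + np.exp(-true_log_odds.to_numpy()))
toy['label'] = rng.binomial(1, prob)

feature_cols = ['is_news', 'numberOfLinks', 'html_ratio']

# Standardize so every feature has mean 0 and standard deviation 1
X = StandardScaler().fit_transform(toy[feature_cols])
y = toy['label']

logreg = LogisticRegression()
logreg.fit(X, y)

# Coefficients on the log-odds scale and their odds-ratio equivalents
report = pd.DataFrame({
    'coef': logreg.coef_[0],
    'odds_ratio': np.exp(logreg.coef_[0]),
}, index=feature_cols)
print(report)
```

Note that scikit-learn's LogisticRegression applies L2 regularization by default, which shrinks coefficients toward zero; the statsmodels fit used earlier does not, so the two will generally report slightly different values.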

In [ ]:

# A: