GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_08/code/predicting_evergreen_sites-lab.ipynb
Kernel: Python 3

Predicting "Greenness" Of Content

Authors: Joseph Nelson (DC), Kiefer Katovich (SF)


This dataset comes from StumbleUpon, a web page recommender, and was made available here.

A description of the columns is below.

| FieldName | Type | Description |
|---|---|---|
| url | string | URL of the webpage to be classified |
| urlid | integer | StumbleUpon's unique identifier for each URL |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonLinkRatio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonLinkRatio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonLinkRatio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonLinkRatio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (measure of redundancy) |
| embed_ratio | double | Count of number of `<embed>` usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of `<img>` tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 |
| linkwordscore | double | Percentage of words on the page that are in hyperlink text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in the URL |
| parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json

%matplotlib inline

# set max printout options for pandas:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 300

1. Load the data

  • Note that it is a .tsv file, so the separator is a tab instead of a comma.

  • Clean the is_news column.

  • Make two new columns, title and body, from the boilerplate column.

Note: The boilerplate column is in JSON dictionary format. You can use the json.loads() function from the json module to convert each entry into a Python dictionary.

df = pd.read_table('../data/evergreen_sites.tsv')
# A:
df['is_news']

0       1
1       1
2       1
3       1
4       1
5       ?
6       1
7       ?
8       1
9       ?
10      1
11      ?
12      1
13      ?
14      ?
15      ?
16      1
17      1
18      1
19      1
20      1
21      ?
22      ?
23      1
24      1
25      1
26      1
27      ?
28      1
29      ?
       ..
7365    ?
7366    ?
7367    ?
7368    1
7369    ?
7370    ?
7371    ?
7372    1
7373    1
7374    1
7375    1
7376    ?
7377    1
7378    ?
7379    1
7380    ?
7381    ?
7382    1
7383    1
7384    ?
7385    ?
7386    ?
7387    1
7388    1
7389    ?
7390    1
7391    1
7392    ?
7393    1
7394    ?
Name: is_news, Length: 7395, dtype: object

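# '?' marks an unknown value; recode it as 0. regex=False makes pandas match
# the '?' literally instead of treating it as a regular-expression operator.
df['is_news'] = df['is_news'].str.replace('?', '0', regex=False).astype(int)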
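
The remaining step from the list above, splitting boilerplate into title and body columns, might look like the following. This is a minimal sketch on an invented two-row DataFrame, assuming each boilerplate entry is a JSON object that may contain "title" and "body" keys; the helper name extract_field is hypothetical, and a missing key is returned as None.

```python
import json

import pandas as pd

# Toy stand-in for the real data; both rows are invented for illustration.
toy = pd.DataFrame({
    "boilerplate": [
        '{"title": "Easy bread recipe", "body": "Mix flour and water."}',
        '{"title": "Match report", "url": "example com"}',  # no "body" key
    ]
})


def extract_field(raw, key):
    """Return the value stored under `key` in a JSON string, or None if absent."""
    return json.loads(raw).get(key)


toy["title"] = toy["boilerplate"].apply(lambda raw: extract_field(raw, "title"))
toy["body"] = toy["boilerplate"].apply(lambda raw: extract_field(raw, "body"))
print(toy[["title", "body"]])
```

Using dict.get rather than indexing means a row without one of the keys produces a missing value instead of raising a KeyError.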

2. What are 'evergreen' sites?

  • These are websites that are always relevant, like recipes or reviews (as opposed to current events).

  • Stored as a binary indicator in the label column.

  • Look at some examples.

# A:
df['label'].head(10)

0    0
1    1
2    1
3    1
4    0
5    0
6    1
7    0
8    1
9    1
Name: label, dtype: int64

3. Does being a news site affect green-ness?

3.A Investigate with plots/EDA.

ndf = df[['is_news', 'label']]
# A:
ndf.corr()
pd.crosstab(df['is_news'], df['label'], margins=True)
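
Raw counts can be hard to compare when the two is_news groups differ in size. The sketch below, on invented toy rows rather than the real data, turns the crosstab into row proportions so that each row shows the share of evergreen pages within that group.

```python
import pandas as pd

# Invented toy data standing in for df[['is_news', 'label']].
toy = pd.DataFrame({
    "is_news": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "label":   [1, 0, 0, 1, 1, 1, 0, 1, 0, 1],
})

# normalize="index" divides each row by its row total.
rates = pd.crosstab(toy["is_news"], toy["label"], normalize="index")
print(rates)

# The same quantity via groupby: the mean of a 0/1 label is the evergreen share.
evergreen_share = toy.groupby("is_news")["label"].mean()
print(evergreen_share)
```

If the two shares are close, is_news on its own is unlikely to separate evergreen from non-evergreen pages.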

3.B Test the hypothesis with a logistic regression using statsmodels.

Hint: The sm.logit function from statsmodels.formula.api will perform a logistic regression using a formula string.

import statsmodels.formula.api as sm
from scipy import stats
import statsmodels.formula.api as smf

# Compatibility shim: some older statsmodels releases call stats.chisqprob,
# which newer SciPy versions no longer provide.
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

model = smf.logit('label ~ is_news', data=df)
result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.692751
         Iterations 3
# A:
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()  # create the estimator
X = df[['is_news']]  # double brackets keep X as a 2-D DataFrame, which scikit-learn expects
y = df['label']  # target vector
logreg.fit(X, y)  # fit the model
pred = logreg.predict(X)  # predicted classes for the training data
logreg.score(X, y)  # mean accuracy on the training data

0.5133198106828939

3.C Interpret the results of your model.

# A:
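
One way to read a single-predictor logit fit: the coefficient on a 0/1 predictor is the log of the odds ratio between the two groups, so exponentiating it gives the multiplicative change in the odds of being evergreen. For a model with only an intercept and one binary predictor, that odds ratio equals the one computed directly from the 2x2 table of counts. The sketch below uses invented counts, not this lab's results, and the function name odds_ratio_from_counts is hypothetical.

```python
import math


def odds_ratio_from_counts(pos_1, neg_1, pos_0, neg_0):
    """Odds of a positive label in group 1 divided by the odds in group 0."""
    odds_group_1 = pos_1 / neg_1
    odds_group_0 = pos_0 / neg_0
    return odds_group_1 / odds_group_0


# Invented counts: 60 of 100 pages evergreen in group 1, 50 of 100 in group 0.
ratio = odds_ratio_from_counts(pos_1=60, neg_1=40, pos_0=50, neg_0=50)
coefficient = math.log(ratio)  # the log-odds scale used by a logit coefficient

print(f"odds ratio: {ratio:.3f}")
print(f"log-odds coefficient: {coefficient:.3f}")
```

An odds ratio near 1 (a coefficient near 0) means the predictor barely shifts the odds. It is also worth comparing the accuracy returned by score with the share of the majority class: a model that only matches that share has learned little from the predictor.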

4. Does the website category affect green-ness?

4.A Investigate with plots/EDA.

# A:

4.B Test the hypothesis with a logistic regression.

# A:
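
Because alchemy_category is a string column, it has to be encoded before it can enter a regression. The sketch below builds one-hot (dummy) columns from invented toy values; the category names are placeholders, and dropping the first level to act as the reference category is one common choice rather than the only one. With the statsmodels formula interface, wrapping the column as C(alchemy_category) in the formula string produces a comparable encoding.

```python
import pandas as pd

# Invented toy values; the real column has its own set of categories.
toy = pd.DataFrame({
    "alchemy_category": ["recreation", "business", "sports", "business", "recreation"],
    "label": [1, 0, 0, 1, 1],
})

# drop_first=True removes one level (the alphabetically first one) so the
# remaining columns are interpreted relative to that reference category.
dummies = pd.get_dummies(toy["alchemy_category"], prefix="cat", drop_first=True)
X = dummies.astype(int)
y = toy["label"]
print(X)
```

Each resulting coefficient then compares one category against the dropped reference level, not against the overall average.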

4.C Interpret the model results.

# A:

5. Does the image ratio affect green-ness?

5.A Investigate with plots/EDA.

# A:

5.B Test the hypothesis using a logistic regression.

Note: It is worth thinking about how to best represent this variable. It may not be wise to input the image ratio as-is.

# A:
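
One possible representation, sketched on invented numbers: cut the ratio into quantile-based bins so that a few extreme values do not dominate the fit. Whether binning, a log transform, or something else suits image_ratio best depends on what the EDA in 5.A shows; the bin count and bin labels here are assumptions.

```python
import pandas as pd

# Invented, skewed toy values standing in for df['image_ratio'].
toy = pd.DataFrame({
    "image_ratio": [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.20, 0.40, 2.50, 30.0],
})

# Four quantile bins with roughly equal numbers of rows in each.
toy["image_ratio_bin"] = pd.qcut(
    toy["image_ratio"], q=4, labels=["q1", "q2", "q3", "q4"]
)
print(toy["image_ratio_bin"].value_counts().sort_index())
```

The binned column is categorical, so it would need dummy encoding (or C(...) in a formula string) before being passed to a logistic regression.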

5.C Interpret the model.

# A:

6. Fit a logistic regression with multiple predictors.

  • The choice of predictors is up to you. Test features you think may be valuable for predicting evergreen status.

  • Do any EDA you may need.

  • Interpret the coefficients of the model.

Tip: This PDF is very useful for an overview of interpreting logistic regression coefficients.

# A:
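
A minimal sketch of the mechanics, run on synthetic data so that it is self-contained: the column names is_news and html_ratio are borrowed from the table above, but the values and the generating coefficients are invented and say nothing about the real dataset. It fits a two-predictor logit with the statsmodels formula interface and converts the coefficients to odds ratios.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, generated only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "is_news": rng.integers(0, 2, size=n),
    "html_ratio": rng.normal(size=n),
})
log_odds = (-0.5 + 1.0 * toy["is_news"] + 0.8 * toy["html_ratio"]).to_numpy()
toy["label"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Each term after the ~ is a predictor; the intercept is added automatically.
result = smf.logit("label ~ is_news + html_ratio", data=toy).fit(disp=0)

# exp(coefficient) is the multiplicative change in the odds of label == 1
# for a one-unit increase in that predictor, holding the others fixed.
odds_ratios = np.exp(result.params)
print(odds_ratios)
```

Odds ratios above 1 raise the odds of an evergreen label and those below 1 lower them; result.summary() adds standard errors and p-values for judging whether each effect is distinguishable from zero.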