YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_11/code/starter-code/starter-code-12.ipynb
Kernel: Python [conda env:Anaconda3]

Predicting Evergreeness of Content with Decision Trees and Random Forests

## DATA DICTIONARY
import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t')

# Extract the title and body text from the boilerplate JSON column
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))
data.head()

Predicting "Greenness" Of Content

This dataset comes from StumbleUpon, a web page recommender. A description of the columns is below:

| FieldName | Type | Description |
|---|---|---|
| url | string | URL of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of the article |
| urlid | integer | StumbleUpon's unique identifier for each URL |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (a measure of redundancy) |
| embed_ratio | double | Count of `<embed>` tag usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups to total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an `<a>` whose URL contains the page's domain |
| html_ratio | double | Ratio of tags vs. text in the page |
| image_ratio | double | Ratio of `<img>` tags vs. text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 `<a>` tags' text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in a hyperlink's text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of `<a>` markups |
| numwords_in_url | double | Number of words in the URL |
| parametrizedLinkRatio | double | A link is parametrized if its URL contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be spelling mistakes) |
| label | integer (0 or 1) | User-determined label: evergreen (1) or non-evergreen (0); available for train.tsv only |

What are 'evergreen' sites?

Evergreen sites are those that are always relevant. As opposed to breaking news or current events, evergreen websites are relevant no matter the time or season.

A sample of URLs is below; rows with label = 1 are 'evergreen' websites.

data[['url', 'label']].head()

Exercises to Get Started

Exercise: 1. In a group: Brainstorm 3 - 5 features you could develop that would be useful for predicting evergreen websites.

Exercise: 2. After looking at the dataset, can you model or quantify any of the characteristics you wanted?

  • I.E. If you believe high-image content websites are likely to be evergreen, how can you build a feature that represents that?

  • I.E. If you believe weather content is likely NOT to be evergreen, how might you build a feature that represents that?

Split up and develop 1-3 of those features independently.
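Characteristics like these can usually be expressed as simple binary columns. Here is one possible sketch on a toy DataFrame — the column names match the dataset, but the rows, the 0.1 cutoff, and the keyword list are illustrative assumptions, not values from stumbleupon.tsv:

```python
import pandas as pd

# Toy rows standing in for the real stumbleupon.tsv data
df = pd.DataFrame({
    'image_ratio': [0.01, 0.30, 0.12],
    'body': ['rain forecast today', 'cake recipe', 'weather update'],
})

# Feature: flag image-heavy pages (the 0.1 cutoff is an arbitrary illustration)
df['high_image'] = (df['image_ratio'] > 0.1).astype(int)

# Feature: flag weather-related body text via a keyword match
df['mentions_weather'] = df['body'].str.contains(
    'weather|forecast', case=False, na=False).astype(int)
```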

Exercise: 3. Does being a news site affect evergreeness?

Compute or plot the percentage of news related evergreen sites.

# ... #
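One way to answer this is to group by `is_news` and take the mean of `label` (the mean of a 0/1 column is the evergreen rate). A sketch on toy data — the real notebook would call this on `data` instead:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative
df = pd.DataFrame({
    'is_news': [1, 1, 1, 0, 0, 0],
    'label':   [1, 0, 0, 1, 1, 0],
})

# Evergreen rate among news vs. non-news pages
evergreen_rate = df.groupby('is_news')['label'].mean()
print(evergreen_rate)
```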

Exercise: 4. Does category in general affect evergreeness?

Plot the rate of evergreen sites for all Alchemy categories.

# ... #
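The same groupby-mean trick extends to `alchemy_category`. A minimal sketch, again on stand-in data rather than the real `data` frame; the commented-out `.plot` line shows one way to turn the rates into a bar chart:

```python
import pandas as pd

# Toy stand-in rows; the category names are illustrative
df = pd.DataFrame({
    'alchemy_category': ['recreation', 'business', 'recreation', 'sports'],
    'label': [1, 0, 1, 0],
})

# Evergreen rate per category, highest first
rate_by_category = df.groupby('alchemy_category')['label'].mean().sort_values(ascending=False)
# rate_by_category.plot(kind='bar')  # bar chart of evergreen rate per category
print(rate_by_category)
```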

Exercise: 5. How many articles are there per category?

# ... #
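`value_counts` gives the per-category article counts directly; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in rows for the alchemy_category column
df = pd.DataFrame({
    'alchemy_category': ['recreation', 'business', 'recreation', 'sports', 'recreation'],
})

# Article count per category, sorted descending by default
counts = df['alchemy_category'].value_counts()
print(counts)
```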

Let's try extracting some of the text content.

Exercise: 6. Create a feature for the title containing 'recipe'.

Is the % of evergreen websites higher or lower on pages that have 'recipe' in the title?

# ... #
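A case-insensitive substring match on the title is one way to build this feature; `na=False` treats missing titles as non-matches. The toy rows below are illustrative stand-ins for `data`:

```python
import pandas as pd

# Toy stand-in rows; titles and labels are illustrative
df = pd.DataFrame({
    'title': ['Easy Pancake Recipe', 'Election Results', 'Best Soup Recipes', None],
    'label': [1, 0, 1, 0],
})

# Binary feature: title mentions 'recipe' (case-insensitive; missing titles count as False)
df['has_recipe'] = df['title'].str.contains('recipe', case=False, na=False).astype(int)

# Evergreen rate with and without 'recipe' in the title
print(df.groupby('has_recipe')['label'].mean())
</```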

Let's Explore Some Decision Trees

Demo: Build a decision tree model to predict the "evergreeness" of a given website.

data.columns
Index(['url', 'urlid', 'boilerplate', 'alchemy_category', 'alchemy_category_score', 'avglinksize', 'commonlinkratio_1', 'commonlinkratio_2', 'commonlinkratio_3', 'commonlinkratio_4', 'compression_ratio', 'embed_ratio', 'framebased', 'frameTagRatio', 'hasDomainLink', 'html_ratio', 'image_ratio', 'is_news', 'lengthyLinkDomain', 'linkwordscore', 'news_front_page', 'non_markup_alphanum_characters', 'numberOfLinks', 'numwords_in_url', 'parametrizedLinkRatio', 'spelling_errors_ratio', 'label', 'title', 'body'], dtype='object')
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

X = data[['image_ratio', 'html_ratio', 'label']].dropna()
y = X['label']
X.drop('label', axis=1, inplace=True)

# Fits the model
model.fit(X, y)

# Helper function to visualize Decision Trees (creates a file tree.png)
from sklearn.tree import export_graphviz
from os import system

def build_tree_image(model):
    # Write the Graphviz source to tree.dot, then render it to tree.png
    dotfile = open("tree.dot", 'w')
    export_graphviz(model, out_file=dotfile, feature_names=X.columns)
    dotfile.close()
    system("dot -Tpng tree.dot -o tree.png")

build_tree_image(model)

Decision Trees in scikit-learn

Exercise: Evaluate the decision tree using cross-validation; use AUC as the evaluation metric.

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

# ... #
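One possible shape for the answer, sketched on synthetic data so it runs standalone — the real exercise would pass the `X` and `y` built from `data` above instead of the random arrays here:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the two-feature X and the 0/1 label y
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

# 5-fold cross-validation, scored by area under the ROC curve
model = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print(scores.mean())
```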

Adjusting Decision Trees to Avoid Overfitting

Demo: Control for overfitting in the decision model by adjusting the maximum number of questions (max_depth) or the minimum number of records in each final node (min_samples_leaf)

model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5)
model.fit(X, y)
build_tree_image(model)

Demo: Build a random forest model to predict the evergreeness of a website.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=20)
model.fit(X, y)

Demo: Extracting importance of features

features = X.columns
feature_importances = model.feature_importances_

features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})

# DataFrame.sort was removed in pandas 0.20; use sort_values instead
features_df.sort_values('Importance Score', inplace=True, ascending=False)
features_df.head()

Exercise: Evaluate the Random Forest model using cross-validation; increase the number of estimators and view how that improves predictive performance.

# ... #
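A sketch of the comparison, again on synthetic stand-in data; the real exercise would reuse `X` and `y` from above, and the `n_estimators` values tried here are an arbitrary choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data so the sketch runs standalone
rng = np.random.RandomState(1)
X = rng.rand(300, 2)
y = (X[:, 0] + 0.1 * rng.randn(300) > 0.5).astype(int)

# Mean cross-validated AUC for increasing forest sizes
results = {}
for n in [5, 20, 50]:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    results[n] = cross_val_score(model, X, y, scoring='roc_auc', cv=5).mean()

print(results)
```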

Independent Practice: Evaluate Random Forest Using Cross-Validation

  1. Continue adding input variables to the model that you think may be relevant

  2. For each feature:

     • Evaluate the model for improved predictive performance using cross-validation

     • Evaluate the importance of the feature

  3. Bonus: Just like the 'recipe' feature, add in similar text features and evaluate their performance.

# ... #