import pandas as pd
import json

data = pd.read_csv("../../assets/dataset/stumbleupon.tsv", sep='\t')

# The boilerplate column is a JSON string; pull the title and body fields out of it
data['title'] = data.boilerplate.map(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.map(lambda x: json.loads(x).get('body', ''))

data.head()

Predicting "Greenness" Of Content

This dataset comes from StumbleUpon, a web page recommender.

A description of the columns is below

| Field Name | Type | Description |
|---|---|---|
| url | string | URL of the webpage to be classified |
| title | string | Title of the article |
| body | string | Body text of the article |
| urlid | integer | StumbleUpon's unique identifier for each url |
| boilerplate | json | Boilerplate text |
| alchemy_category | string | Alchemy category (per the publicly available Alchemy API found at www.alchemyapi.com) |
| alchemy_category_score | double | Alchemy category score (per the publicly available Alchemy API found at www.alchemyapi.com) |
| avglinksize | double | Average number of words in each link |
| commonlinkratio_1 | double | # of links sharing at least 1 word with 1 other link / # of links |
| commonlinkratio_2 | double | # of links sharing at least 1 word with 2 other links / # of links |
| commonlinkratio_3 | double | # of links sharing at least 1 word with 3 other links / # of links |
| commonlinkratio_4 | double | # of links sharing at least 1 word with 4 other links / # of links |
| compression_ratio | double | Compression achieved on this page via gzip (a measure of redundancy) |
| embed_ratio | double | Count of <embed> usage |
| frameBased | integer (0 or 1) | A page is frame-based (1) if it has no body markup but has a frameset markup |
| frameTagRatio | double | Ratio of iframe markups over total number of markups |
| hasDomainLink | integer (0 or 1) | True (1) if it contains an <a> with a url with domain |
| html_ratio | double | Ratio of tags vs text in the page |
| image_ratio | double | Ratio of <img> tags vs text in the page |
| is_news | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is news |
| lengthyLinkDomain | integer (0 or 1) | True (1) if at least 3 <a>'s text contains more than 30 alphanumeric characters |
| linkwordscore | double | Percentage of words on the page that are in hyperlink text |
| news_front_page | integer (0 or 1) | True (1) if StumbleUpon's news classifier determines that this webpage is front-page news |
| non_markup_alphanum_characters | integer | Number of alphanumeric characters in the page's text |
| numberOfLinks | integer | Number of <a> markups |
| numwords_in_url | double | Number of words in the url |
| parametrizedLinkRatio | double | A link is parametrized if its url contains parameters or has an attached onClick event |
| spelling_errors_ratio | double | Ratio of words not found in wiki (considered to be a spelling mistake) |
| label | integer (0 or 1) | User-determined label. Either evergreen (1) or non-evergreen (0); available for train.tsv only |

Review: Use of the Count Vectorizer

We previously used the Count Vectorizer to extract text features for this classification task.

titles = data['title'].fillna('')

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000, ngram_range=(1, 2),
                             stop_words='english', binary=True)

# Use `fit` to learn the vocabulary of the titles
vectorizer.fit(titles)

# Use `transform` to generate the sample X word matrix - one column per feature (word or n-gram)
X = vectorizer.transform(titles)
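If you want a quick sanity check on what the vectorizer learned, you can inspect its vocabulary directly. The snippet below is a small optional check, not part of the original lesson; it uses get_feature_names_out, which is the method name in recent scikit-learn releases (older releases expose get_feature_names instead).

# Optional sanity check on the fitted vectorizer (not in the original notebook)
print(len(vectorizer.vocabulary_))               # number of distinct features kept
print(vectorizer.get_feature_names_out()[:10])   # first few words / bigrams
print(X.shape)                                   # one row per title, one column per feature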

Review: Build a model to predict evergreeness of a website

Then we used those features to build a classification model.

from sklearn.linear_model import LogisticRegression
# Note: sklearn.cross_validation has been removed; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score

# The liblinear solver is needed to use the l1 penalty
model = LogisticRegression(penalty='l1', solver='liblinear')

y = data['label']

scores = cross_val_score(model, X, y, scoring='roc_auc')
print('CV AUC {}, Average AUC {}'.format(scores, scores.mean()))
CV AUC [ 0.8153153 0.8257859 0.81604862], Average AUC 0.8190499392060214

Demo: Pipelines

Often we will want to combine these steps so that we can apply them to some future dataset. For that incoming, future dataset, we need to make sure we perform the exact same transformations on the data: if has_brownies_in_text is column 19 at training time, it needs to be column 19 at evaluation time.
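As a small illustration of that point (a sketch with made-up titles, not part of the lesson data): the trick is to fit the vectorizer once on the training text and then only call transform on anything that arrives later, since refitting on the new text would rebuild the vocabulary and shuffle the columns.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example titles, for illustration only
train_titles = ["chocolate brownies recipe", "stock market news today"]
future_titles = ["best brownies in town"]

vec = CountVectorizer()
vec.fit(train_titles)                    # learn the vocabulary on training data only
X_future = vec.transform(future_titles)  # same columns, in the same order as training

# Calling vec.fit(future_titles) here instead would rebuild the vocabulary,
# so a feature like "brownies" could end up in a different column.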

Pipelines combine all of the pre-processing steps and model building into a single object.

Rather than manually evaluating the transformers and then feeding their output into the model, pipelines tie these steps together. Like the models and vectorizers in scikit-learn, they are equipped with fit and predict (or predict_proba) methods just as any model would be, but they ensure that the proper data transformations are performed first.

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('features', vectorizer),
    ('model', model)
])
# Split the data into a training set
training_data = data[:6000]
X_train = training_data['title'].fillna('')
y_train = training_data['label']

# These rows stand in for data obtained in the future, unavailable at training time
X_new = data[6000:]['title'].fillna('')
# Fit the full pipeline
# This means we perform the steps laid out above:
# first we fit the vectorizer,
# and then feed the output of that into the fit function of the model
pipeline.fit(X_train, y_train)
Pipeline(steps=[('features', CountVectorizer(analyzer='word', binary=True, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=1000, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words='english', ...ty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
# Here again we apply the full pipeline for predictions
# The text is transformed automatically to match the features from the pipeline
pipeline.predict_proba(X_new)
array([[ 0.54498618, 0.45501382], [ 0.40242714, 0.59757286], [ 0.01265358, 0.98734642], ..., [ 0.29678077, 0.70321923], [ 0.61248369, 0.38751631], [ 0.63551779, 0.36448221]])
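Since the label column is available for every row of this training file, one reasonable follow-up (not shown in the original notebook) is to score the pipeline on the held-out rows, using the probability of the positive class:

from sklearn.metrics import roc_auc_score

y_new = data[6000:]['label']
probs = pipeline.predict_proba(X_new)[:, 1]   # probability of the evergreen (1) class
print('Held-out AUC: {:.3f}'.format(roc_auc_score(y_new, probs)))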

Exercise: Add a MaxAbsScaler scaling step to the pipeline as well; it should occur after the vectorization step.

#TODO
#TODO
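One possible solution sketch for the exercise (assuming the vectorizer and model objects defined earlier): MaxAbsScaler scales each feature by its maximum absolute value and works with the sparse matrices that CountVectorizer produces, so it can sit directly between the two existing steps.

from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import Pipeline

scaled_pipeline = Pipeline([
    ('features', vectorizer),     # vectorize the titles first
    ('scaler', MaxAbsScaler()),   # then scale each feature to [-1, 1]
    ('model', model)              # finally fit the classifier
])

scaled_pipeline.fit(X_train, y_train)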

Additionally, if we want to merge many different feature sets automatically, we can use FeatureUnion.
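As a rough sketch of how that could look here (the word/character split below is illustrative, not from the lesson), FeatureUnion fits several transformers on the same input and concatenates their outputs column-wise, and the union itself can be dropped into a Pipeline like any other step:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

union = FeatureUnion([
    ('word_counts', CountVectorizer(max_features=1000, ngram_range=(1, 2),
                                    stop_words='english', binary=True)),
    ('char_counts', CountVectorizer(analyzer='char_wb', ngram_range=(3, 5),
                                    max_features=1000))
])

union_pipeline = Pipeline([
    ('features', union),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])

union_pipeline.fit(X_train, y_train)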