GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_14/code/solution-code/L14-Solutions.ipynb
Kernel: Python 3

In [1]:

# Unicode Handling
from __future__ import unicode_literals
import codecs

import numpy as np
import gensim

# spacy is used for pre-processing and traditional NLP
import spacy

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec

In [2]:

# Loading the tweet data
filename = '../../assets/dataset/captured-tweets.txt'
tweets = []
for tweet in codecs.open(filename, 'r', encoding="utf-8"):
    tweets.append(tweet)
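# Setting up spacy. 'en' is a shortcut link from spaCy 1.x/2.x;
# spaCy 3+ needs a full model name such as 'en_core_web_sm'.
nlp_toolkit = spacy.load('en')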

Exercise 1a

Write a function that takes a sentence parsed by spacy and identifies whether it mentions a company named 'Google'. Remember, spacy finds entities and labels them ORG if they are a company. Look at the slides for class 13 if you need a hint:

Bonus (1b)

Parameterize the company name so that the function works for any company.

In [3]:

def mentions_company(parsed):
    for entity in parsed.ents:
        if entity.text == "Google" and entity.label_ == 'ORG':
            return True
    return False

# 1b

def mentions_company(parsed, company='Google'):
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    return False

Exercise 1c

Write a function that can take a sentence parsed by spacy and return the verbs of the sentence (preferably lemmatized)

In [4]:

def get_actions(parsed):
    actions = []
    for el in parsed:
        if el.pos == spacy.parts_of_speech.VERB:
            actions.append(el.text)
    return actions
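
The exercise asks for lemmatized verbs where possible, while the solution above collects the surface text of each verb. A minimal sketch of a lemmatized variant follows. It assumes spacy tokens expose `pos_` and `lemma_` (they do in the spacy versions I am aware of); `get_action_lemmas` and the stand-in tokens are hypothetical names used only so the sketch runs without a language model.

```python
from types import SimpleNamespace


def get_action_lemmas(parsed):
    """Return the lemma of every token tagged as a verb."""
    # pos_ is the string form of the coarse part-of-speech tag.
    return [tok.lemma_ for tok in parsed if tok.pos_ == 'VERB']


# Hypothetical stand-in tokens; a real call would pass nlp_toolkit(tweet).
fake_parsed = [
    SimpleNamespace(text='Google', pos_='PROPN', lemma_='Google'),
    SimpleNamespace(text='announced', pos_='VERB', lemma_='announce'),
    SimpleNamespace(text='partnership', pos_='NOUN', lemma_='partnership'),
]
print(get_action_lemmas(fake_parsed))
```

With lemmas, a check such as `'announce' in actions` would also match inflected forms like "announced", so it may return more tweets than the text-based version.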

Exercise 1d

For each tweet, parse it using spacy and print it out if the tweet mentions 'Google' and has 'release' or 'announce' as a verb. You'll need to use your mentions_company and get_actions functions.

In [5]:

for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    if mentions_company(parsed, 'Google'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print(tweet)

Out[5]:

Google and Ford to announce partnership on self-driving cars at CES - Fudzilla (blog) https://t.co/6woe56G22Q

Google and Ford to announce partnership on self-driving cars at CES - Fudzilla (blog) https://t.co/4hERVJ4zZK

Exercise 1e

Write a function that identifies countries - HINT: the entity label for countries is GPE (or GeoPolitical Entity)

In [6]:

def mentions_country(parsed, country):
    for entity in parsed.ents:
        if entity.text == country and entity.label_ == 'GPE':
            return True
    return False
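
mentions_company and mentions_country differ only in the entity label they check. As a sketch, the two could be folded into one helper that takes the label as a parameter. `mentions_entity` is a hypothetical name, and the stand-in objects below only mimic the `ents`, `text` and `label_` attributes of a parsed spacy document.

```python
from types import SimpleNamespace


def mentions_entity(parsed, name, label):
    """Return True if any entity in `parsed` has the given text and label."""
    return any(ent.text == name and ent.label_ == label
               for ent in parsed.ents)


# Hypothetical stand-in for a parsed tweet; a real call would pass nlp_toolkit(tweet).
fake_parsed = SimpleNamespace(ents=[
    SimpleNamespace(text='Iran', label_='GPE'),
    SimpleNamespace(text='Google', label_='ORG'),
])
print(mentions_entity(fake_parsed, 'Iran', 'GPE'))
print(mentions_entity(fake_parsed, 'Iran', 'ORG'))
```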

Exercise 1f

Re-run (d), this time finding tweets that mention the country 'Iran' and use 'announce' or 'release' as a verb.

In [7]:

for tweet in tweets:
    parsed = nlp_toolkit(tweet)

    if mentions_country(parsed, 'Iran'):
        actions = get_actions(parsed)
        if 'release' in actions or 'announce' in actions:
            print(tweet)

Out[7]:

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

GOBE! Iran warns Nigeria to release Shiite leader El-Zakzaky - SEE https://t.co/TRshnC6sVU

GOBE! Iran warns Nigeria to release Shiite leader El-Zakzaky - SEE https://t.co/SlvcQtk3vE

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

Hhmmm. Iran claiming to have 'warned Nigeria' to release detained Shiite leader.... @afalli

RT @cerenomri: "Literally every US ally in Mideast is on brink of hot war w/ Iran, so we're going to release $100 billion to Iran this mont…

Exercise 2

Build a word2vec model of the tweets we have collected using gensim. First take the collection of tweets and tokenize them using spacy.

Exercise 2a:

Think about how this should be done.
Should you only use upper-case or lower-case?
Should you remove punctuation or symbols?

In [8]:

text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_ 
                for x in nlp_toolkit(t)] for t in tweets]
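
The tokenization above keeps the original casing and all punctuation. One possible answer to the questions in 2a, sketched below, lower-cases every token and drops punctuation, whitespace and URLs. It relies on the spacy token attributes `is_punct`, `is_space` and `like_url`; `normalize_tokens` and `fake_token` are hypothetical helpers, and the stand-in tokens exist only so the sketch runs without a language model.

```python
from types import SimpleNamespace


def normalize_tokens(parsed):
    """Lower-case tokens and drop punctuation, whitespace and URLs."""
    return [tok.text.lower() for tok in parsed
            if not (tok.is_punct or tok.is_space or tok.like_url)]


def fake_token(text, is_punct=False, is_space=False, like_url=False):
    """Hypothetical stand-in exposing the token attributes used above."""
    return SimpleNamespace(text=text, is_punct=is_punct,
                           is_space=is_space, like_url=like_url)


fake_parsed = [
    fake_token('Iran'),
    fake_token('warns'),
    fake_token('Nigeria'),
    fake_token('!', is_punct=True),
    fake_token('https://example.com', like_url=True),
]
print(normalize_tokens(fake_parsed))
```

Lower-casing merges 'Google' and 'google' into one vocabulary entry, which helps with a small corpus, but any later lookup then has to use the lower-cased form (for example 'syria' rather than 'Syria').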

Exercise 2b:

Build a word2vec model. Test the window size as well: this is the maximum number of words on either side of a word that are used as its context when training. What do you think is appropriate for Twitter?

In [9]:

# `size` is the gensim < 4.0 name for the vector dimension (renamed `vector_size` in 4.0)
model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)
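
To make the window parameter concrete, the sketch below lists the (word, context word) pairs that a fixed window would produce for one tokenized tweet. It is a simplified illustration in plain Python, not gensim's training code: real implementations may also subsample frequent words and vary the effective window per word. `context_pairs` is a hypothetical helper.

```python
def context_pairs(tokens, window):
    """Pair each token with its neighbours at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ['Iran', 'warns', 'Nigeria', 'to', 'release', 'leader']
pairs = context_pairs(tokens, window=2)
# Context words seen for 'Nigeria' with a window of 2
print([context for word, context in pairs if word == 'Nigeria'])
```

Tweets are short, so a large window quickly spans the whole tweet; a small window keeps the context closer to the immediate phrase.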

Exercise 2c:

Test your word2vec model with a few similarity functions.

Find words similar to 'Syria'.
Find words similar to 'war'.
Find words similar to "Iran".
Find words similar to 'Verizon'.

In [10]:

model.wv.most_similar(positive=['Syria'])

Out[10]:

[('opposition', 0.99893718957901),
 ('benefit', 0.997267484664917),
 ('death', 0.9971266984939575),
 ('life', 0.9967310428619385),
 ('Put', 0.9967268109321594),
 ('Iranian', 0.9966708421707153),
 ('base', 0.9964544773101807),
 ('Bing', 0.9964326024055481),
 ('Russia', 0.9963169097900391),
 ('in', 0.996167004108429)]
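
The scores returned here are cosine similarities between word vectors. As a small worked sketch in plain Python (the vectors are made-up toy values, not taken from the model), cosine similarity is the dot product divided by the product of the vector lengths:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: the same direction scores about 1, orthogonal directions score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Scores that all sit very close to 1.0, even for words that look unrelated, may indicate that the vectors have not yet been pulled apart by training, which is worth keeping in mind when adjusting the parameters.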

Exercise 2d

Adjust the choices in (b) and (c) as necessary.

In [ ]:

Exercise 3

Filter tweets to those that mention 'Iran' or similar entities and 'war' or similar entities.

Do this using just spacy.
Do this using word2vec similarity scores.

In [11]:

# Using spacy
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    if mentions_country(parsed, 'Iran') or mentions_country(parsed, 'Iraq'): # ... you could add more
        if 'attack' in get_actions(parsed):
            print(tweet)

In [12]:

# Using word2vec similarity scores
for tweet in tweets[100:110]:
    parsed = nlp_toolkit(tweet)

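    # Only tokens in the model vocabulary have vectors; `default` covers tweets with none.
    # (`wv.vocab` is the gensim < 4.0 API; 4.0 uses `wv.key_to_index`.)
    similarity_to_iran = max((model.wv.similarity('Iran', tok.text) for tok in parsed if tok.text in model.wv.vocab), default=0.0)
    similarity_to_war = max((model.wv.similarity('war', tok.text) for tok in parsed if tok.text in model.wv.vocab), default=0.0)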
    if similarity_to_iran > 0.9 and similarity_to_war > 0.9:
        print("Similarity to Iran:",round(similarity_to_iran,3))
        print("Similarity to War:", round(similarity_to_war,3))
        print(tweet)
        print("---------------------------------------------------------------------------------------------------------")

Out[12]:

Similarity to Iran: 0.991
Similarity to War: 0.999
@MakingStarWars Ugh. The rights sign messes up the link. Google 'force awakens 70mm' and the IMAX link should send you there.

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.991
Similarity to War: 0.999
RT @dinowoowife: [MY] BTS 2ND MUSTER SLOGAN "두근건" BY VISUAL SHOCK @VShock1230 https://t.co/SpccZhS5mk https://t.co/kOiqxiSZYk

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.994
Similarity to War: 0.999
LebanonHashtag: RT probrandz: #Facebook Open Sources Artificial #Intelligence Servers Before #Google | Re/code https://t.co/VgWBxAG52q

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.992
Similarity to War: 0.993
#di…

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.994
Similarity to War: 0.999
RT @TheMoneyGenie: Why PayPal.Me Is So Infuriating (But Im Using It Anyway)  work from home https://t.co/JeP3833l2i https://t.co/RFJhUsiy7Z

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.995
Similarity to War: 0.999
Ford adds Apple CarPlay and Android Auto to Sync 3,... https://t.co/NSPCS9KHWE #google | https://t.co/MZBQSll3dP https://t.co/tHKibTMDlU

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.991
Similarity to War: 0.999
Best colour for zodiac sign in year 2016

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.991
Similarity to War: 0.993
https://t.co/WHqjhaywhq

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.991
Similarity to War: 0.999
@ShaffieWeru hi bro i need your help here is my new video..can you manage me bro https://t.co/NTs3QM5YU5

---------------------------------------------------------------------------------------------------------
Similarity to Iran: 0.992
Similarity to War: 0.998
Try TwitGrow for Twitter! Get 1000+ REAL followers, retweets and favorites! Google Play: https://t.co/v5qI6nbcZt https://t.co/s5wKVVeiCI

---------------------------------------------------------------------------------------------------------