GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-14/code/starter-code/L14-Starter-Code.ipynb
Kernel: Python [Root]
# Unicode Handling
from __future__ import unicode_literals
import codecs

import numpy as np
import gensim

# spacy is used for pre-processing and traditional NLP
import spacy
from spacy.en import English

# Gensim is used for LDA and word2vec
from gensim.models.word2vec import Word2Vec
# Loading the tweet data
filename = '../../assets/dataset/captured-tweets.txt'
tweets = []
for tweet in codecs.open(filename, 'r', encoding="utf-8"):
    tweets.append(tweet)

# Setting up spacy
nlp_toolkit = English()

Exercise 1a

Write a function that takes a sentence parsed by spacy and identifies whether it mentions a company named 'Google'. Remember, spacy can find entities and labels them as ORG if they are a company. Look at the slides for class 13 if you need a hint.

Bonus (1b)

Parameterize the company name so that the function works for any company.

def mentions_company(parsed):
    # Return True if the sentence contains an organization
    # and that organization is Google
    for entity in parsed.ents:
        # Fill in code here
        pass
    # Otherwise return False
    return False

# 1b
def mentions_company(parsed, company='Google'):
    # Your code here
    pass
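One possible completion of 1a/1b, assuming the spaCy API used in this notebook, where each entity exposes `.text` and `.label_`. The `Ent`/`Doc` namedtuples below are stand-ins so the function can be exercised without loading a language model:

```python
from collections import namedtuple

def mentions_company(parsed, company='Google'):
    """Return True if the parsed text contains the given company
    tagged as an organization (ORG)."""
    for entity in parsed.ents:
        if entity.text == company and entity.label_ == 'ORG':
            return True
    # No matching organization found
    return False

# Lightweight stand-ins for a spaCy doc and entity, used only to
# demonstrate the function without a parsed document.
Ent = namedtuple('Ent', ['text', 'label_'])
Doc = namedtuple('Doc', ['ents'])

doc = Doc(ents=[Ent('Google', 'ORG'), Ent('Iran', 'GPE')])
print(mentions_company(doc))             # True
print(mentions_company(doc, 'Verizon'))  # False
```

In the notebook you would call it on `nlp_toolkit(tweet)` rather than on a stand-in.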

Exercise 1c

Write a function that takes a sentence parsed by spacy and returns the verbs of the sentence (preferably lemmatized).

def get_actions(parsed):
    actions = []
    # Your code here
    return actions
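One way to fill this in, again assuming the spaCy token attributes `.pos_` and `.lemma_`; the `Tok` namedtuple is a stand-in for demonstration only:

```python
from collections import namedtuple

def get_actions(parsed):
    """Return the lemmatized verbs of a parsed sentence."""
    actions = [token.lemma_ for token in parsed if token.pos_ == 'VERB']
    return actions

# Stand-in token so the function can be demonstrated without a model.
Tok = namedtuple('Tok', ['text', 'pos_', 'lemma_'])

sent = [Tok('Google', 'PROPN', 'google'),
        Tok('released', 'VERB', 'release'),
        Tok('a', 'DET', 'a'),
        Tok('product', 'NOUN', 'product')]
print(get_actions(sent))  # ['release']
```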

Exercise 1d

For each tweet, parse it using spacy and print it out if the tweet has 'release' or 'announce' as a verb. You'll need to use your mentions_company and get_actions functions.

for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass
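A sketch of the 1d filter. Minimal inline versions of `mentions_company` and `get_actions` are repeated here so the snippet stands alone, and the namedtuple/`Doc` classes again substitute for a real spaCy parse:

```python
from collections import namedtuple

def mentions_company(parsed, company='Google'):
    return any(e.text == company and e.label_ == 'ORG' for e in parsed.ents)

def get_actions(parsed):
    return [t.lemma_ for t in parsed if t.pos_ == 'VERB']

def is_announcement(parsed, company='Google'):
    """True if the tweet mentions the company and uses
    'release' or 'announce' as a verb."""
    actions = get_actions(parsed)
    return (mentions_company(parsed, company)
            and ('release' in actions or 'announce' in actions))

# Demonstration with stand-in objects.
Tok = namedtuple('Tok', ['text', 'pos_', 'lemma_'])
Ent = namedtuple('Ent', ['text', 'label_'])

class Doc(list):
    """A list of tokens carrying an .ents attribute, mimicking a spaCy doc."""
    def __init__(self, tokens, ents):
        super().__init__(tokens)
        self.ents = ents

doc = Doc([Tok('Google', 'PROPN', 'google'),
           Tok('announced', 'VERB', 'announce')],
          ents=[Ent('Google', 'ORG')])
print(is_announcement(doc))  # True
```

In the real loop you would compute `parsed = nlp_toolkit(tweet)` and print the tweet whenever `is_announcement(parsed)` is true.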

Exercise 1e

Write a function that identifies countries. HINT: the entity label for countries is GPE (GeoPolitical Entity).

def mentions_country(parsed, country):
    pass
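A minimal version, mirroring `mentions_company` but checking the GPE label; the namedtuples are stand-ins for a parsed document:

```python
from collections import namedtuple

def mentions_country(parsed, country):
    """Return True if the parsed text mentions the given country,
    which spaCy tags as a geopolitical entity (GPE)."""
    return any(entity.text == country and entity.label_ == 'GPE'
               for entity in parsed.ents)

# Stand-in entities for demonstration.
Ent = namedtuple('Ent', ['text', 'label_'])
Doc = namedtuple('Doc', ['ents'])

doc = Doc(ents=[Ent('Iran', 'GPE'), Ent('Google', 'ORG')])
print(mentions_country(doc, 'Iran'))   # True
print(mentions_country(doc, 'Syria'))  # False
```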

Exercise 1f

Re-run (d), but this time find tweets that mention the country 'Iran' and have 'announce' or 'release' as a verb.

for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass

Exercise 2

Build a word2vec model of the tweets we have collected using gensim.

Exercise 2a:

First take the collection of tweets and tokenize them using spacy.

  • Think about how this should be done.

  • Should you keep the original casing or lower-case everything?

  • Should you remove punctuation or symbols?

text_split = [[x.text if x.pos != spacy.parts_of_speech.VERB else x.lemma_
               for x in nlp_toolkit(t)]
              for t in tweets]
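The comprehension above keeps each verb's lemma and every other token's surface form. If you instead decide to lower-case and drop punctuation (the questions in the bullets above), a plain stand-alone tokenizer sketch, using only a regular expression rather than spacy, might look like:

```python
import re

def simple_tokenize(text):
    """Lower-case and keep word-like tokens, dropping punctuation.
    Note: hashtags and @-mentions lose their leading symbol, which
    may or may not be what you want for tweets."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("Google announced a NEW phone!!!"))
# ['google', 'announced', 'a', 'new', 'phone']
```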

Exercise 2b:

Build a word2vec model. Experiment with the window size as well: this controls how many surrounding words are used as context when modeling a word. What do you think is appropriate for Twitter?

model = Word2Vec(text_split, size=100, window=4, min_count=5, workers=4)

Exercise 2c:

Test your word2vec model with a few similarity functions.

  • Find words similar to 'Syria'.

  • Find words similar to 'war'.

  • Find words similar to "Iran".

  • Find words similar to 'Verizon'.

model.most_similar(positive=['Syria'])

Exercise 2d

Adjust the choices / parameters in (b) and (c) as necessary.

Exercise 3

Filter tweets to those that mention 'Iran' or similar entities and 'war' or similar entities.

  • Do this using just spacy.

  • Do this using word2vec similarity scores.

# Using spacy
for tweet in tweets:
    parsed = nlp_toolkit(tweet)
    pass
# Using word2vec similarity scores
for tweet in tweets[:200]:
    parsed = nlp_toolkit(tweet)
    pass
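One way to combine the two approaches, sketched with hard-coded token sets. In practice you would seed these from `model.most_similar('Iran')` and `model.most_similar('war')`, or test each token with `model.wv.similarity(token, 'war')` against a threshold; the sets below are illustrative placeholders, not real model output:

```python
# Terms "similar to" Iran / war, e.g. gathered from most_similar() output.
# These sets are illustrative placeholders, not real model output.
IRAN_LIKE = {'iran', 'tehran', 'rouhani'}
WAR_LIKE = {'war', 'conflict', 'airstrikes'}

def about_iran_and_war(tokens):
    """True if a tokenized tweet touches both an Iran-like term
    and a war-like term."""
    toks = {t.lower() for t in tokens}
    return bool(toks & IRAN_LIKE) and bool(toks & WAR_LIKE)

print(about_iran_and_war(['Iran', 'nears', 'war']))         # True
print(about_iran_and_war(['Verizon', 'released', 'news']))  # False
```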