GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/pre-trained-embeddings/Pre-trained embeddings.ipynb
Kernel: Python 3

Using pre-trained embeddings and NLP corpora

Gensim has some really nice functionality: it lets you use pre-trained GloVe and Word2Vec embeddings directly through its downloader API. In addition, there are some reusable corpora that you can download and immediately use to train a Word2Vec model. The code snippets below show you how. The source of the embeddings and corpora is the gensim-data repository: https://github.com/RaRe-Technologies/gensim-data.
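If you want to see everything that gensim-data offers before downloading anything, the downloader API can list the available models and corpora. A minimal sketch (this cell is not part of the original notebook):

import gensim.downloader as api

# list what gensim-data offers: pre-trained models and downloadable corpora
info = api.info()
print(list(info["models"].keys()))
print(list(info["corpora"].keys()))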

I'll have to warn you that I'm not impressed with the quality of these pre-trained word embeddings. Either the dataset is noisy or it's just too general. More on that later.

import warnings
warnings.filterwarnings('ignore')

Imports

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

Pre-trained: Twitter GloVe Embeddings

This first step downloads the pre-trained embeddings and loads them for re-use. Note that these are GloVe embeddings built from tweets, as the name suggests. The vectors are based on 2B tweets and 27B tokens, with a 1.2M-word uncased vocabulary. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name refers to the dimensionality of the vectors.

# download the model and return as object ready for use
model_glove_twitter = api.load("glove-twitter-25")
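As a quick sanity check (not in the original notebook), you can confirm that the vectors are indeed 25-dimensional and see how large the pre-trained vocabulary is. This assumes the gensim 3.x API used throughout this notebook:

# the vectors should be 25-dimensional, matching the model name
print(model_glove_twitter.vector_size)

# number of words in the pre-trained vocabulary
print(len(model_glove_twitter.wv.vocab))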

Once you have loaded the pre-trained model, just use it as you would with any gensim word2vec model. Here are a few similarity examples:

model_glove_twitter.wv.most_similar("pelosi",topn=10)
[('clegg', 0.9653651118278503), ('miliband', 0.9515050053596497), ('bachmann', 0.9484400749206543), ('mcconnell', 0.9416399002075195), ('carney', 0.934025764465332), ('coulter', 0.9311323761940002), ('boehner', 0.9286302328109741), ('santorum', 0.9269059300422668), ('farage', 0.919365406036377), ('mourdock', 0.9186689853668213)]
model_glove_twitter.wv.most_similar("policies",topn=10)
[('policy', 0.9484813213348389), ('reforms', 0.9403933882713318), ('laws', 0.94012051820755), ('government', 0.923071026802063), ('regulations', 0.916893482208252), ('economy', 0.9110006093978882), ('immigration', 0.9105910062789917), ('legislation', 0.908964991569519), ('govt', 0.9054746627807617), ('regulation', 0.9050779342651367)]
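You can also try the usual analogy-style queries via most_similar's positive/negative arguments. A quick sketch; with only 25 dimensions and Twitter data the neighbors can be noisy, so treat the output as illustrative:

# analogy-style query: words close to "woman" + "king" - "man"
model_glove_twitter.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=5)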

Which of these words doesn't fit?

# what doesn't fit?
model_glove_twitter.wv.doesnt_match(["trump","bernie","obama","pelosi","orange"])
'orange'
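As I understand gensim's implementation, doesnt_match normalizes each word vector, averages them, and returns the word least similar to that mean. Here is a rough numpy sketch of the same idea (an approximation for illustration, not gensim's exact code):

import numpy as np

def rough_doesnt_match(model, words):
    # unit-normalize each word vector, then compare each word to the mean vector
    vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
    mean = vecs.mean(axis=0)
    sims = vecs.dot(mean)  # cosine-style similarity to the mean
    return words[int(np.argmin(sims))]

rough_doesnt_match(model_glove_twitter, ["trump", "bernie", "obama", "pelosi", "orange"])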

Word vectors for trump and obama

# show weight vector for trump and obama
model_glove_twitter["trump"], model_glove_twitter["obama"]
(array([-0.56174 , 0.69419 , 0.16733 , 0.055867, -0.26266 , -0.6303 , -0.28311 , -0.88244 , 0.57317 , -0.82376 , 0.46728 , 0.48607 , -2.1942 , -0.41972 , 0.31795 , -0.70063 , 0.060693, 0.45279 , 0.6564 , 0.20738 , 0.84496 , -0.087537, -0.38856 , -0.97028 , -0.40427 ], dtype=float32), array([ 0.77126 , 0.81259 , -0.5901 , -0.015908, -0.082797, -1.2261 , 0.098286, 0.087488, 0.012586, -0.35884 , 0.80733 , 0.12569 , -4.0522 , 0.14856 , 0.6988 , -0.78948 , -0.77125 , 0.49512 , 0.16366 , -0.9713 , 0.95064 , 0.19921 , -0.27903 , -1.6844 , -0.79424 ], dtype=float32))
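To make the similarity scores above concrete, the cosine similarity that gensim reports can be reproduced directly from these raw vectors with numpy. A minimal sketch (not in the original notebook):

import numpy as np

v1 = model_glove_twitter["trump"]
v2 = model_glove_twitter["obama"]

# cosine similarity = dot product of the unit-normalized vectors
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(model_glove_twitter.wv.similarity("trump", "obama"))  # should match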

Rank phrases by similarity

The goal here is, given a query phrase, to rank all the other phrases by semantic similarity (using the GloVe Twitter embeddings) and compare that with surface-level similarity using the Jaccard similarity index.

import pandas as pd

phrases = ["barrack obama", "barrack h. obama", "barrack hussein obama",
           "michelle obama", "donald trump", "melania trump"]
query = "barack hussain obama"

results_glove = []
results_jaccard = []

def compute_jaccard(t1, t2):
    intersect = [value for value in t1 if value in t2]
    union = []
    union.extend(t1)
    union.extend(t2)
    union = list(set(union))
    jaccard = len(intersect) / (len(union) + 0.01)
    return jaccard

for p in phrases:
    # keep only tokens that are in the embedding vocabulary
    tokens_1 = [t for t in p.split() if t in model_glove_twitter.wv.vocab]
    tokens_2 = [t for t in query.split() if t in model_glove_twitter.wv.vocab]

    # compute jaccard similarity
    jaccard = compute_jaccard(tokens_1, tokens_2)
    results_jaccard.append([p, jaccard])

    # compute cosine similarity using word embeddings
    cosine = 0
    if len(tokens_1) > 0 and len(tokens_2) > 0:
        cosine = model_glove_twitter.wv.n_similarity(tokens_1, tokens_2)
    results_glove.append([p, cosine])

print("Phrases most similar to '{0}' using glove word embeddings".format(query))
pd.DataFrame(results_glove, columns=["phrase", "score"]).sort_values(by=["score"], ascending=False)
Phrases most similar to 'barack hussain obama' using glove word embeddings
print("Phrases most similar to '{0}' using jaccard similarity".format(query)) pd.DataFrame(results_jaccard,columns=["phrase","score"]).sort_values(by=["score"],ascending=False)
Phrases most similar to 'barack hussain obama' using jaccard similarity
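For reference, n_similarity scores two bags of words by (roughly) averaging the word vectors on each side and taking the cosine between the two averages, which is why a single shared token like "obama" already produces a high score. A rough numpy sketch of that idea, assuming all tokens are in the vocabulary (an approximation, not gensim's exact implementation):

import numpy as np

def rough_phrase_similarity(model, tokens_1, tokens_2):
    # average the word vectors of each phrase, then take the cosine of the averages
    m1 = np.mean([model[t] for t in tokens_1], axis=0)
    m2 = np.mean([model[t] for t in tokens_2], axis=0)
    return np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2))

rough_phrase_similarity(model_glove_twitter, ["michelle", "obama"], ["barack", "obama"])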

Pre-trained: GloVe Wikipedia + Gigaword

The example below uses pre-trained GloVe vectors based on Wikipedia 2014 and Gigaword. The original source of these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

# again, download and load the model
model_gigaword = api.load("glove-wiki-gigaword-100")
[==================================================] 100.0% 128.1/128.1MB downloaded
# find similarity
model_gigaword.wv.most_similar(positive=['dirty','grimy'], topn=10)
[('filthy', 0.7690386176109314), ('smelly', 0.7392697334289551), ('shabby', 0.7025482654571533), ('dingy', 0.7022336721420288), ('grubby', 0.675451397895813), ('grungy', 0.6414023637771606), ('dank', 0.626369833946228), ('sweaty', 0.622745156288147), ('dreary', 0.6216243505477905), ('gritty', 0.621574878692627)]
model_gigaword.wv.most_similar(positive=["summer","winter"],topn=10)
[('spring', 0.8519278764724731), ('autumn', 0.7865706086158752), ('olympics', 0.6915044784545898), ('weekend', 0.6908973455429077), ('days', 0.6872981786727905), ('during', 0.6861997842788696), ('season', 0.6849778890609741), ('year', 0.6827663779258728), ('rainy', 0.6744829416275024), ('day', 0.671191930770874)]
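Since both pre-trained models are now loaded, it's easy to see how much the training corpus shapes the neighborhoods: the same query against the Twitter and Wikipedia/Gigaword vectors can return quite different neighbors. A quick sketch (output omitted here):

# compare the two pre-trained models on the same query
print(model_glove_twitter.wv.most_similar("policies", topn=5))
print(model_gigaword.wv.most_similar("policies", topn=5))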

Load a dataset and train a model

Instead of loading pre-trained embeddings, you can also load a corpus and train a model on it on demand. The list of datasets you can download can be found here: https://github.com/RaRe-Technologies/gensim-data#datasets

from gensim.models.word2vec import Word2Vec

# this loads the text8 dataset
corpus = api.load('text8')

# train a Word2Vec model from the corpus
model_text8 = Word2Vec(corpus, iter=10, size=150, window=10, min_count=2, workers=10)
[==================================================] 100.0% 31.6/31.6MB downloaded
# similarity
model_text8.wv.most_similar("shocked")
[('outraged', 0.7200734615325928), ('surprised', 0.6967819333076477), ('greeted', 0.6692871451377869), ('angered', 0.6468496322631836), ('confronted', 0.6217055320739746), ('beaten', 0.6206371188163757), ('betrayed', 0.6194607019424438), ('disgusted', 0.6146512031555176), ('amused', 0.6022583842277527), ('offended', 0.6014840602874756)]
# similarity between two different words
model_text8.wv.similarity(w1="dirty", w2="smelly")
0.45678782
# Which one is the odd one out in this list?
model_text8.wv.doesnt_match(["cat","dog","france"])
'france'
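Once training is done, you will usually want to persist the model rather than retrain it each time. A minimal sketch using gensim's standard save/load API (the file name is just an example):

# persist the trained model and reload it later
model_text8.save("text8_word2vec.model")

loaded = Word2Vec.load("text8_word2vec.model")
loaded.wv.most_similar("shocked", topn=5)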