GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/pre-trained-embeddings/Pre-trained embeddings.ipynb
Kernel: Python 3

Using pre-trained embeddings and NLP corpora

Gensim has some really nice functionality: it lets you use pre-trained GloVe and Word2Vec embeddings directly through its downloader API. In addition, there are some reusable corpora that you can download and immediately use to train a Word2Vec model. The code snippets below show you how. The source of the embeddings and corpora is the gensim-data repository: https://github.com/RaRe-Technologies/gensim-data.
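If you want to see everything that gensim-data offers before downloading anything, the downloader API can list the available models and corpora. A minimal sketch (this cell is not part of the original notebook):

import gensim.downloader as api

# list what gensim-data offers: pre-trained models and downloadable corpora
info = api.info()
print(list(info["models"].keys()))
print(list(info["corpora"].keys()))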

I'll have to warn you that I'm not impressed with the quality of these pre-trained word embeddings. Either the dataset is noisy or it's just too general. More on that later.

import warnings
warnings.filterwarnings('ignore')

Imports

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

Pre-trained: Twitter GloVe Embeddings

This first step downloads the pre-trained embeddings and loads them for re-use. Note that these are GloVe embeddings built from tweets, as the name suggests. The vectors are based on 2B tweets and 27B tokens, with a 1.2M-word uncased vocabulary. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name refers to the dimensionality of the vectors.

# download the model and return as object ready for use
model_glove_twitter = api.load("glove-twitter-25")
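As a quick sanity check (not in the original notebook), you can confirm that the vectors are indeed 25-dimensional and see how large the pre-trained vocabulary is. This assumes the gensim 3.x API used throughout this notebook:

# the vectors should be 25-dimensional, matching the model name
print(model_glove_twitter.vector_size)

# number of words in the pre-trained vocabulary
print(len(model_glove_twitter.wv.vocab))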

Once you have loaded the pre-trained model, just use it as you would with any gensim word2vec model. Here are a few similarity examples:

model_glove_twitter.wv.most_similar("pelosi",topn=10)
[('clegg', 0.9653651118278503), ('miliband', 0.9515050053596497), ('bachmann', 0.9484400749206543), ('mcconnell', 0.9416399002075195), ('carney', 0.934025764465332), ('coulter', 0.9311323761940002), ('boehner', 0.9286302328109741), ('santorum', 0.9269059300422668), ('farage', 0.919365406036377), ('mourdock', 0.9186689853668213)]
model_glove_twitter.wv.most_similar("policies",topn=10)
[('policy', 0.9484813213348389), ('reforms', 0.9403933882713318), ('laws', 0.94012051820755), ('government', 0.923071026802063), ('regulations', 0.916893482208252), ('economy', 0.9110006093978882), ('immigration', 0.9105910062789917), ('legislation', 0.908964991569519), ('govt', 0.9054746627807617), ('regulation', 0.9050779342651367)]
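You can also try the usual analogy-style queries via most_similar's positive/negative arguments. A quick sketch; with only 25 dimensions and Twitter data the neighbors can be noisy, so treat the output as illustrative:

# analogy-style query: words close to "woman" + "king" - "man"
model_glove_twitter.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=5)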

Which of these words doesn't fit?

# what doesn't fit?
model_glove_twitter.wv.doesnt_match(["trump","bernie","obama","pelosi","orange"])
'orange'
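As I understand gensim's implementation, doesnt_match normalizes each word vector, averages them, and returns the word least similar to that mean. Here is a rough numpy sketch of the same idea (an approximation for illustration, not gensim's exact code):

import numpy as np

def rough_doesnt_match(model, words):
    # unit-normalize each word vector, then compare each word to the mean vector
    vecs = np.array([model[w] / np.linalg.norm(model[w]) for w in words])
    mean = vecs.mean(axis=0)
    sims = vecs.dot(mean)  # cosine-style similarity to the mean
    return words[int(np.argmin(sims))]

rough_doesnt_match(model_glove_twitter, ["trump", "bernie", "obama", "pelosi", "orange"])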

Word vectors for trump and obama

# show weight vector for trump and obama
model_glove_twitter["trump"], model_glove_twitter["obama"]
(array([-0.56174 , 0.69419 , 0.16733 , 0.055867, -0.26266 , -0.6303 , -0.28311 , -0.88244 , 0.57317 , -0.82376 , 0.46728 , 0.48607 , -2.1942 , -0.41972 , 0.31795 , -0.70063 , 0.060693, 0.45279 , 0.6564 , 0.20738 , 0.84496 , -0.087537, -0.38856 , -0.97028 , -0.40427 ], dtype=float32), array([ 0.77126 , 0.81259 , -0.5901 , -0.015908, -0.082797, -1.2261 , 0.098286, 0.087488, 0.012586, -0.35884 , 0.80733 , 0.12569 , -4.0522 , 0.14856 , 0.6988 , -0.78948 , -0.77125 , 0.49512 , 0.16366 , -0.9713 , 0.95064 , 0.19921 , -0.27903 , -1.6844 , -0.79424 ], dtype=float32))
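To make the similarity scores above concrete, the cosine similarity that gensim reports can be reproduced directly from these raw vectors with numpy. A minimal sketch (not in the original notebook):

import numpy as np

v1 = model_glove_twitter["trump"]
v2 = model_glove_twitter["obama"]

# cosine similarity = dot product of the unit-normalized vectors
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(model_glove_twitter.wv.similarity("trump", "obama"))  # should match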

Rank phrases by similarity

The goal here is, given a query phrase, to rank all the other phrases by semantic similarity (using the GloVe Twitter embeddings) and compare that with surface-level similarity using the Jaccard similarity index.

import pandas as pd

phrases = ["barrack obama", "barrack h. obama", "barrack hussein obama",
           "michelle obama", "donald trump", "melania trump"]
query = "barack hussain obama"

results_glove = []
results_jaccard = []

def compute_jaccard(t1, t2):
    intersect = [value for value in t1 if value in t2]
    union = []
    union.extend(t1)
    union.extend(t2)
    union = list(set(union))
    jaccard = len(intersect) / (len(union) + 0.01)
    return jaccard

for p in phrases:
    # keep only tokens that are in the embedding vocabulary
    tokens_1 = [t for t in p.split() if t in model_glove_twitter.wv.vocab]
    tokens_2 = [t for t in query.split() if t in model_glove_twitter.wv.vocab]

    # compute jaccard similarity
    jaccard = compute_jaccard(tokens_1, tokens_2)
    results_jaccard.append([p, jaccard])

    # compute cosine similarity using word embeddings
    cosine = 0
    if len(tokens_1) > 0 and len(tokens_2) > 0:
        cosine = model_glove_twitter.wv.n_similarity(tokens_1, tokens_2)
    results_glove.append([p, cosine])

print("Phrases most similar to '{0}' using glove word embeddings".format(query))
pd.DataFrame(results_glove, columns=["phrase", "score"]).sort_values(by=["score"], ascending=False)
Phrases most similar to 'barack hussain obama' using glove word embeddings
print("Phrases most similar to '{0}' using jaccard similarity".format(query)) pd.DataFrame(results_jaccard,columns=["phrase","score"]).sort_values(by=["score"],ascending=False)
Phrases most similar to 'barack hussain obama' using jaccard similarity
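For reference, n_similarity scores two bags of words by (roughly) averaging the word vectors on each side and taking the cosine between the two averages, which is why a single shared token like "obama" already produces a high score. A rough numpy sketch of that idea, assuming all tokens are in the vocabulary (an approximation, not gensim's exact implementation):

import numpy as np

def rough_phrase_similarity(model, tokens_1, tokens_2):
    # average the word vectors of each phrase, then take the cosine of the averages
    m1 = np.mean([model[t] for t in tokens_1], axis=0)
    m2 = np.mean([model[t] for t in tokens_2], axis=0)
    return np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2))

rough_phrase_similarity(model_glove_twitter, ["michelle", "obama"], ["barack", "obama"])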

Pre-trained: GloVe Wikipedia + Gigaword

The example below uses pre-trained GloVe vectors based on Wikipedia 2014 and Gigaword. The original source of these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

# again, download and load the model
model_gigaword = api.load("glove-wiki-gigaword-100")
[==================================================] 100.0% 128.1/128.1MB downloaded
# find similarity
model_gigaword.wv.most_similar(positive=['dirty','grimy'], topn=10)
[('filthy', 0.7690386176109314), ('smelly', 0.7392697334289551), ('shabby', 0.7025482654571533), ('dingy', 0.7022336721420288), ('grubby', 0.675451397895813), ('grungy', 0.6414023637771606), ('dank', 0.626369833946228), ('sweaty', 0.622745156288147), ('dreary', 0.6216243505477905), ('gritty', 0.621574878692627)]
model_gigaword.wv.most_similar(positive=["summer","winter"],topn=10)
[('spring', 0.8519278764724731), ('autumn', 0.7865706086158752), ('olympics', 0.6915044784545898), ('weekend', 0.6908973455429077), ('days', 0.6872981786727905), ('during', 0.6861997842788696), ('season', 0.6849778890609741), ('year', 0.6827663779258728), ('rainy', 0.6744829416275024), ('day', 0.671191930770874)]
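Since both pre-trained models are now loaded, it's easy to see how much the training corpus shapes the neighborhoods: the same query against the Twitter and Wikipedia/Gigaword vectors can return quite different neighbors. A quick sketch (output omitted here):

# compare the two pre-trained models on the same query
print(model_glove_twitter.wv.most_similar("policies", topn=5))
print(model_gigaword.wv.most_similar("policies", topn=5))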

Load a dataset and train a model

Instead of loading pre-trained embeddings, you can also load a corpus and train a model on it on demand. The list of datasets you can download can be found here: https://github.com/RaRe-Technologies/gensim-data#datasets

from gensim.models.word2vec import Word2Vec

# this loads the text8 dataset
corpus = api.load('text8')

# train a Word2Vec model from the corpus
model_text8 = Word2Vec(corpus, iter=10, size=150, window=10, min_count=2, workers=10)
[==================================================] 100.0% 31.6/31.6MB downloaded
# similarity
model_text8.wv.most_similar("shocked")
[('outraged', 0.7200734615325928), ('surprised', 0.6967819333076477), ('greeted', 0.6692871451377869), ('angered', 0.6468496322631836), ('confronted', 0.6217055320739746), ('beaten', 0.6206371188163757), ('betrayed', 0.6194607019424438), ('disgusted', 0.6146512031555176), ('amused', 0.6022583842277527), ('offended', 0.6014840602874756)]
# similarity between two different words
model_text8.wv.similarity(w1="dirty", w2="smelly")
0.45678782
# Which one is the odd one out in this list?
model_text8.wv.doesnt_match(["cat","dog","france"])
'france'
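Once training is done, you will usually want to persist the model rather than retrain it each time. A minimal sketch using gensim's standard save/load API (the file name is just an example):

# persist the trained model and reload it later
model_text8.save("text8_word2vec.model")

loaded = Word2Vec.load("text8_word2vec.model")
loaded.wv.most_similar("shocked", topn=5)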