Using pre-trained embeddings and NLP corpora
Gensim has some really nice functionality: it lets you load pre-trained GloVe and Word2Vec embeddings directly through its API. In addition, there are re-usable corpora that you can download and immediately use to train a Word2Vec embedding. The code snippets below show you how. The source of the embeddings can be found here: https://github.com/RaRe-Technologies/gensim-data.
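The entry point for all of this is the `gensim.downloader` module. As a minimal sketch, you can list the models and corpora it knows about before downloading anything (the slicing to five names is just to keep the output short):

```python
import gensim.downloader as api

# Query the gensim-data catalogue: pre-trained models and downloadable corpora.
info = api.info()
print(list(info["models"].keys())[:5])   # a few pre-trained embedding names
print(list(info["corpora"].keys())[:5])  # a few corpus names
```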
I'll have to warn you that I'm not impressed with the quality of the pre-trained word embeddings; either the dataset is noisy or it's just too general. More on this later.
Imports
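The original import cell isn't reproduced here; a minimal set that covers the snippets below would look roughly like this:

```python
import gensim.downloader as api       # downloads pre-trained models and corpora
from gensim.models import Word2Vec   # used later to train a model from a corpus
```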
Pre-trained: Twitter GloVe Embeddings
This first step downloads the pre-trained embeddings and loads them for re-use. As the name suggests, these are GloVe embeddings built from tweets: 2B tweets, 27B tokens, 1.2M vocab, uncased. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The 25 in the model name refers to the dimensionality of the vectors.
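Loading looks roughly like this (the model name `"glove-twitter-25"` is the one registered in gensim-data; the file is downloaded once and cached under `~/gensim-data`):

```python
import gensim.downloader as api

# Downloads the 25-dimensional Twitter GloVe vectors on first use and caches
# them locally; subsequent calls load straight from the cache.
model = api.load("glove-twitter-25")
```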
Once you have loaded the pre-trained model, just use it as you would with any gensim word2vec model. Here are a few similarity examples:
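The query words below are my own illustrative picks, not necessarily the ones used in the notebook:

```python
# Nearest neighbours by cosine similarity in the embedding space.
print(model.most_similar("twitter", topn=5))

# Cosine similarity between two individual words.
print(model.similarity("facebook", "twitter"))
```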
Which of these words doesn't fit?
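This uses gensim's `doesnt_match`, which returns the word least similar to the mean of the others; the word list here is a hypothetical example:

```python
# Picks the odd one out relative to the average vector of the group.
print(model.doesnt_match(["breakfast", "lunch", "dinner", "laptop"]))  # -> 'laptop', hopefully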
Word vectors for trump and obama
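Indexing the loaded model with a word returns its raw 25-dimensional vector:

```python
print(model["trump"])
print(model["obama"])
```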
Rank phrases by similarity
The goal here is, given a query phrase, to rank all other phrases by semantic similarity (using the GloVe Twitter embeddings) and compare that with surface-level similarity using the Jaccard similarity index.
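A minimal sketch of the comparison, assuming short whitespace-tokenisable phrases; the query and phrase list are made up for illustration. Semantic similarity uses gensim's `n_similarity` (cosine between the mean vectors of the two token sets), Jaccard uses plain set overlap:

```python
query = "president of the united states"
phrases = [
    "obama speaks to the media",
    "the president greets the press",
    "bananas are a good source of potassium",
]

def jaccard(a, b):
    """Surface-level similarity: overlap of the two token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def semantic(a, b):
    """Semantic similarity: cosine between the mean GloVe vectors of each
    phrase, ignoring out-of-vocabulary tokens."""
    ta = [w for w in a.split() if w in model]
    tb = [w for w in b.split() if w in model]
    return model.n_similarity(ta, tb) if ta and tb else 0.0

# Rank by semantic similarity, printing the Jaccard score alongside for contrast.
for p in sorted(phrases, key=lambda p: semantic(query, p), reverse=True):
    print(f"semantic={semantic(query, p):.3f}  jaccard={jaccard(query, p):.3f}  {p}")
```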
Pre-trained: GloVe Wikipedia + Gigaword
The example below uses pre-trained GloVe vectors based on Wikipedia 2014 and Gigaword. The original source of these embeddings can be found here: https://nlp.stanford.edu/projects/glove/
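Loading follows the same pattern as the Twitter model; I'm assuming the 100-dimensional variant here, but 50/200/300-dimensional versions exist under similar names in gensim-data:

```python
# Wikipedia 2014 + Gigaword 5 GloVe vectors, 100 dimensions.
wiki_model = api.load("glove-wiki-gigaword-100")
print(wiki_model.most_similar("king", topn=5))
```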
Load a dataset and train a model
Instead of loading pre-trained embeddings, you can also download a corpus and train a model on it on demand. The list of datasets you can download can be found here: https://github.com/RaRe-Technologies/gensim-data#datasets
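A minimal sketch using the `text8` corpus (one of the datasets listed at the link above); the loader returns an iterable of tokenised sentences, which is exactly what `Word2Vec` expects:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Download (or load from cache) the text8 corpus and train with default hyperparameters.
corpus = api.load("text8")
w2v = Word2Vec(corpus)

# The trained vectors live on the .wv attribute.
print(w2v.wv.most_similar("car", topn=5))
```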