Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.
Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
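As a rough sketch of the loading step, the snippet below reads a gzipped file that holds one JSON object per line, using only the standard library. With pandas, `pd.read_json(path, lines=True)` should do the equivalent in one call. The helper names `read_reviews` and `review_texts` are made up for this example, and the `reviewText` field name is an assumption about the dataset's schema.

```python
import gzip
import json


def read_reviews(path):
    """Yield one review dict per line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def review_texts(path, field="reviewText"):
    """Collect the text body of every review that has a non-empty one.

    The default field name is an assumption about the dataset's schema.
    """
    return [review[field] for review in read_reviews(path) if review.get(field)]
```

Calling `review_texts("reviews_Cell_Phones_and_Accessories_5.json.gz")` on the downloaded file would then give a list of raw review strings, ready for preprocessing.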
Simple Preprocessing & Tokenization
The first step in most data science tasks is to clean the data. For NLP, this typically means converting all words to lower case, trimming whitespace, and removing punctuation. We will do the same here.
Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.
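The cleaning steps described above can be sketched in plain Python as follows. This is only an illustration: gensim ships `gensim.utils.simple_preprocess`, which does a similar job of lowercasing and tokenizing. The `preprocess` function and the tiny `STOP_WORDS` set are invented for this example, and reducing words to their root forms is not shown.

```python
import re

# A deliberately tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

# Keep runs of ASCII letters only, which drops punctuation, digits and extra spaces.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def preprocess(text, remove_stop_words=False):
    """Lowercase the text and split it into alphabetic tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


print(preprocess("The battery is GREAT, and the case fits!"))
print(preprocess("The battery is GREAT, and the case fits!", remove_stop_words=True))
```

Note that this simple pattern splits contractions such as "don't" into two tokens, which a production tokenizer would usually handle more carefully.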
Training the Word2Vec Model
Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and 10 words after it. Set min_count to 2 so that words appearing fewer than 2 times in the whole corpus are ignored; min_count filters by word frequency, not by sentence length.
The workers parameter sets how many CPU threads are used for training.
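To make the window and min_count parameters concrete, here is a small standalone sketch of what they mean conceptually. It does not reproduce gensim's internals, which add refinements omitted here. The function names `filter_rare_words` and `context_pairs` are hypothetical.

```python
from collections import Counter


def filter_rare_words(sentences, min_count=2):
    """Drop every word whose total count across the corpus is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word for word in sentence if counts[word] >= min_count]
        for sentence in sentences
    ]


def context_pairs(sentence, window=10):
    """Return (center, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


corpus = [["good", "phone", "case"], ["good", "case", "cheap"]]
print(filter_rare_words(corpus, min_count=2))
print(context_pairs(["a", "b", "c"], window=1))
```

In the first call, "phone" and "cheap" each occur only once in the corpus, so they are dropped even though both sentences have three words.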
Initialize the model
Build Vocabulary
Train the Word2Vec Model
Save the Model
Save the model so that it can be reused in other applications without retraining.
Finding Similar Words and Similarity Between Words
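With a trained gensim model, `model.wv.most_similar("word")` returns the nearest words and `model.wv.similarity("word1", "word2")` returns a score for a pair. Both are based on the cosine similarity between word vectors. The sketch below shows that idea in plain Python; the three-dimensional vectors are made up purely for illustration and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical word vectors, invented for this example.
vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.3, 0.1],
}

print(most_similar("good", vectors))
```

Scores close to 1 mean the vectors point in nearly the same direction, and negative scores mean they point in roughly opposite directions.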
Further Reading
You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html
Explore other Datasets related to Amazon Reviews: http://jmcauley.ucsd.edu/data/amazon/
Exercise
Train a word2vec model on the Sports & Outdoors Reviews Dataset. Once the model is trained, find the words most similar to 'awful' and compute the similarity between the following word pairs: ('good', 'great'), ('slow', 'steady')
Click here for solution.