kavgan
GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/word2vec/Word2Vec.ipynb
Kernel: Python 3

Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying "show me your friends, and I'll tell you who you are". So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words shocked, appalled and astonished are typically used in a similar context.
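To make the "company it keeps" idea concrete, here is a minimal sketch that compares words by the neighbors they appear with. It is not how Word2Vec itself is trained; it only counts context words in a small made-up corpus and compares those counts with cosine similarity. The corpus, the `context_counts` and `cosine` helpers, and the window size of 2 are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus, made up for illustration (not taken from the OpinRank data).
sentences = [
    "i was shocked by the dirty room".split(),
    "i was appalled by the dirty room".split(),
    "i was astonished by the rude staff".split(),
    "the pool was clean and quiet".split(),
]

def context_counts(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

contexts = context_counts(sentences)
sim_close = cosine(contexts["shocked"], contexts["appalled"])  # same neighbors
sim_far = cosine(contexts["shocked"], contexts["pool"])        # different neighbors
print("shocked vs appalled:", sim_close)
print("shocked vs pool:", sim_far)
```

In this toy corpus "shocked" and "appalled" share exactly the same neighbors, so they score higher than "shocked" and "pool". Word2Vec learns dense vectors rather than raw counts, but the underlying intuition is the same.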

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance, but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the original Word2Vec implementation by Google and extended with additional functionality.

Imports and logging

First, we start with our imports and get logging established:

# imports needed and set up logging
import gzip
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Dataset

Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the OpinRank dataset. This dataset has full user reviews of cars and hotels. I have concatenated all of the hotel reviews into one big file, which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset here.

A note to avoid confusion: although Gensim's Word2Vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger piece of text), and it should not make much of a difference.

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

data_file = "reviews_data.txt.gz"

with gzip.open('reviews_data.txt.gz', 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break
b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Beijing, then you will be ok.I chose to have some breakfast in the hotel, which was really tasty and there was a good selection of dishes. There are a couple of computers to use in the communal area, as well as a pool table. There is also a small swimming pool and a gym area.I would definitely stay in this hotel again, but only if I did not plan to travel to central Beijing, as it can take a long time. The location is ok if you plan to do a lot of shopping, as there is a big shopping centre just few minutes away from the hotel and there are plenty of eating options around, including restaurants that serve a dog meat!\t\r\n"

Read files into a list

Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. Notice in the code below that I am reading the compressed file directly. I'm also doing some mild pre-processing of the reviews using gensim.utils.simple_preprocess(line). This does some basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official Gensim documentation site.
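To give a feel for what this kind of pre-processing produces, here is a rough standard-library approximation. It is not Gensim's implementation: `rough_preprocess` and `TOKEN_PATTERN` are hypothetical names, and the length limits of 2 and 15 characters are my understanding of the library defaults rather than something stated in this tutorial. The real function may treat accents, underscores and other edge cases differently.

```python
import re

# Runs of word characters that contain no digits. This is a rough stand-in
# for an "alphabetic tokens only" rule, not Gensim's actual pattern.
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def rough_preprocess(line, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    if isinstance(line, bytes):
        # The reviews file is opened in binary mode, so lines arrive as bytes.
        line = line.decode("utf-8", errors="ignore")
    tokens = TOKEN_PATTERN.findall(line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_preprocess(b"Oct 12 2009 \tNice trendy hotel location not too bad.")
print(tokens)
```

The date digits and the punctuation disappear and everything is lowercased, so each review ends up as a plain list of word tokens, which is the shape the Word2Vec model expects.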

def read_input(input_file):
    """This method reads the input file which is in gzip format"""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")
2018-01-28 00:27:46,482 : INFO : reading file reviews_data.txt.gz...this may take a while
2018-01-28 00:27:46,484 : INFO : read 0 reviews
2018-01-28 00:27:48,868 : INFO : read 10000 reviews
2018-01-28 00:27:51,350 : INFO : read 20000 reviews
2018-01-28 00:27:54,287 : INFO : read 30000 reviews
2018-01-28 00:27:57,177 : INFO : read 40000 reviews
2018-01-28 00:28:00,147 : INFO : read 50000 reviews
2018-01-28 00:28:03,028 : INFO : read 60000 reviews
2018-01-28 00:28:05,508 : INFO : read 70000 reviews
2018-01-28 00:28:08,176 : INFO : read 80000 reviews
2018-01-28 00:28:10,532 : INFO : read 90000 reviews
2018-01-28 00:28:12,768 : INFO : read 100000 reviews
2018-01-28 00:28:14,962 : INFO : read 110000 reviews
2018-01-28 00:28:17,314 : INFO : read 120000 reviews
2018-01-28 00:28:19,624 : INFO : read 130000 reviews
2018-01-28 00:28:21,985 : INFO : read 140000 reviews
2018-01-28 00:28:24,178 : INFO : read 150000 reviews
2018-01-28 00:28:26,464 : INFO : read 160000 reviews
2018-01-28 00:28:29,481 : INFO : read 170000 reviews
2018-01-28 00:28:31,808 : INFO : read 180000 reviews
2018-01-28 00:28:34,095 : INFO : read 190000 reviews
2018-01-28 00:28:36,597 : INFO : read 200000 reviews
2018-01-28 00:28:39,192 : INFO : read 210000 reviews
2018-01-28 00:28:41,684 : INFO : read 220000 reviews
2018-01-28 00:28:43,871 : INFO : read 230000 reviews
2018-01-28 00:28:46,247 : INFO : read 240000 reviews
2018-01-28 00:28:48,548 : INFO : read 250000 reviews
2018-01-28 00:28:50,053 : INFO : Done reading data file

Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the documents). So, we are essentially passing on a list of lists, where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.
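As a rough picture of what "creating a vocabulary" involves, the sketch below counts every token across a list of lists and keeps only the words that occur at least `min_count` times, which is the role the min_count parameter plays in the training call used in this tutorial. This is a simplified illustration with made-up documents and a hypothetical `build_vocabulary` helper; Gensim's own vocabulary building also handles things such as downsampling of frequent words.

```python
from collections import Counter

def build_vocabulary(documents, min_count=2):
    """Count all tokens and keep the unique words seen at least min_count times."""
    counts = Counter(token for doc in documents for token in doc)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept

# Made-up tokenized reviews, one inner list per review.
toy_documents = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["clean", "pool"],
]

counts, vocab = build_vocabulary(toy_documents, min_count=2)
print("all word types:", len(counts))
print("kept after min_count:", sorted(vocab))
```

Words that appear only once carry very little context to learn from, so dropping them shrinks the vocabulary considerably while removing only a small share of the corpus.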

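After building the vocabulary, we call train(...) to train the Word2Vec model. Note that when the corpus is passed to the constructor, as in the code below, Gensim builds the vocabulary and already runs an initial round of training; the explicit train(...) call then trains for additional epochs. Training on the OpinRank dataset takes about 10 minutes, so please be patient while running your code on this dataset.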

Behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
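A small sketch may help show why the hidden-layer weights can be read as word vectors. If a single input word is represented as a one-hot vector, multiplying it by the hidden-layer weight matrix simply selects that word's row, so each row acts as the vector for one word. The tiny vocabulary, the random weights and the helper names below are assumptions for illustration only; in a trained model the weights are learned, and in the CBOW variant the hidden layer combines the rows of several context words rather than just one.

```python
import random

random.seed(0)

vocab = ["clean", "dirty", "room", "staff"]
vector_size = 3

# Hidden-layer weights: one row per vocabulary word.
# Random here; in a real model these values are learned during training.
W = [[random.uniform(-0.5, 0.5) for _ in range(vector_size)] for _ in vocab]

def one_hot(index, length):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(length)]

def hidden_activation(x, weights):
    """Multiply a 1 x V input vector by a V x N weight matrix."""
    rows, cols = len(weights), len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(rows)) for j in range(cols)]

idx = vocab.index("room")
h = hidden_activation(one_hot(idx, len(vocab)), W)

# The hidden activation equals the row of W that belongs to "room".
print(h)
print(W[idx])
```

Because the multiplication only picks out one row, implementations can skip it and look the row up directly, which is why the learned weight matrix can be used as a table of word vectors once training is done.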

# note: in Gensim 4.0 and later, the `size` parameter is named `vector_size`
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents, total_examples=len(documents), epochs=10)
2018-01-26 22:57:00,707 : WARNING : consider setting layer size to a multiple of 4 for greater performance 2018-01-26 22:57:00,709 : INFO : collecting all words and their counts 2018-01-26 22:57:00,710 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types 2018-01-26 22:57:01,045 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types 2018-01-26 22:57:01,399 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types 2018-01-26 22:57:01,803 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types 2018-01-26 22:57:02,188 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types 2018-01-26 22:57:02,560 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types 2018-01-26 22:57:02,905 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types 2018-01-26 22:57:03,202 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types 2018-01-26 22:57:03,604 : INFO : PROGRESS: at sentence #80000, processed 14099751 words, keeping 88454 word types 2018-01-26 22:57:03,909 : INFO : PROGRESS: at sentence #90000, processed 15662149 words, keeping 93352 word types 2018-01-26 22:57:04,174 : INFO : PROGRESS: at sentence #100000, processed 17164487 words, keeping 97881 word types 2018-01-26 22:57:04,443 : INFO : PROGRESS: at sentence #110000, processed 18652292 words, keeping 102127 word types 2018-01-26 22:57:04,724 : INFO : PROGRESS: at sentence #120000, processed 20152529 words, keeping 105918 word types 2018-01-26 22:57:04,992 : INFO : PROGRESS: at sentence #130000, processed 21684330 words, keeping 110099 word types 2018-01-26 22:57:05,424 : INFO : PROGRESS: at sentence #140000, processed 23330206 words, keeping 114103 word types 2018-01-26 22:57:05,837 : INFO : PROGRESS: at sentence #150000, processed 24838754 words, keeping 118169 word types 
2018-01-26 22:57:06,261 : INFO : PROGRESS: at sentence #160000, processed 26390910 words, keeping 118665 word types 2018-01-26 22:57:06,606 : INFO : PROGRESS: at sentence #170000, processed 27913916 words, keeping 123350 word types 2018-01-26 22:57:06,902 : INFO : PROGRESS: at sentence #180000, processed 29535612 words, keeping 126742 word types 2018-01-26 22:57:07,191 : INFO : PROGRESS: at sentence #190000, processed 31096459 words, keeping 129841 word types 2018-01-26 22:57:07,495 : INFO : PROGRESS: at sentence #200000, processed 32805271 words, keeping 133249 word types 2018-01-26 22:57:07,774 : INFO : PROGRESS: at sentence #210000, processed 34434198 words, keeping 136358 word types 2018-01-26 22:57:08,086 : INFO : PROGRESS: at sentence #220000, processed 36083482 words, keeping 139412 word types 2018-01-26 22:57:08,367 : INFO : PROGRESS: at sentence #230000, processed 37571762 words, keeping 142393 word types 2018-01-26 22:57:08,669 : INFO : PROGRESS: at sentence #240000, processed 39138190 words, keeping 145226 word types 2018-01-26 22:57:08,951 : INFO : PROGRESS: at sentence #250000, processed 40695049 words, keeping 147960 word types 2018-01-26 22:57:09,118 : INFO : collected 150053 word types from a corpus of 41519355 raw words and 255404 sentences 2018-01-26 22:57:09,119 : INFO : Loading a fresh vocabulary 2018-01-26 22:57:10,111 : INFO : min_count=2 retains 70538 unique words (47% of original 150053, drops 79515) 2018-01-26 22:57:10,112 : INFO : min_count=2 leaves 41439840 word corpus (99% of original 41519355, drops 79515) 2018-01-26 22:57:10,297 : INFO : deleting the raw counts dictionary of 150053 items 2018-01-26 22:57:10,303 : INFO : sample=0.001 downsamples 55 most-common words 2018-01-26 22:57:10,303 : INFO : downsampling leaves estimated 30349255 word corpus (73.2% of prior 41439840) 2018-01-26 22:57:10,304 : INFO : estimated required memory for 70538 words and 150 dimensions: 119914600 bytes 2018-01-26 22:57:10,567 : INFO : resetting layer 
weights 2018-01-26 22:57:11,412 : INFO : training model with 10 workers on 70538 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10 2018-01-26 22:57:12,424 : INFO : PROGRESS: at 1.05% examples, 1630633 words/s, in_qsize 16, out_qsize 3 2018-01-26 22:57:13,428 : INFO : PROGRESS: at 1.98% examples, 1573534 words/s, in_qsize 16, out_qsize 3 2018-01-26 22:57:14,446 : INFO : PROGRESS: at 2.84% examples, 1550376 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:15,448 : INFO : PROGRESS: at 3.62% examples, 1511761 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:16,461 : INFO : PROGRESS: at 4.46% examples, 1506343 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:17,472 : INFO : PROGRESS: at 5.34% examples, 1496361 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:18,477 : INFO : PROGRESS: at 6.47% examples, 1502239 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:19,488 : INFO : PROGRESS: at 7.49% examples, 1497357 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:20,494 : INFO : PROGRESS: at 8.41% examples, 1471156 words/s, in_qsize 19, out_qsize 3 2018-01-26 22:57:21,501 : INFO : PROGRESS: at 9.23% examples, 1435757 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:22,504 : INFO : PROGRESS: at 9.84% examples, 1385686 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:57:23,513 : INFO : PROGRESS: at 10.73% examples, 1382376 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:24,524 : INFO : PROGRESS: at 11.80% examples, 1391549 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:25,532 : INFO : PROGRESS: at 12.85% examples, 1397274 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:26,532 : INFO : PROGRESS: at 13.87% examples, 1405490 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:27,538 : INFO : PROGRESS: at 14.91% examples, 1413780 words/s, in_qsize 20, out_qsize 0 2018-01-26 22:57:28,542 : INFO : PROGRESS: at 15.76% examples, 1409902 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:57:29,544 : INFO : PROGRESS: at 16.77% examples, 
1415793 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:30,553 : INFO : PROGRESS: at 17.53% examples, 1400212 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:57:31,559 : INFO : PROGRESS: at 18.30% examples, 1384035 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:32,564 : INFO : PROGRESS: at 18.86% examples, 1357194 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:33,578 : INFO : PROGRESS: at 19.76% examples, 1354258 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:34,600 : INFO : PROGRESS: at 20.65% examples, 1352727 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:35,610 : INFO : PROGRESS: at 21.52% examples, 1351669 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:36,614 : INFO : PROGRESS: at 22.26% examples, 1349588 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:37,622 : INFO : PROGRESS: at 23.01% examples, 1348751 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:38,638 : INFO : PROGRESS: at 23.76% examples, 1347862 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:39,645 : INFO : PROGRESS: at 24.46% examples, 1344198 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:40,652 : INFO : PROGRESS: at 25.20% examples, 1341222 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:41,664 : INFO : PROGRESS: at 26.13% examples, 1338831 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:42,668 : INFO : PROGRESS: at 27.01% examples, 1336250 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:43,669 : INFO : PROGRESS: at 27.93% examples, 1335194 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:44,671 : INFO : PROGRESS: at 28.95% examples, 1336363 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:45,675 : INFO : PROGRESS: at 29.96% examples, 1339479 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:46,679 : INFO : PROGRESS: at 30.89% examples, 1341175 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:47,684 : INFO : PROGRESS: at 31.97% examples, 1346648 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:48,689 : INFO : PROGRESS: at 33.00% examples, 
1348537 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:49,690 : INFO : PROGRESS: at 33.87% examples, 1348166 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:50,693 : INFO : PROGRESS: at 34.82% examples, 1349359 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:51,698 : INFO : PROGRESS: at 35.68% examples, 1349834 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:57:52,698 : INFO : PROGRESS: at 36.50% examples, 1346949 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:53,708 : INFO : PROGRESS: at 37.32% examples, 1344214 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:54,718 : INFO : PROGRESS: at 38.11% examples, 1338556 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:55,722 : INFO : PROGRESS: at 38.98% examples, 1336874 words/s, in_qsize 16, out_qsize 3 2018-01-26 22:57:56,725 : INFO : PROGRESS: at 39.89% examples, 1336433 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:57:57,734 : INFO : PROGRESS: at 40.79% examples, 1337006 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:57:58,737 : INFO : PROGRESS: at 41.64% examples, 1336655 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:57:59,755 : INFO : PROGRESS: at 42.35% examples, 1335389 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:00,762 : INFO : PROGRESS: at 42.99% examples, 1330720 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:01,770 : INFO : PROGRESS: at 43.69% examples, 1328709 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:02,777 : INFO : PROGRESS: at 44.35% examples, 1325646 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:03,778 : INFO : PROGRESS: at 44.94% examples, 1321895 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:04,782 : INFO : PROGRESS: at 45.73% examples, 1316902 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:05,790 : INFO : PROGRESS: at 46.61% examples, 1315035 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:06,793 : INFO : PROGRESS: at 47.43% examples, 1313003 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:07,798 : INFO : PROGRESS: at 48.46% examples, 
1314648 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:08,801 : INFO : PROGRESS: at 49.43% examples, 1315526 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:09,804 : INFO : PROGRESS: at 50.36% examples, 1315966 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:10,806 : INFO : PROGRESS: at 51.18% examples, 1313952 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:11,806 : INFO : PROGRESS: at 52.14% examples, 1315414 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:12,812 : INFO : PROGRESS: at 53.03% examples, 1313837 words/s, in_qsize 16, out_qsize 3 2018-01-26 22:58:13,814 : INFO : PROGRESS: at 53.89% examples, 1313909 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:14,822 : INFO : PROGRESS: at 54.74% examples, 1312597 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:15,824 : INFO : PROGRESS: at 55.58% examples, 1312952 words/s, in_qsize 19, out_qsize 2 2018-01-26 22:58:16,831 : INFO : PROGRESS: at 56.44% examples, 1312647 words/s, in_qsize 20, out_qsize 1 2018-01-26 22:58:17,834 : INFO : PROGRESS: at 57.28% examples, 1312103 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:18,847 : INFO : PROGRESS: at 58.25% examples, 1312649 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:19,852 : INFO : PROGRESS: at 59.19% examples, 1313508 words/s, in_qsize 20, out_qsize 2 2018-01-26 22:58:20,859 : INFO : PROGRESS: at 60.11% examples, 1313863 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:21,859 : INFO : PROGRESS: at 60.91% examples, 1312507 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:22,862 : INFO : PROGRESS: at 61.76% examples, 1312689 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:23,865 : INFO : PROGRESS: at 62.38% examples, 1310562 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:24,865 : INFO : PROGRESS: at 63.03% examples, 1308189 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:25,869 : INFO : PROGRESS: at 63.55% examples, 1303067 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:26,883 : INFO : PROGRESS: at 64.32% examples, 
1303472 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:27,908 : INFO : PROGRESS: at 64.96% examples, 1301893 words/s, in_qsize 16, out_qsize 3 2018-01-26 22:58:28,914 : INFO : PROGRESS: at 65.97% examples, 1302856 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:29,918 : INFO : PROGRESS: at 66.89% examples, 1302758 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:30,921 : INFO : PROGRESS: at 67.87% examples, 1303722 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:31,924 : INFO : PROGRESS: at 68.80% examples, 1303509 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:32,939 : INFO : PROGRESS: at 69.70% examples, 1302910 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:33,942 : INFO : PROGRESS: at 70.51% examples, 1301366 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:34,946 : INFO : PROGRESS: at 71.26% examples, 1298951 words/s, in_qsize 20, out_qsize 0 2018-01-26 22:58:35,958 : INFO : PROGRESS: at 72.04% examples, 1296838 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:36,961 : INFO : PROGRESS: at 72.97% examples, 1296579 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:37,967 : INFO : PROGRESS: at 73.82% examples, 1296453 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:38,971 : INFO : PROGRESS: at 74.69% examples, 1296401 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:39,979 : INFO : PROGRESS: at 75.51% examples, 1296175 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:40,982 : INFO : PROGRESS: at 76.37% examples, 1296382 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:41,994 : INFO : PROGRESS: at 77.22% examples, 1296265 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:42,998 : INFO : PROGRESS: at 78.10% examples, 1295476 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:44,010 : INFO : PROGRESS: at 78.87% examples, 1293468 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:45,013 : INFO : PROGRESS: at 79.67% examples, 1292078 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:46,018 : INFO : PROGRESS: at 80.35% examples, 
1289189 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:47,028 : INFO : PROGRESS: at 81.18% examples, 1288842 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:48,037 : INFO : PROGRESS: at 81.96% examples, 1288430 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:49,057 : INFO : PROGRESS: at 82.71% examples, 1288958 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:50,072 : INFO : PROGRESS: at 83.45% examples, 1289004 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:51,079 : INFO : PROGRESS: at 84.16% examples, 1289722 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:52,081 : INFO : PROGRESS: at 84.89% examples, 1289713 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:53,081 : INFO : PROGRESS: at 85.83% examples, 1289665 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:58:54,085 : INFO : PROGRESS: at 86.62% examples, 1287797 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:55,092 : INFO : PROGRESS: at 87.39% examples, 1285958 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:56,103 : INFO : PROGRESS: at 88.06% examples, 1282682 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:57,110 : INFO : PROGRESS: at 88.83% examples, 1280349 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:58:58,111 : INFO : PROGRESS: at 89.70% examples, 1280010 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:58:59,112 : INFO : PROGRESS: at 90.54% examples, 1279538 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:00,115 : INFO : PROGRESS: at 91.32% examples, 1278168 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:01,129 : INFO : PROGRESS: at 92.19% examples, 1277834 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:02,135 : INFO : PROGRESS: at 92.98% examples, 1276000 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:03,158 : INFO : PROGRESS: at 93.80% examples, 1275501 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:04,176 : INFO : PROGRESS: at 94.56% examples, 1274151 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:05,184 : INFO : PROGRESS: at 95.26% examples, 
1272229 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:06,190 : INFO : PROGRESS: at 95.96% examples, 1270602 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:07,196 : INFO : PROGRESS: at 96.63% examples, 1268278 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:08,198 : INFO : PROGRESS: at 97.46% examples, 1268035 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:09,206 : INFO : PROGRESS: at 98.34% examples, 1267809 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:10,210 : INFO : PROGRESS: at 99.15% examples, 1267124 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:11,100 : INFO : worker thread finished; awaiting finish of 9 more threads 2018-01-26 22:59:11,113 : INFO : worker thread finished; awaiting finish of 8 more threads 2018-01-26 22:59:11,128 : INFO : worker thread finished; awaiting finish of 7 more threads 2018-01-26 22:59:11,134 : INFO : worker thread finished; awaiting finish of 6 more threads 2018-01-26 22:59:11,142 : INFO : worker thread finished; awaiting finish of 5 more threads 2018-01-26 22:59:11,143 : INFO : worker thread finished; awaiting finish of 4 more threads 2018-01-26 22:59:11,145 : INFO : worker thread finished; awaiting finish of 3 more threads 2018-01-26 22:59:11,147 : INFO : worker thread finished; awaiting finish of 2 more threads 2018-01-26 22:59:11,162 : INFO : worker thread finished; awaiting finish of 1 more threads 2018-01-26 22:59:11,170 : INFO : worker thread finished; awaiting finish of 0 more threads 2018-01-26 22:59:11,171 : INFO : training on 207596775 raw words (151742453 effective words) took 119.8s, 1267147 effective words/s 2018-01-26 22:59:11,797 : INFO : training model with 10 workers on 70538 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10 2018-01-26 22:59:12,805 : INFO : PROGRESS: at 0.43% examples, 1317649 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:13,812 : INFO : PROGRESS: at 0.86% examples, 1326872 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:14,829 : INFO 
: PROGRESS: at 1.17% examples, 1266064 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:15,830 : INFO : PROGRESS: at 1.49% examples, 1226477 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:16,842 : INFO : PROGRESS: at 1.80% examples, 1201967 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:17,843 : INFO : PROGRESS: at 2.10% examples, 1192827 words/s, in_qsize 20, out_qsize 0 2018-01-26 22:59:18,850 : INFO : PROGRESS: at 2.45% examples, 1197318 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:19,864 : INFO : PROGRESS: at 2.89% examples, 1197217 words/s, in_qsize 17, out_qsize 3 2018-01-26 22:59:20,882 : INFO : PROGRESS: at 3.36% examples, 1207477 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:21,885 : INFO : PROGRESS: at 3.79% examples, 1212668 words/s, in_qsize 20, out_qsize 2 2018-01-26 22:59:22,888 : INFO : PROGRESS: at 4.24% examples, 1213387 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:23,897 : INFO : PROGRESS: at 4.67% examples, 1210431 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:24,901 : INFO : PROGRESS: at 5.07% examples, 1205675 words/s, in_qsize 20, out_qsize 0 2018-01-26 22:59:25,908 : INFO : PROGRESS: at 5.41% examples, 1192584 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:26,909 : INFO : PROGRESS: at 5.79% examples, 1186296 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:27,912 : INFO : PROGRESS: at 6.15% examples, 1176714 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:28,926 : INFO : PROGRESS: at 6.60% examples, 1181550 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:29,936 : INFO : PROGRESS: at 7.00% examples, 1181825 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:30,943 : INFO : PROGRESS: at 7.43% examples, 1186717 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:31,951 : INFO : PROGRESS: at 7.82% examples, 1188840 words/s, in_qsize 20, out_qsize 0 2018-01-26 22:59:32,955 : INFO : PROGRESS: at 8.18% examples, 1184116 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:33,969 : INFO : PROGRESS: at 8.52% 
examples, 1177660 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:34,974 : INFO : PROGRESS: at 8.91% examples, 1173614 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:35,977 : INFO : PROGRESS: at 9.28% examples, 1168527 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:36,977 : INFO : PROGRESS: at 9.60% examples, 1159969 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:37,982 : INFO : PROGRESS: at 9.93% examples, 1151239 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:38,990 : INFO : PROGRESS: at 10.28% examples, 1148083 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:39,992 : INFO : PROGRESS: at 10.65% examples, 1147883 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:41,033 : INFO : PROGRESS: at 10.99% examples, 1145730 words/s, in_qsize 14, out_qsize 5 2018-01-26 22:59:42,053 : INFO : PROGRESS: at 11.31% examples, 1146120 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:43,064 : INFO : PROGRESS: at 11.66% examples, 1147334 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:44,067 : INFO : PROGRESS: at 11.94% examples, 1145272 words/s, in_qsize 20, out_qsize 1 2018-01-26 22:59:45,076 : INFO : PROGRESS: at 12.30% examples, 1147350 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:46,079 : INFO : PROGRESS: at 12.63% examples, 1145953 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:47,084 : INFO : PROGRESS: at 12.98% examples, 1140486 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:48,093 : INFO : PROGRESS: at 13.36% examples, 1138801 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:49,096 : INFO : PROGRESS: at 13.74% examples, 1137427 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:50,098 : INFO : PROGRESS: at 14.16% examples, 1137792 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:51,105 : INFO : PROGRESS: at 14.59% examples, 1138475 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:52,124 : INFO : PROGRESS: at 14.94% examples, 1135064 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:53,127 : INFO : PROGRESS: at 15.27% 
examples, 1131159 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:54,142 : INFO : PROGRESS: at 15.67% examples, 1131675 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:55,153 : INFO : PROGRESS: at 16.09% examples, 1133308 words/s, in_qsize 19, out_qsize 0 2018-01-26 22:59:56,165 : INFO : PROGRESS: at 16.48% examples, 1131847 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:57,170 : INFO : PROGRESS: at 16.80% examples, 1127814 words/s, in_qsize 18, out_qsize 1 2018-01-26 22:59:58,182 : INFO : PROGRESS: at 17.13% examples, 1125581 words/s, in_qsize 17, out_qsize 2 2018-01-26 22:59:59,200 : INFO : PROGRESS: at 17.47% examples, 1121911 words/s, in_qsize 20, out_qsize 1 2018-01-26 23:00:00,209 : INFO : PROGRESS: at 17.74% examples, 1116672 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:01,213 : INFO : PROGRESS: at 18.10% examples, 1116419 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:02,219 : INFO : PROGRESS: at 18.48% examples, 1116982 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:03,231 : INFO : PROGRESS: at 18.90% examples, 1118141 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:00:04,234 : INFO : PROGRESS: at 19.32% examples, 1120161 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:05,236 : INFO : PROGRESS: at 19.74% examples, 1121862 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:06,239 : INFO : PROGRESS: at 20.12% examples, 1122105 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:07,248 : INFO : PROGRESS: at 20.44% examples, 1118994 words/s, in_qsize 20, out_qsize 2 2018-01-26 23:00:08,265 : INFO : PROGRESS: at 20.78% examples, 1117950 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:00:09,273 : INFO : PROGRESS: at 21.08% examples, 1116685 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:00:10,274 : INFO : PROGRESS: at 21.37% examples, 1115527 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:11,277 : INFO : PROGRESS: at 21.71% examples, 1116702 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:12,279 : INFO : PROGRESS: at 22.05% 
examples, 1119204 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:13,280 : INFO : PROGRESS: at 22.39% examples, 1120948 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:14,303 : INFO : PROGRESS: at 22.84% examples, 1123029 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:00:15,311 : INFO : PROGRESS: at 23.29% examples, 1125086 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:16,314 : INFO : PROGRESS: at 23.70% examples, 1126055 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:17,320 : INFO : PROGRESS: at 24.11% examples, 1126120 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:18,338 : INFO : PROGRESS: at 24.51% examples, 1125360 words/s, in_qsize 19, out_qsize 2 2018-01-26 23:00:19,341 : INFO : PROGRESS: at 24.89% examples, 1124872 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:20,341 : INFO : PROGRESS: at 25.27% examples, 1125014 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:21,345 : INFO : PROGRESS: at 25.70% examples, 1126757 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:22,352 : INFO : PROGRESS: at 26.13% examples, 1128409 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:23,354 : INFO : PROGRESS: at 26.56% examples, 1129534 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:24,354 : INFO : PROGRESS: at 26.98% examples, 1131275 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:25,357 : INFO : PROGRESS: at 27.43% examples, 1133695 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:26,360 : INFO : PROGRESS: at 27.83% examples, 1135633 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:00:27,366 : INFO : PROGRESS: at 28.21% examples, 1136048 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:00:28,366 : INFO : PROGRESS: at 28.60% examples, 1136512 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:29,368 : INFO : PROGRESS: at 29.00% examples, 1136449 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:30,375 : INFO : PROGRESS: at 29.40% examples, 1136749 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:31,377 : INFO : PROGRESS: at 29.84% 
examples, 1138380 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:32,377 : INFO : PROGRESS: at 30.24% examples, 1139294 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:33,380 : INFO : PROGRESS: at 30.64% examples, 1140315 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:34,382 : INFO : PROGRESS: at 31.01% examples, 1141493 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:35,389 : INFO : PROGRESS: at 31.32% examples, 1141333 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:36,390 : INFO : PROGRESS: at 31.65% examples, 1141471 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:00:37,394 : INFO : PROGRESS: at 31.97% examples, 1141766 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:38,399 : INFO : PROGRESS: at 32.31% examples, 1142421 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:39,402 : INFO : PROGRESS: at 32.65% examples, 1141909 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:00:40,407 : INFO : PROGRESS: at 33.03% examples, 1141084 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:41,410 : INFO : PROGRESS: at 33.42% examples, 1140488 words/s, in_qsize 18, out_qsize 2 2018-01-26 23:00:42,427 : INFO : PROGRESS: at 33.89% examples, 1142310 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:43,428 : INFO : PROGRESS: at 34.33% examples, 1143270 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:44,445 : INFO : PROGRESS: at 34.76% examples, 1143732 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:00:45,449 : INFO : PROGRESS: at 35.20% examples, 1145069 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:46,454 : INFO : PROGRESS: at 35.64% examples, 1146706 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:47,466 : INFO : PROGRESS: at 36.09% examples, 1148127 words/s, in_qsize 20, out_qsize 2 2018-01-26 23:00:48,475 : INFO : PROGRESS: at 36.52% examples, 1148536 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:49,482 : INFO : PROGRESS: at 36.90% examples, 1148302 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:50,499 : INFO : PROGRESS: at 37.28% 
examples, 1148313 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:51,508 : INFO : PROGRESS: at 37.67% examples, 1148458 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:52,512 : INFO : PROGRESS: at 38.08% examples, 1149735 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:53,522 : INFO : PROGRESS: at 38.50% examples, 1151058 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:00:54,523 : INFO : PROGRESS: at 38.98% examples, 1152970 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:55,527 : INFO : PROGRESS: at 39.42% examples, 1154263 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:56,529 : INFO : PROGRESS: at 39.89% examples, 1156091 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:57,531 : INFO : PROGRESS: at 40.33% examples, 1157892 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:58,548 : INFO : PROGRESS: at 40.76% examples, 1159136 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:00:59,550 : INFO : PROGRESS: at 41.08% examples, 1158934 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:00,554 : INFO : PROGRESS: at 41.39% examples, 1158611 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:01,556 : INFO : PROGRESS: at 41.72% examples, 1158350 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:02,568 : INFO : PROGRESS: at 42.04% examples, 1158772 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:03,568 : INFO : PROGRESS: at 42.40% examples, 1160031 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:04,587 : INFO : PROGRESS: at 42.87% examples, 1161316 words/s, in_qsize 15, out_qsize 4 2018-01-26 23:01:05,606 : INFO : PROGRESS: at 43.36% examples, 1162900 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:06,609 : INFO : PROGRESS: at 43.83% examples, 1164620 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:07,616 : INFO : PROGRESS: at 44.28% examples, 1165258 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:01:08,630 : INFO : PROGRESS: at 44.73% examples, 1166119 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:09,641 : INFO : PROGRESS: at 45.14% 
examples, 1166073 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:10,645 : INFO : PROGRESS: at 45.50% examples, 1165323 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:11,653 : INFO : PROGRESS: at 45.91% examples, 1165354 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:12,667 : INFO : PROGRESS: at 46.33% examples, 1165472 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:13,670 : INFO : PROGRESS: at 46.78% examples, 1166535 words/s, in_qsize 20, out_qsize 1 2018-01-26 23:01:14,670 : INFO : PROGRESS: at 47.20% examples, 1167533 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:15,672 : INFO : PROGRESS: at 47.64% examples, 1168791 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:16,675 : INFO : PROGRESS: at 48.05% examples, 1169673 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:01:17,687 : INFO : PROGRESS: at 48.48% examples, 1170664 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:18,690 : INFO : PROGRESS: at 48.94% examples, 1171702 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:19,691 : INFO : PROGRESS: at 49.34% examples, 1171726 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:20,692 : INFO : PROGRESS: at 49.74% examples, 1171411 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:21,696 : INFO : PROGRESS: at 50.10% examples, 1170740 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:22,699 : INFO : PROGRESS: at 50.50% examples, 1170978 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:23,701 : INFO : PROGRESS: at 50.90% examples, 1171722 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:24,714 : INFO : PROGRESS: at 51.23% examples, 1172289 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:25,748 : INFO : PROGRESS: at 51.58% examples, 1171854 words/s, in_qsize 13, out_qsize 6 2018-01-26 23:01:26,752 : INFO : PROGRESS: at 51.94% examples, 1173142 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:27,771 : INFO : PROGRESS: at 52.29% examples, 1173536 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:28,775 : INFO : PROGRESS: at 52.65% 
examples, 1173393 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:29,793 : INFO : PROGRESS: at 53.13% examples, 1174267 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:30,800 : INFO : PROGRESS: at 53.54% examples, 1174321 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:31,807 : INFO : PROGRESS: at 53.88% examples, 1172658 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:32,821 : INFO : PROGRESS: at 54.30% examples, 1172652 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:33,835 : INFO : PROGRESS: at 54.73% examples, 1172702 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:34,841 : INFO : PROGRESS: at 55.15% examples, 1172959 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:35,849 : INFO : PROGRESS: at 55.58% examples, 1173463 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:36,858 : INFO : PROGRESS: at 56.04% examples, 1174529 words/s, in_qsize 20, out_qsize 3 2018-01-26 23:01:37,861 : INFO : PROGRESS: at 56.51% examples, 1175460 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:38,884 : INFO : PROGRESS: at 56.95% examples, 1176513 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:39,899 : INFO : PROGRESS: at 57.39% examples, 1177163 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:40,908 : INFO : PROGRESS: at 57.77% examples, 1177254 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:41,922 : INFO : PROGRESS: at 58.11% examples, 1176260 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:42,924 : INFO : PROGRESS: at 58.50% examples, 1176452 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:43,926 : INFO : PROGRESS: at 58.91% examples, 1176143 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:44,932 : INFO : PROGRESS: at 59.35% examples, 1176844 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:45,934 : INFO : PROGRESS: at 59.79% examples, 1177426 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:46,939 : INFO : PROGRESS: at 60.23% examples, 1178454 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:47,939 : INFO : PROGRESS: at 60.66% 
examples, 1179160 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:01:48,949 : INFO : PROGRESS: at 61.05% examples, 1180077 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:49,967 : INFO : PROGRESS: at 61.42% examples, 1181033 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:01:50,978 : INFO : PROGRESS: at 61.77% examples, 1181323 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:51,978 : INFO : PROGRESS: at 62.05% examples, 1180475 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:01:52,983 : INFO : PROGRESS: at 62.36% examples, 1180019 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:53,984 : INFO : PROGRESS: at 62.68% examples, 1178734 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:54,998 : INFO : PROGRESS: at 63.10% examples, 1178486 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:56,007 : INFO : PROGRESS: at 63.53% examples, 1178707 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:01:57,021 : INFO : PROGRESS: at 63.97% examples, 1179018 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:58,052 : INFO : PROGRESS: at 64.40% examples, 1178736 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:01:59,068 : INFO : PROGRESS: at 64.83% examples, 1178982 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:00,075 : INFO : PROGRESS: at 65.27% examples, 1179572 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:01,089 : INFO : PROGRESS: at 65.66% examples, 1179291 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:02,104 : INFO : PROGRESS: at 66.03% examples, 1178513 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:02:03,108 : INFO : PROGRESS: at 66.42% examples, 1177893 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:04,120 : INFO : PROGRESS: at 66.79% examples, 1177431 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:05,138 : INFO : PROGRESS: at 67.16% examples, 1176997 words/s, in_qsize 20, out_qsize 2 2018-01-26 23:02:06,150 : INFO : PROGRESS: at 67.57% examples, 1177154 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:07,161 : INFO : PROGRESS: at 67.98% 
examples, 1177793 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:08,168 : INFO : PROGRESS: at 68.41% examples, 1178334 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:09,178 : INFO : PROGRESS: at 68.85% examples, 1178926 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:10,182 : INFO : PROGRESS: at 69.30% examples, 1179541 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:11,190 : INFO : PROGRESS: at 69.74% examples, 1180066 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:12,202 : INFO : PROGRESS: at 70.13% examples, 1179806 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:02:13,213 : INFO : PROGRESS: at 70.45% examples, 1178756 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:02:14,217 : INFO : PROGRESS: at 70.80% examples, 1178227 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:15,227 : INFO : PROGRESS: at 71.09% examples, 1177247 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:16,230 : INFO : PROGRESS: at 71.40% examples, 1177003 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:17,236 : INFO : PROGRESS: at 71.77% examples, 1177543 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:18,239 : INFO : PROGRESS: at 72.13% examples, 1178429 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:19,240 : INFO : PROGRESS: at 72.49% examples, 1179005 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:20,240 : INFO : PROGRESS: at 72.95% examples, 1179486 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:02:21,243 : INFO : PROGRESS: at 73.40% examples, 1179987 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:22,243 : INFO : PROGRESS: at 73.84% examples, 1180263 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:23,248 : INFO : PROGRESS: at 74.22% examples, 1179597 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:24,253 : INFO : PROGRESS: at 74.64% examples, 1179475 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:25,263 : INFO : PROGRESS: at 75.04% examples, 1179269 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:26,269 : INFO : PROGRESS: at 75.46% 
examples, 1179765 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:27,282 : INFO : PROGRESS: at 75.86% examples, 1179419 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:28,282 : INFO : PROGRESS: at 76.30% examples, 1179977 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:29,285 : INFO : PROGRESS: at 76.74% examples, 1180309 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:30,289 : INFO : PROGRESS: at 77.18% examples, 1181056 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:31,302 : INFO : PROGRESS: at 77.62% examples, 1181703 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:32,305 : INFO : PROGRESS: at 78.06% examples, 1182572 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:02:33,311 : INFO : PROGRESS: at 78.45% examples, 1182666 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:34,331 : INFO : PROGRESS: at 78.83% examples, 1182077 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:35,337 : INFO : PROGRESS: at 79.24% examples, 1181990 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:36,343 : INFO : PROGRESS: at 79.67% examples, 1182347 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:37,356 : INFO : PROGRESS: at 80.10% examples, 1182760 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:02:38,361 : INFO : PROGRESS: at 80.54% examples, 1183431 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:02:39,361 : INFO : PROGRESS: at 80.94% examples, 1183888 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:40,373 : INFO : PROGRESS: at 81.27% examples, 1184238 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:41,373 : INFO : PROGRESS: at 81.65% examples, 1184604 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:42,380 : INFO : PROGRESS: at 81.98% examples, 1184976 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:43,399 : INFO : PROGRESS: at 82.33% examples, 1185033 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:44,409 : INFO : PROGRESS: at 82.69% examples, 1184792 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:45,434 : INFO : PROGRESS: at 83.10% 
examples, 1184413 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:46,434 : INFO : PROGRESS: at 83.50% examples, 1184206 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:47,438 : INFO : PROGRESS: at 83.91% examples, 1184006 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:02:48,443 : INFO : PROGRESS: at 84.40% examples, 1184735 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:49,447 : INFO : PROGRESS: at 84.85% examples, 1185192 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:50,455 : INFO : PROGRESS: at 85.30% examples, 1185843 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:51,466 : INFO : PROGRESS: at 85.76% examples, 1186478 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:52,471 : INFO : PROGRESS: at 86.21% examples, 1187013 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:53,482 : INFO : PROGRESS: at 86.66% examples, 1187400 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:54,490 : INFO : PROGRESS: at 87.06% examples, 1187322 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:55,491 : INFO : PROGRESS: at 87.43% examples, 1186901 words/s, in_qsize 20, out_qsize 0 2018-01-26 23:02:56,513 : INFO : PROGRESS: at 87.80% examples, 1186694 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:57,520 : INFO : PROGRESS: at 88.20% examples, 1186794 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:02:58,520 : INFO : PROGRESS: at 88.61% examples, 1187172 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:02:59,541 : INFO : PROGRESS: at 89.06% examples, 1187372 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:03:00,549 : INFO : PROGRESS: at 89.52% examples, 1188008 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:01,549 : INFO : PROGRESS: at 89.99% examples, 1188703 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:02,550 : INFO : PROGRESS: at 90.44% examples, 1189519 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:03,566 : INFO : PROGRESS: at 90.86% examples, 1189955 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:04,568 : INFO : PROGRESS: at 91.09% 
examples, 1188430 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:05,575 : INFO : PROGRESS: at 91.32% examples, 1186974 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:06,586 : INFO : PROGRESS: at 91.64% examples, 1186622 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:07,588 : INFO : PROGRESS: at 91.89% examples, 1185512 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:08,594 : INFO : PROGRESS: at 92.21% examples, 1185221 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:09,610 : INFO : PROGRESS: at 92.57% examples, 1185514 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:10,612 : INFO : PROGRESS: at 93.04% examples, 1185890 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:11,622 : INFO : PROGRESS: at 93.50% examples, 1186348 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:12,638 : INFO : PROGRESS: at 93.96% examples, 1186756 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:13,640 : INFO : PROGRESS: at 94.41% examples, 1186900 words/s, in_qsize 20, out_qsize 3 2018-01-26 23:03:14,651 : INFO : PROGRESS: at 94.81% examples, 1186711 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:03:15,653 : INFO : PROGRESS: at 95.17% examples, 1186049 words/s, in_qsize 16, out_qsize 3 2018-01-26 23:03:16,665 : INFO : PROGRESS: at 95.56% examples, 1185895 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:17,665 : INFO : PROGRESS: at 95.96% examples, 1185773 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:18,669 : INFO : PROGRESS: at 96.42% examples, 1186196 words/s, in_qsize 17, out_qsize 2 2018-01-26 23:03:19,670 : INFO : PROGRESS: at 96.85% examples, 1186535 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:20,674 : INFO : PROGRESS: at 97.28% examples, 1186987 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:21,675 : INFO : PROGRESS: at 97.70% examples, 1187448 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:22,678 : INFO : PROGRESS: at 98.13% examples, 1188036 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:23,679 : INFO : PROGRESS: at 98.57% 
examples, 1188570 words/s, in_qsize 18, out_qsize 1 2018-01-26 23:03:24,681 : INFO : PROGRESS: at 99.02% examples, 1188914 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:25,685 : INFO : PROGRESS: at 99.41% examples, 1188673 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:26,686 : INFO : PROGRESS: at 99.80% examples, 1188384 words/s, in_qsize 19, out_qsize 0 2018-01-26 23:03:27,164 : INFO : worker thread finished; awaiting finish of 9 more threads 2018-01-26 23:03:27,165 : INFO : worker thread finished; awaiting finish of 8 more threads 2018-01-26 23:03:27,173 : INFO : worker thread finished; awaiting finish of 7 more threads 2018-01-26 23:03:27,176 : INFO : worker thread finished; awaiting finish of 6 more threads 2018-01-26 23:03:27,191 : INFO : worker thread finished; awaiting finish of 5 more threads 2018-01-26 23:03:27,198 : INFO : worker thread finished; awaiting finish of 4 more threads 2018-01-26 23:03:27,200 : INFO : worker thread finished; awaiting finish of 3 more threads 2018-01-26 23:03:27,201 : INFO : worker thread finished; awaiting finish of 2 more threads 2018-01-26 23:03:27,206 : INFO : worker thread finished; awaiting finish of 1 more threads 2018-01-26 23:03:27,210 : INFO : worker thread finished; awaiting finish of 0 more threads 2018-01-26 23:03:27,211 : INFO : training on 415193550 raw words (303489491 effective words) took 255.4s, 1188218 effective words/s
303489491

Now, let's look at some output

This first example shows a simple case of looking up words similar to the word dirty. All we need to do here is to call the most_similar function and provide the word dirty as the positive example. This returns the top 10 similar words.

w1 = "dirty"
model.wv.most_similar(positive=w1)
2018-01-26 23:03:42,416 : INFO : precomputing L2-norms of word weight vectors
[('filthy', 0.871721625328064), ('stained', 0.7922376990318298), ('unclean', 0.7915753126144409), ('dusty', 0.7772612571716309), ('smelly', 0.7618112564086914), ('grubby', 0.7483716011047363), ('dingy', 0.7330487966537476), ('gross', 0.7239381074905396), ('grimy', 0.7228356599807739), ('disgusting', 0.7213647365570068)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for polite, france and shocked.

# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)
[('courteous', 0.9174547791481018), ('friendly', 0.8309274911880493), ('cordial', 0.7990915179252625), ('professional', 0.7945970892906189), ('attentive', 0.7732747197151184), ('gracious', 0.7469891309738159)]
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)
[('canada', 0.6603403091430664), ('germany', 0.6510637998580933), ('spain', 0.6431018114089966), ('barcelona', 0.61174076795578), ('mexico', 0.6070996522903442), ('rome', 0.6065913438796997)]
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1, topn=6)
[('horrified', 0.80775386095047), ('amazed', 0.7797470092773438), ('astonished', 0.7748459577560425), ('dismayed', 0.7680633068084717), ('stunned', 0.7603034973144531), ('appalled', 0.7466776371002197)]

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered related. In the example below we are asking for items that relate to bed only:

# get everything related to stuff on the bed
w1 = ["bed", 'sheet', 'pillow']
w2 = ['couch']
model.wv.most_similar(positive=w1, negative=w2, topn=10)
[('duvet', 0.7086508274078369), ('blanket', 0.7016597390174866), ('mattress', 0.7002605199813843), ('quilt', 0.6868821978569031), ('matress', 0.6777950525283813), ('pillowcase', 0.6413239240646362), ('sheets', 0.6382123827934265), ('foam', 0.6322235465049744), ('pillows', 0.6320573687553406), ('comforter', 0.5972476601600647)]
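To make the role of positive and negative examples more concrete, here is a rough sketch of the idea: average the unit-normalized vectors of the positive words (weight +1) and the negative words (weight -1), then rank the remaining words by cosine similarity to that average. The function below and the 2-dimensional toy vectors are made up for illustration; they are not Gensim's code and not values from the trained model.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to a signed average of the query words."""
    # Positive words pull the query towards them, negative words push it away.
    signed = [(unit(vectors[w]), 1.0) for w in positive]
    signed += [(unit(vectors[w]), -1.0) for w in negative]
    dim = len(signed[0][0])
    query = unit([sum(s * v[i] for v, s in signed) / len(signed) for i in range(dim)])

    # The query words themselves are left out of the results.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: first axis ~ "bedding", second axis ~ "seating".
toy_vectors = {
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "couch": [0.5, 0.8],
    "duvet": [1.0, 0.0],
    "sofa": [0.4, 0.9],
}

result = most_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"], topn=2)
print(result)
```

With these toy vectors, duvet ranks above sofa because the negative example pushes the query away from the "seating" direction.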

Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary.

# similarity between two different words
model.wv.similarity(w1="dirty", w2="smelly")
0.76181122646029453
# similarity between two identical words
model.wv.similarity(w1="dirty", w2="dirty")
1.0000000000000002
# similarity between two unrelated words
model.wv.similarity(w1="dirty", w2="clean")
0.25355593501920781

Under the hood, the three snippets above compute the cosine similarity between the word vectors of the two specified words. From the scores, it makes sense that dirty is highly similar to smelly, while dirty and clean score much lower. The similarity between two identical words is 1.0 (the 1.0000000000000002 above is just floating-point rounding), which is the maximum possible score: cosine similarity always falls in the range [-1.0, 1.0]. You can read more about cosine similarity scoring here.
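For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch; the short example vectors are made up and are not taken from the trained model:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 2.0], [1.0, 2.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # nothing in common
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction

print(same, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1.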

Find the odd one out

You can even use Word2Vec to find odd items given a list of items.

# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat", "dog", "france"])
'france'
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed", "pillow", "duvet", "shower"])
'shower'
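One way to think about the odd-one-out lookup: average the unit-normalized vectors of all the words in the list, then pick the word that is least similar to that average. The sketch below follows that idea with made-up 2-dimensional toy vectors; it is an illustration under those assumptions, not Gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # Lowest cosine similarity to the group mean = the odd one out.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: first axis ~ "animal", second axis ~ "place".
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}

odd = doesnt_match(toy_vectors, ["cat", "dog", "france"])
print(odd)
```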

Understanding some of the parameters

To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)

size

The size of the dense vector used to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me. (In Gensim 4.0 and later, this parameter is named vector_size.)

window

The maximum distance between the target word and a neighboring word. Neighbors that lie farther than the window width to the left or right of the target are not considered related to it. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
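To see what the window does, here is a simplified sketch that lists the (target, neighbor) pairs a fixed window would produce for one tokenized review. The function name context_pairs and the example sentence are made up for illustration, and this is not Gensim's actual training loop.

```python
def context_pairs(tokens, window):
    """List (target, neighbor) pairs for neighbors within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "room", "was", "very", "dirty"]

narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=2)

print(len(narrow), len(wide))
print(("was", "dirty") in narrow, ("was", "dirty") in wide)
```

With window=1, "was" is only paired with "room" and "very"; with window=2 it is also paired with "the" and "dirty".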

min_count

Minimum frequency count of words. The model ignores words whose total count falls below min_count. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
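The effect of min_count can be pictured as a frequency filter applied to the vocabulary before training. Here is a small sketch of that filtering step, using a made-up helper (filter_vocab) and three tiny hypothetical reviews:

```python
from collections import Counter


def filter_vocab(documents, min_count):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized reviews.
documents = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["clean", "sheets"],
]

vocab = filter_vocab(documents, min_count=2)
print(vocab)
```

Words that appear only once (friendly, dirty, rude, sheets) are dropped, which is why a very small dataset is sensitive to this setting.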

workers

The number of worker threads used for training behind the scenes.

When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that, and you end up with a lexicon not just for sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.
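As a rough sketch of the tag idea, the snippet below turns per-question tag lists into "sentences" and hands them to Word2Vec. The tag data and the helper names (tags_to_sentences, train_tag_model) are made up for illustration, and a handful of questions is far too little to learn anything useful; it only shows the shape of the input.

```python
def tags_to_sentences(tagged_questions):
    """Treat the tags of each question as one 'sentence' of co-occurring tokens."""
    sentences = []
    for tags in tagged_questions:
        unique_tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
        if len(unique_tags) > 1:  # a single tag has no neighbors to learn from
            sentences.append(unique_tags)
    return sentences


def train_tag_model(sentences):
    """Train on tag 'sentences'; requires gensim (imported here, only when called)."""
    import gensim

    # min_count=1 only because this toy dataset is tiny (an assumption for the demo).
    return gensim.models.Word2Vec(sentences, window=10, min_count=1, workers=2)


# Hypothetical tag lists, one per question.
tagged_questions = [
    ["python", "pandas", "dataframe"],
    ["python", "numpy", "pandas"],
    ["java"],
    ["javascript", "react", "react"],
]

sentences = tags_to_sentences(tagged_questions)
print(sentences)

# With enough real data, you could then train and query the model:
# model = train_tag_model(sentences)
# model.wv.most_similar(positive="pandas")
```

Because tag order carries no meaning, a window at least as large as the longest tag list lets every tag see all of its co-occurring tags.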