
Word2vec Guessing Game

Welcome to the guessing game assignment. Think of a target word and give "clue" words below that describe the target. The word2vec model will give the 10 words closest to the semantic mean of the given clues. See if you can make the computer guess your chosen target word.

Things the model can do for you

If you separate the words with a space, a comma, or a + sign, the model takes the vector sum of the semantic representations of the given words. It then normalizes the result to unit length, so the sum vector points in the same direction as the mean of the clue vectors.
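The summation and normalization steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors in `TOY_VECTORS` are invented, and the names `normalize` and `clue_vector` are hypothetical helpers, not part of the page's actual implementation.

```python
import math
import re

# Toy 3-dimensional vectors, invented for illustration only.
# The real model uses 300-dimensional vectors learned from text.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "pet": [0.7, 0.2, 0.4],
}


def normalize(vec):
    """Scale a vector to unit (Euclidean) length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def clue_vector(query, vectors):
    """Split the query on spaces, commas, or + signs; return the normalized sum."""
    words = [w for w in re.split(r"[\s,+]+", query) if w]
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        total = [t + x for t, x in zip(total, vectors[word])]
    return normalize(total)


summed = clue_vector("cat, dog + pet", TOY_VECTORS)

# The mean differs from the sum only by a positive scale factor,
# so after normalization both give the same unit vector.
mean = [sum(col) / len(TOY_VECTORS) for col in zip(*TOY_VECTORS.values())]
same = all(math.isclose(a, b) for a, b in zip(summed, normalize(mean)))
print(same)  # True
```

Because only the direction of the vector survives normalization, adding more clues does not make the query vector "longer"; it only shifts where it points.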

If you separate words with a - sign, the model takes the vector difference of their semantic representations. On its own, this difference vector is not very informative. Combined with summation, however, it lets you start at the representation of one word and move in a chosen direction. For example, “king - man” computes the difference vector (i.e. the direction) from “man” to “king”, and “king - man + woman” adds this difference vector to the semantic representation of “woman”. In essence, this asks: “a man is to a king as a woman is to a …?”
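The "king - man + woman" arithmetic can be illustrated with a minimal sketch. The two-dimensional vectors below are hand-picked so that the analogy works out exactly; real word2vec vectors only approximate this behaviour. The function name `analogy` is hypothetical, and leaving the input words out of the candidate list is a common convention assumed here, not something this page states about its own model.

```python
import math

# Hand-picked toy vectors: axis 0 loosely stands for "royalty",
# axis 1 for "gender". These values are invented for illustration.
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the positive words, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Assumption: the query words themselves are not offered as answers.
    used = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in used]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)


ranking = analogy(["king", "woman"], ["man"], TOY_VECTORS)
print(ranking[0])  # queen
```

Cosine similarity ignores vector length, so ranking by it gives the same order whether or not the target vector is first normalized to length 1.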

Where does the data come from?

The corpus that was used to create the model is a selection of Google News documents (around 100 billion words), kindly provided by Google. The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The model is available here: GoogleNews-vectors-negative300.bin.gz.
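For readers who want to inspect the model file directly, here is a minimal sketch of a reader for the binary layout written by the original word2vec C tool: a text header with the vocabulary size and dimensionality, followed by one entry per word (the word, a space, then the raw 32-bit floats). The function name `read_word2vec_binary` and its `limit` parameter are hypothetical, and little-endian byte order is an assumption. The demonstration parses a tiny in-memory sample rather than the real file.

```python
import io
import struct


def read_word2vec_binary(stream, limit=None):
    """Read word vectors from a binary stream in the word2vec C tool's layout.

    Assumes little-endian 32-bit floats. `limit` caps how many entries are read.
    """
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(n) for n in header.split())
    count = vocab_size if limit is None else min(limit, vocab_size)
    vectors = {}
    for _ in range(count):
        # Each word is stored as raw bytes terminated by a single space.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("stream ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # entries may be separated by a newline
                chars.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) < 4 * dim:
            raise EOFError("stream ended in the middle of a vector")
        word = chars.decode("utf-8", errors="replace")
        vectors[word] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


# Tiny synthetic sample: 2 words, 3 dimensions each.
sample = (
    b"2 3\n"
    + b"cat " + struct.pack("<3f", 1.0, 0.0, 0.5) + b"\n"
    + b"dog " + struct.pack("<3f", 0.0, 1.0, 0.25) + b"\n"
)
loaded = read_word2vec_binary(io.BytesIO(sample))
print(loaded["dog"])  # [0.0, 1.0, 0.25]
```

To try this on the downloaded model, the stream could be opened with `gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb")` from the standard library. Holding 3 million 300-dimensional vectors as Python lists takes a great deal of memory, so a small `limit` is advisable for experiments.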

More information about the word2vec algorithm

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
