According to https://code.google.com/archive/p/word2vec/:
It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.
So we can try from the supplied demo script:
+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin
Word: paris Position in vocabulary: 198365
Word: france Position in vocabulary: 225534
Word: berlin Position in vocabulary: 380477
Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505
Please note that paris france berlin
is the input hint the demo suggest. The problem is that I'm unable to reproduce this behavior if I open the same word vectors in Gensim
and try to compute the vectors myself. For example:
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]
So, what is the word analogy actually doing? How should I reproduce it?