What is the operation behind the word analogy in Word2vec?
According to https://code.google.com/archive/p/word2vec/:

It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.

So we can try from the supplied demo script:

+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin

Word: paris  Position in vocabulary: 198365

Word: france  Position in vocabulary: 225534

Word: berlin  Position in vocabulary: 380477

                                              Word              Distance
------------------------------------------------------------------------
                                           germany      0.509434
                                          european      0.486505

Please note that paris france berlin is the input suggested by the demo's hint. The problem is that I'm unable to reproduce this behavior when I open the same word vectors in Gensim and try to compute the vector myself. For example:

>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]

So, what is the word analogy actually doing? How should I reproduce it?

Azaria answered 17/9, 2018 at 9:30 Comment(0)

You should be clear about exactly which word-vector set you're using: different sets vary in how well they perform on analogy tasks. (Vectors trained on the tiny text8 dataset may be fairly weak; the big GoogleNews set Google released would probably do well, at least under certain conditions, such as discarding low-frequency words.)

You're doing the wrong arithmetic for the analogy you're trying to solve. For an analogy "A is to B as C is to ?", often written as:

A : B :: C : _?_

You begin with 'B', subtract 'A', then add 'C'. So the example:

France : Paris :: Italy : _?_

...gives the formula in your excerpted text:

wv('Paris') - wv('France') + wv('Italy') = target_coordinates  # close-to wv('Rome')

And to solve instead:

Paris : France :: Berlin : _?_

You would try:

wv('France') - wv('Paris') + wv('Berlin') = target_coordinates

...then see what's closest to target_coordinates. (Note the difference in operation ordering from your attempt.)

You can think of it as:

  1. start at a country-vector ('France')
  2. subtract the (country&capital)-vector ('Paris'). This leaves you with an interim vector that's, sort-of, "zero" country-ness, and "negative" capital-ness.
  3. add another (country&capital)-vector ('Berlin'). This leaves you with a result vector that's, again sort-of, "one" country-ness, and "zero" capital-ness.
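The steps above can be sketched with toy 2-d vectors. These are hypothetical coordinates chosen for illustration, not real word2vec output (real vectors have hundreds of dimensions and no labeled axes), with axis 0 standing for "country-ness" and axis 1 for "capital-ness":

```python
import numpy as np

# Hypothetical 2-d vectors: axis 0 = "country-ness", axis 1 = "capital-ness".
france = np.array([1.0, 0.0])   # a country
paris  = np.array([1.0, 1.0])   # that country's capital
berlin = np.array([0.9, 1.0])   # another country's capital

# Step 1: start at France; step 2: subtract Paris (leaving "negative"
# capital-ness); step 3: add Berlin (restoring "one" country-ness).
target = france - paris + berlin

print(target)  # → [0.9 0. ]  — "pure country", near where Germany would sit
```

Whatever real vocabulary vector lies closest to `target` (by cosine similarity) is the analogy's answer.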

Note also that gensim's most_similar() takes multiple positive and negative word-examples, and does the arithmetic for you. So you can just do:

sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'])
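One more wrinkle when reproducing the demo by hand: as I understand gensim's implementation, most_similar() unit-normalizes each input word's vector before combining them, and it omits the input words themselves from the results, which is why the manual arithmetic in the question returned 'berlin' and 'paris' at the top. A rough sketch of that logic, over a toy vocabulary of hypothetical vectors (not real embeddings):

```python
import numpy as np

# Toy vocabulary with hypothetical 2-d vectors (not real word2vec output).
vocab = {
    'paris':   np.array([1.0, 1.0]),
    'france':  np.array([1.0, 0.0]),
    'berlin':  np.array([0.9, 1.0]),
    'germany': np.array([0.9, 0.1]),
}

def most_similar_sketch(positive, negative, vocab, topn=3):
    """Rough re-implementation of the most_similar() logic:
    unit-normalize each input vector, combine with +/- signs,
    rank the remaining vocabulary by cosine similarity, and
    skip the input words themselves."""
    unit = lambda v: v / np.linalg.norm(v)
    target = (sum(unit(vocab[w]) for w in positive)
              - sum(unit(vocab[w]) for w in negative))
    target = unit(target)
    skip = set(positive) | set(negative)
    sims = [(w, float(np.dot(unit(v), target)))
            for w, v in vocab.items() if w not in skip]
    return sorted(sims, key=lambda x: -x[1])[:topn]

print(most_similar_sketch(['france', 'berlin'], ['paris'], vocab))
# top result: 'germany' — the input words never appear in the output
```

The key difference from the question's attempt: without the skip step, the nearest neighbors of the combined vector are usually the input words themselves.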
Ryannryazan answered 17/9, 2018 at 20:9 Comment(0)

It should be just element-wise addition and subtraction of vectors, with cosine distance used to find the most similar ones. Note, however, that if you use the original word2vec embeddings, there is a difference between "paris" and "Paris" (the strings were not lowercased or lemmatised).

You may also try:

v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']

or

v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']

because you should combine matching concepts (e.g. city - country + country -> another city)
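The "cosine distance" mentioned above is just cosine similarity between the computed vector and each vocabulary vector. A minimal version, using made-up toy vectors for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    # Length is ignored, which is why it suits word-vector comparisons.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0])
v2 = np.array([2.0, 0.0])   # same direction, different length
print(cosine_similarity(v1, v2))  # → 1.0
```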

Charmainecharmane answered 17/9, 2018 at 13:24 Comment(0)