How to use the infer_vector in gensim.doc2vec?
import numpy as np
from numpy import linalg
import gensim

def cosine(vector1, vector2):
    # cosine similarity of two vectors
    return np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))

model = gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game')
string = '民生 为了 父亲 我 要 坚强 地 ...'
tokens = string.split(' ')
vector1 = model.infer_vector(doc_words=tokens, alpha=0.1, min_alpha=0.0001, steps=5)
vector2 = model.docvecs.doctag_syn0[0]
print(cosine(vector2, vector1))

-0.0232586

I trained a doc2vec model on some training data. Then I used infer_vector() to generate a vector for a document that was itself in the training data, but the two vectors are very different: the cosine similarity between vector2 (the vector stored in the doc2vec model) and vector1 (the one generated by infer_vector()) is only -0.0232586. That doesn't seem reasonable.

Edit: I found my error. I should have used string = u'民生 为了 父亲 我 要 坚强 地 ...' (a unicode literal) instead of string = '民生 为了 父亲 我 要 坚强 地 ...'. With that fix, the cosine similarity rises to 0.889342.

Subirrigate answered 9/7, 2017 at 5:19 Comment(3)
What cosine are you using? There's a cosine function in gensim.Pralltriller
@cᴏʟᴅsᴘᴇᴇᴅ The cosine function is: def cosine(vector1, vector2): return np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))Subirrigate
I found my error: I should have used string=u'民生 为了 父亲 我 要 坚强 地 ...' (a unicode literal) instead of string='民生 为了 父亲 我 要 坚强 地 ...'. With that fix, the cosine similarity rises to 0.889342.Subirrigate

As you've noticed, infer_vector() requires its doc_words argument to be a list of tokens – matching the same kind of tokenization that was used in training the model. (Passing it a string causes it to just see each individual character as an item in a tokenized list, and even if a few of the tokens are known vocabulary tokens – as with 'a' and 'I' in English – you're unlikely to get good results.)
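To see that difference concretely, here is a minimal sketch (pure Python, using a short text in the spirit of the question's example) of what infer_vector() would receive in each case: iterating a token list yields words, while iterating a raw string yields single characters.

```python
# Why a raw string misbehaves: infer_vector() iterates over doc_words,
# so a string is seen as a sequence of single characters, not words.
text = u'民生 为了 父亲 我 要 坚强 地'

tokens = text.split(' ')   # intended input: a list of word tokens
chars = list(text)         # what iterating the raw string yields

print(tokens[:3])  # ['民生', '为了', '父亲']
print(chars[:3])   # ['民', '生', ' ']
```

Only the first form gives the model tokens it saw during training.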

Additionally, the default parameters of infer_vector() may be far from optimal for many models. In particular, a larger steps (at least as large as the number of model training iterations, but perhaps even many times larger) is often helpful. Also, a smaller starting alpha, perhaps just the common default for bulk training of 0.025, may give better results.

Your test of whether inference gets a vector close to the same vector from bulk-training is a reasonable sanity-check, on both your inference parameters and the earlier training – is the model as a whole learning generalizable patterns in the data? But because most modes of Doc2Vec inherently use randomness, or (during bulk training) can be affected by the randomness introduced by multiple-thread scheduling jitter, you shouldn't expect identical results. They'll just get generally closer, the more training iterations/steps you do.
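The jitter-versus-iterations point can be illustrated without gensim at all. The sketch below uses a toy noisy "inference" function (a stand-in I made up for model.infer_vector(), not the real API) to show that averaging several stochastic inference runs pulls the result toward a stable central vector:

```python
import numpy as np

rng = np.random.default_rng(0)
true_vec = np.ones(50)  # stand-in for the 'ideal' doc-vector

def noisy_infer():
    # Toy stand-in for model.infer_vector(tokens): the real call is
    # stochastic, so repeated runs scatter around a central point.
    return true_vec + rng.normal(scale=0.5, size=50)

single = noisy_infer()
averaged = np.mean([noisy_infer() for _ in range(20)], axis=0)

err_single = np.linalg.norm(single - true_vec)
err_avg = np.linalg.norm(averaged - true_vec)
print(err_avg < err_single)  # averaging damps the inference jitter
```

In real use, raising steps on infer_vector() itself (as suggested above) plays a similar stabilizing role to this averaging.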

Finally, note that the most_similar() method on Doc2Vec's docvecs component can also take a raw vector, to give back a list of most-similar already-known vectors. So you can try the following...

ivec = model.infer_vector(doc_words=tokens_list, steps=20, alpha=0.025)
print(model.docvecs.most_similar(positive=[ivec], topn=10))

...and get a ranked list of the top-10 most-similar (doctag, similarity_score) pairs.

Lobscouse answered 9/7, 2017 at 16:15 Comment(3)
Your answer is perfect and detailed. I will think carefully about your answer. Thanks for everything you do. @LobscouseSubirrigate
I use the trained vectors (retrieved from the well-trained doc2vec model) to train a neural network, so I was hoping for identical results from infer_vector() to feed into that network. @LobscouseSubirrigate
It shouldn't be strictly necessary for the outputs of Doc2Vec to be identical for identical inference texts, to train a downstream neural-net. (In fact, it might even be possible that the 'jitter' in alternate Doc2Vec-vectorizations of a single text could help communicate the imprecise-ranges of a text's meaning to a downstream NN, improving the total model. But not sure.)Lobscouse
