Doc2Vec.infer_vector keeps giving different results every time on the same trained model
I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:

    inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims]
    print(rank)

Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark and the results just do not seem to match.

Is Doc2Vec not suitable for querying and information retrieval? Or are there bugs?

Epexegesis answered 21/1, 2018 at 0:31 Comment(1)
The link to the tutorial was removed.Arria

Look into the code: infer_vector uses parts of the algorithm that are non-deterministic. The initialization of the inferred vector is deterministic (see the code of seeded_vector), but further steps, i.e. random sampling of words and negative sampling (updating only a sample of word vectors per iteration), can cause non-deterministic output (thanks @gojomo).

    def seeded_vector(self, seed_string):
        """Create one 'random' vector (but deterministic by seed_string)"""
        # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
        once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
        return (once.rand(self.vector_size) - 0.5) / self.vector_size
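
To make the deterministic-initialization point concrete, here is a self-contained sketch of the same idea in plain NumPy. `hashfxn` is a stand-in for the hash function gensim lets you supply, and `stable_hash` is a hypothetical replacement for Python's built-in `hash()`, whose string hashing varies per launch in Python 3 unless PYTHONHASHSEED is fixed:

```python
import numpy as np

def seeded_vector(seed_string, vector_size=20, hashfxn=hash):
    """Create one 'random' vector, deterministic for a given seed_string."""
    once = np.random.RandomState(hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

def stable_hash(s):
    """Hypothetical string hash that does not vary across Python launches."""
    return sum(ord(c) * 31 ** i for i, c in enumerate(s))

# Same seed string -> identical starting vector, every time
v1 = seeded_vector('only you can prevent forest fires', hashfxn=stable_hash)
v2 = seeded_vector('only you can prevent forest fires', hashfxn=stable_hash)
assert np.array_equal(v1, v2)
```

So the starting point of inference is reproducible; the run-to-run variation comes from the later, sampling-based update steps.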
Senn answered 21/1, 2018 at 18:13 Comment(9)
But then this basically means that on every call to infer_vector, I get a different result for the same query. It is like searching Google for something and getting a completely different result every time.Epexegesis
Yes, but they shouldn't be too different. If they are, your model may be underpowered/overfit, or your inference not using appropriate steps and starting alpha. It's common for many-more steps and/or a lower starting alpha more like the default training alpha of 0.025 to work better for inference than the defaults, especially on short docs.Kingsley
Note that it's not random start-vector initialization that causes varied results per run – the seeded_vector() function ensures identical starting vectors for the same seed_strings, and Doc2Vec.infer_vector() uses the tokens you're inferring as the seed_string in a deterministic way. Rather, it's other steps of the algorithm that inherently use random sampling of words, window sizes, or negative examples. There are ways to force determinism in those steps, too, but that just hides the 'jitter' of the algorithm. It's better to ensure subsequent runs are 'similar enough' than 'artificially identical'.Kingsley
So increasing the steps to 500 and significantly decreasing alpha and min_alpha led to convergence on one consistent result. However, the result was still way off and did not look similar at all. The library publishers do not provide any recommendation on when to use this. It is probably not suitable for smaller text documents or smaller sets of documents.Epexegesis
@Kingsley thanks for pointing out my too-hasty conclusion. I looked through your posts on GitHub related to infer_vector (github.com/RaRe-Technologies/gensim/issues/447) and corrected my answer.Senn
@Kingsley thank you, it works. I have very small documents and get very unstable results with the defaults. However, with steps=200 and alpha=0.00025 I get almost the same results every time.Caryopsis
@Caryopsis An inference starting alpha=0.00025 is 100x smaller than a default/typical value - more like a typical final tiny value. If it works, great, but you might get as-good-or-better results with a somewhat-higher starting alpha, and fewer (and thus faster) steps. (Note also there have been some inference fixes in recent gensim releases – especially v3.5.0 of July 2018 – so be sure to upgrade if you're using anything older, and re-evaluate what values work best for your needs after upgrading.)Kingsley
That makes sense, thanks. In practice, I tried larger alphas and fewer steps, but was still observing some instability. I'm working with gensim 3.2.0 though. I'll upgrade and let you know. Thanks for the tip.Caryopsis
same observations with the latest versionCaryopsis

Set negative=0 to disable negative sampling, the randomized part of inference (note this trades training quality for determinism, since with the default hs=0 that part of the objective is simply switched off):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each "document" is just a list of characters
    documents = [list('asdf'), list('asfasf')]
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]

    # negative=0 disables negative sampling and its randomization
    model = Doc2Vec(documents, vector_size=20, window=5, min_count=1,
                    negative=0, workers=6, epochs=10)

    a = list('test sample')
    b = list('testtesttest')
    for s in (a, b):
        v1 = model.infer_vector(s)
        for i in range(100):
            v2 = model.infer_vector(s)
            # Every inference now returns exactly the same vector
            assert np.all(v1 == v2), "Failed on %s" % ''.join(s)
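
A related way to tame the jitter without disabling negative sampling (keeping runs 'similar enough', as discussed in the comments on the accepted answer) is to call infer_vector several times and average the results. A minimal sketch of why averaging helps, using random noise as a stand-in for inference jitter (plain NumPy, no gensim required):

```python
import numpy as np

rng = np.random.default_rng(0)
true_vec = rng.standard_normal(20)          # the "ideal" doc vector

# Simulate 50 jittery inference runs for the same document
runs = true_vec + 0.5 * rng.standard_normal((50, 20))

single = runs[0]                 # one infer_vector() call
averaged = runs.mean(axis=0)     # average of many calls

# Averaging shrinks the noise roughly by sqrt(n), landing much
# closer to the underlying vector than any single run
assert np.linalg.norm(averaged - true_vec) < np.linalg.norm(single - true_vec)
```

The averaged vector is still not bit-for-bit identical across runs, but its distance to a fixed reference is far more stable than a single inference.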
Twist answered 21/11, 2019 at 4:10 Comment(1)
If I save the model and load it again in other code, it gives a different result. Why is that happening?Towrope

© 2022 - 2024 — McMap. All rights reserved.