Doc2Vec.infer_vector keeps giving different results every time on the same trained model
I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:

    inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims]
    print(rank)

Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark and the results just do not seem to match.

Is Doc2Vec not suitable for querying and information retrieval? Or are there bugs?

Epexegesis answered 21/1, 2018 at 0:31 Comment(1)
The link to the tutorial was removed.Arria

Look into the code: infer_vector uses parts of the algorithm that are non-deterministic. The initialization of the inferred vector is deterministic (see the code of seeded_vector), but further steps, i.e. random sampling of words and negative sampling (updating only a sample of word vectors per iteration), can cause non-deterministic output (thanks @gojomo).

    def seeded_vector(self, seed_string):
        """Create one 'random' vector (but deterministic by seed_string)"""
        # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
        once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
        return (once.rand(self.vector_size) - 0.5) / self.vector_size
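
To make the deterministic-initialization point concrete, here is a self-contained sketch of the same idea in plain NumPy. `hashfxn` is a stand-in for the hash function gensim lets you supply, and `stable_hash` is a hypothetical replacement for Python's built-in `hash()`, whose string hashing varies per launch in Python 3 unless PYTHONHASHSEED is fixed:

```python
import numpy as np

def seeded_vector(seed_string, vector_size=20, hashfxn=hash):
    """Create one 'random' vector, deterministic for a given seed_string."""
    once = np.random.RandomState(hashfxn(seed_string) & 0xffffffff)
    return (once.rand(vector_size) - 0.5) / vector_size

def stable_hash(s):
    """Hypothetical string hash that does not vary across Python launches."""
    return sum(ord(c) * 31 ** i for i, c in enumerate(s))

# Same seed string -> identical starting vector, every time
v1 = seeded_vector('only you can prevent forest fires', hashfxn=stable_hash)
v2 = seeded_vector('only you can prevent forest fires', hashfxn=stable_hash)
assert np.array_equal(v1, v2)
```

So the starting point of inference is reproducible; the run-to-run variation comes from the later, sampling-based update steps.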
Senn answered 21/1, 2018 at 18:13 Comment(9)
But then this basically means that on every call to infer_vector, I get a different result for the same query. It is like searching Google for something and getting a completely different result every time.Epexegesis
Yes, but they shouldn't be too different. If they are, your model may be underpowered/overfit, or your inference not using appropriate steps and starting alpha. It's common for many-more steps and/or a lower starting alpha more like the default training alpha of 0.025 to work better for inference than the defaults, especially on short docs.Kingsley
Note that it's not random start-vector initialization that causes varied results per run – the seeded_vector() function ensures identical starting vectors for the same seed_strings, and Doc2Vec.infer_vector() uses the tokens you're inferring as the seed_string in a deterministic way. Rather, it's other steps of the algorithm that inherently use random sampling of words, window sizes, or negative examples. There are ways to force determinism in those steps, too, but that just hides the 'jitter' of the algorithm. It's better to ensure subsequent runs are 'similar enough' than 'artificially identical'.Kingsley
So increasing the steps to 500 and significantly decreasing alpha and min_alpha led to convergence on one consistent result. However, the result was still way off and did not look similar at all. The library publishers do not provide any recommendation on when to use this. It is probably not suitable for smaller text documents or smaller sets of documents.Epexegesis
@Kingsley thanks for pointing out my too-hasty conclusion. I looked through your posts on GitHub related to infer_vector (github.com/RaRe-Technologies/gensim/issues/447) and corrected my answer.Senn
@Kingsley thank you, it works. I have very small documents and get very unstable results with the defaults. However, with steps=200 and alpha=0.00025 I get almost the same results every time.Caryopsis
@Caryopsis An inference starting alpha=0.00025 is 100x smaller than a default/typical value - more like a typical final tiny value. If it works, great, but you might get as-good-or-better results with a somewhat-higher starting alpha, and fewer (and thus faster) steps. (Note also there have been some inference fixes in recent gensim releases – especially v3.5.0 of July 2018 – so be sure to upgrade if you're using anything older, and re-evaluate what values work best for your needs after upgrading.)Kingsley
That makes sense, thanks. In practice, I tried larger alphas and fewer steps, but was still observing some instability. I'm working with gensim 3.2.0 though. I'll upgrade and let you know. Thanks for the tip.Caryopsis
same observations with the latest versionCaryopsis

Set negative=0 to disable negative sampling, the randomized part of inference (note this trades training quality for determinism, since with the default hs=0 that part of the objective is simply switched off):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each "document" is just a list of characters
    documents = [list('asdf'), list('asfasf')]
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]

    # negative=0 disables negative sampling and its randomization
    model = Doc2Vec(documents, vector_size=20, window=5, min_count=1,
                    negative=0, workers=6, epochs=10)

    a = list('test sample')
    b = list('testtesttest')
    for s in (a, b):
        v1 = model.infer_vector(s)
        for i in range(100):
            v2 = model.infer_vector(s)
            # Every inference now returns exactly the same vector
            assert np.all(v1 == v2), "Failed on %s" % ''.join(s)
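
A related way to tame the jitter without disabling negative sampling (keeping runs 'similar enough', as discussed in the comments on the accepted answer) is to call infer_vector several times and average the results. A minimal sketch of why averaging helps, using random noise as a stand-in for inference jitter (plain NumPy, no gensim required):

```python
import numpy as np

rng = np.random.default_rng(0)
true_vec = rng.standard_normal(20)          # the "ideal" doc vector

# Simulate 50 jittery inference runs for the same document
runs = true_vec + 0.5 * rng.standard_normal((50, 20))

single = runs[0]                 # one infer_vector() call
averaged = runs.mean(axis=0)     # average of many calls

# Averaging shrinks the noise roughly by sqrt(n), landing much
# closer to the underlying vector than any single run
assert np.linalg.norm(averaged - true_vec) < np.linalg.norm(single - true_vec)
```

The averaged vector is still not bit-for-bit identical across runs, but its distance to a fixed reference is far more stable than a single inference.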
Twist answered 21/11, 2019 at 4:10 Comment(1)
If I save the model and load it again in other code, it gives a different result. Why is that happening?Towrope

© 2022 - 2024 — McMap. All rights reserved.