Which method, dm or dbow, works well for document similarity using Doc2Vec?

I'm trying to find the similarity between two documents. I'm using Gensim's Doc2Vec to train on around 10k documents. There are around 10 string tags; each tag is a unique word and covers some subset of the documents. The model is trained using the distributed memory (PV-DM) method:

Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)

I've tried both dm and dbow. dm gives better results (similarity scores) than dbow. I understand the concepts of dm vs dbow, but I don't know which method is better for measuring similarity between two documents.

First question: which method performs best for similarity?

model.wv.n_similarity(<words_1>, <words_2>) gives a similarity score using word vectors.

model.docvecs.similarity_unseen_docs(model, doc1, doc2) gives a similarity score using doc vectors, where doc1 and doc2 are not tags or indexes of doctags; each is a sentence-like list of 10-20 words.

wv.n_similarity and docvecs.similarity_unseen_docs give different similarity scores for the same documents.

docvecs.similarity_unseen_docs gives slightly better results than wv.n_similarity, though wv.n_similarity sometimes also gives good results.

Second question: what is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen documents? (It might be a silly question.)

I ask because I thought docvecs.similarity_unseen_docs computed similarity over tags, not over the actual words belonging to those tags. Please correct me if I'm wrong.

Also, how can I convert a cosine similarity score into a probability?

Thanks.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Training of the model
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Finding similarity scores
model.wv.n_similarity(<doc_words1>, <doc_words2>)

model.random.seed(25)  # reset seed before inference, for repeatable inferred vectors
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
Schargel answered 27/5, 2019 at 9:34 Comment(0)

Both PV-DM mode (dm=1, the default) and PV-DBOW mode (dm=0) can work well. Which is better will depend on your data and goals. Once you have a robust way to quantitatively score the quality of a model's results for your project goals (which you'll want anyway, in order to tune all of the model's meta-parameters, including DM/DBOW mode), you can and should try both.
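
For example, a minimal sketch of training one model in each mode on the same corpus, assuming tagged_data is built as in your posted code (the parameter values are just illustrative, not recommendations):

from gensim.models.doc2vec import Doc2Vec

common = dict(vector_size=100, window=10, min_count=2, epochs=50, seed=25, workers=4)

# PV-DM (dm=1): doc-vectors trained together with context word-vectors
dm_model = Doc2Vec(dm=1, dm_mean=1, **common)

# PV-DBOW (dm=0): add dbow_words=1 if you also want usable word-vectors (see note below)
dbow_model = Doc2Vec(dm=0, dbow_words=1, **common)

for m in (dm_model, dbow_model):
    m.build_vocab(tagged_data)
    m.train(tagged_data, total_examples=m.corpus_count, epochs=m.epochs)

# then evaluate each model with your own task-specific scoring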

PV-DBOW trains fast, and often works very well on shortish docs (a few dozen words each). Note, though, that this mode doesn't train usable word-vectors unless you also add the dbow_words=1 option, which will slow training.

Using model.wv.n_similarity() relies on word-vectors only. It averages each set of word-vectors, then reports the cosine-similarity between those two averages. (So it will only be sensible in PV-DM mode, or in PV-DBOW mode with dbow_words=1 activated.)
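
As a rough sketch of what that call computes (not the exact library code; here out-of-vocabulary words are simply skipped, whereas the library call expects all words to be known):

import numpy as np

def avg_wordvec_similarity(model, words_1, words_2):
    # average the raw word-vectors of each token list...
    v1 = np.mean([model.wv[w] for w in words_1 if w in model.wv], axis=0)
    v2 = np.mean([model.wv[w] for w in words_2 if w in model.wv], axis=0)
    # ...then report the cosine-similarity between the two averages
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# should closely match model.wv.n_similarity(words_1, words_2)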

Using model.docvecs.similarity_unseen_docs() uses infer_vector() to treat each of the supplied docs as a new text, for which a true Doc2Vec doc-vector (not a mere average-of-word-vectors) is calculated. (This method operates on lists-of-words, not lists-of-tags.)
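
A sketch of the equivalent done by hand, assuming doc_words_1 and doc_words_2 are lists of tokens (the same kind of input as your <doc_words1>/<doc_words2> placeholders):

import numpy as np

# infer a fresh doc-vector for each text; note inference has some randomness,
# so repeated calls will give slightly different vectors
vec_1 = model.infer_vector(doc_words_1)
vec_2 = model.infer_vector(doc_words_2)

# cosine-similarity between the two inferred doc-vectors,
# which is roughly what docvecs.similarity_unseen_docs() reports
similarity = np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))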

Which is better is something you should test for your goals. The average-of-word-vectors is a simpler, faster technique for making a text-vector – but still works ok for a lot of purposes. The inferred doc-vectors take longer to calculate, but with a good model, may be better for some tasks.
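
For instance, a small sketch of such a check, assuming test_pairs is a hypothetical list of (tokens_a, tokens_b, human_judgement) tuples you assemble yourself, and using the same calls as in your code:

for tokens_a, tokens_b, judgement in test_pairs:
    wv_sim = model.wv.n_similarity(tokens_a, tokens_b)                        # average-of-word-vectors
    dv_sim = model.docvecs.similarity_unseen_docs(model, tokens_a, tokens_b)  # inferred doc-vectors
    print(judgement, round(float(wv_sim), 3), round(float(dv_sim), 3))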

Other notes on your setup:

  • often, setting min_count as low as 2 is a bad idea: those rare words don't have enough examples to mean much, and actually interfere with the quality of surrounding words

  • 10k documents is on the smallish side for a training corpus, compared to published Doc2Vec results (which usually use tens-of-thousands to millions of documents).

  • published results often use 10-20 training epochs (though more, like your choice of 50, might be helpful especially for smaller corpuses)

  • on typical multi-core machines, workers=1 will be much slower than the default (workers=3); on a machine with 8 or more cores, up to workers=8 is often a good idea. (Though, unless you're using the newer corpus_file input option, raising workers further toward the full count of 16, 32, etc. cores doesn't help.)

  • classic Doc2Vec usage doesn't assign docs just known labels (as in your "10 string type of tags"), but a unique ID for each document. In some cases using, or adding, known labels as tags may help, but beware that if you're only supplying 10 tags, you've essentially turned your 10,000 documents into 10 documents (from the model's perspective, all texts with the same tag are treated as if they were segments of one larger document with that tag). In plain PV-DBOW, training only 10 doc-vectors, of 100 dimensions each, from just 10 distinct examples wouldn't make much sense: it'd be prone to severe overfitting. (In PV-DM, or PV-DBOW with dbow_words, the fact that the model is training both the 10 doc-vectors and many hundreds/thousands of other vocabulary word-vectors would help offset the risk of overfitting.) A sketch of the more typical per-document-ID tagging is shown just below.
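
For example, a minimal sketch of that tagging pattern, reusing the same <list_of_list_of_tokens> placeholder as in your code; label_for() is a hypothetical lookup returning one of your 10 known labels:

from gensim.models.doc2vec import TaggedDocument

tagged_data = [
    # one unique ID tag per document, plus (optionally) its known label as a second tag
    TaggedDocument(words=tokens, tags=['doc_%d' % i, label_for(i)])
    for i, tokens in enumerate(<list_of_list_of_tokens>)
]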

Lovettalovich answered 27/5, 2019 at 18:47 Comment(6)
I've also tried workers=3, but it results in a worse similarity score. Does the workers value impact the model?Schargel
The count of workers should have essentially no impact on model quality. If it seems to, there may be other problems with your comparison/training in those runs that have been misattributed to varying workers.Lovettalovich
I've added the code I use to train the model and to find the similarity score. I tried with both Python 2 and Python 3. The similarity score changes when I change the value of workers, on both Python versions.Schargel
I feel that each time I train the model (without changing parameters), I get different results, even when passing the seed parameter. Why is that?Schargel
These algorithms use random-sampling at a number of steps – initialization, subsampling of training-words, sampling of negative-examples. And, using multiple threads will slightly jumble the training order on each run. So slight differences from run-to-run are to be expected. But the overall quality of results for downstream purposes should be generally the same each run – with just a little jitter in similarities/rankings/etc. If you're not seeing that, you should show an example of where two runs give very different results.Lovettalovich
(A longer explanation of the sources of differences from run-to-run is in the Gensim FAQ: github.com/RaRe-Technologies/gensim/wiki/…)Lovettalovich
