Which method, dm or dbow, works well for document similarity using Doc2Vec?

I'm trying to find the similarity between two documents. I'm using Gensim's Doc2Vec to train on around 10k documents. There are around 10 string tags; each tag is a unique word and covers some subset of the documents. The model is trained using the distributed memory (PV-DM) method:

Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)

I've tried both dm and dbow. dm gives better results (similarity scores) than dbow. I understand the concepts of dm vs dbow, but I don't know which method is better for measuring similarity between two documents.

First question: which method performs best for similarity?

model.wv.n_similarity(<words_1>, <words_2>) gives a similarity score using word vectors.

model.docvecs.similarity_unseen_docs(model, doc1, doc2) gives a similarity score using doc vectors, where doc1 and doc2 are not tags or indexes of doctags; each is a sentence-like list of 10-20 words.

wv.n_similarity and docvecs.similarity_unseen_docs give different similarity scores for the same documents.

docvecs.similarity_unseen_docs gives slightly better results than wv.n_similarity, though wv.n_similarity sometimes also gives good results.

Second question: what is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen documents? (It might be a silly question.)

I ask because I thought docvecs.similarity_unseen_docs computed similarity over tags, not over the actual words belonging to those tags. Please correct me if I'm wrong.

Also, how can I convert a cosine similarity score into a probability?

Thanks.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Training of the model
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Finding similarity scores
model.wv.n_similarity(<doc_words1>, <doc_words2>)

model.random.seed(25)  # reset seed before inference, for repeatable inferred vectors
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
Schargel answered 27/5, 2019 at 9:34 Comment(0)

Both PV-DM mode (dm=1, the default) and PV-DBOW mode (dm=0) can work well. Which is better will depend on your data and goals. Once you have a robust way to quantitatively score the quality of a model's results for your project goals (which you'll want anyway, in order to tune all of the model's meta-parameters, including DM/DBOW mode), you can and should try both.
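
For example, a minimal sketch of training one model in each mode on the same corpus, assuming tagged_data is built as in your posted code (the parameter values are just illustrative, not recommendations):

from gensim.models.doc2vec import Doc2Vec

common = dict(vector_size=100, window=10, min_count=2, epochs=50, seed=25, workers=4)

# PV-DM (dm=1): doc-vectors trained together with context word-vectors
dm_model = Doc2Vec(dm=1, dm_mean=1, **common)

# PV-DBOW (dm=0): add dbow_words=1 if you also want usable word-vectors (see note below)
dbow_model = Doc2Vec(dm=0, dbow_words=1, **common)

for m in (dm_model, dbow_model):
    m.build_vocab(tagged_data)
    m.train(tagged_data, total_examples=m.corpus_count, epochs=m.epochs)

# then evaluate each model with your own task-specific scoring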

PV-DBOW trains fast, and often works very well on shortish docs (a few dozen words each). Note, though, that this mode doesn't train usable word-vectors unless you also add the dbow_words=1 option, which will slow training.

Using model.wv.n_similarity() relies on word-vectors only. It averages each set of word-vectors, then reports the cosine-similarity between those two averages. (So it will only be sensible in PV-DM mode, or in PV-DBOW mode with dbow_words=1 activated.)
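
As a rough sketch of what that call computes (not the exact library code; here out-of-vocabulary words are simply skipped, whereas the library call expects all words to be known):

import numpy as np

def avg_wordvec_similarity(model, words_1, words_2):
    # average the raw word-vectors of each token list...
    v1 = np.mean([model.wv[w] for w in words_1 if w in model.wv], axis=0)
    v2 = np.mean([model.wv[w] for w in words_2 if w in model.wv], axis=0)
    # ...then report the cosine-similarity between the two averages
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# should closely match model.wv.n_similarity(words_1, words_2)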

Using model.docvecs.similarity_unseen_docs() uses infer_vector() to treat each of the supplied docs as a new text, for which a true Doc2Vec doc-vector (not a mere average-of-word-vectors) is calculated. (This method operates on lists-of-words, not lists-of-tags.)
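
A sketch of the equivalent done by hand, assuming doc_words_1 and doc_words_2 are lists of tokens (the same kind of input as your <doc_words1>/<doc_words2> placeholders):

import numpy as np

# infer a fresh doc-vector for each text; note inference has some randomness,
# so repeated calls will give slightly different vectors
vec_1 = model.infer_vector(doc_words_1)
vec_2 = model.infer_vector(doc_words_2)

# cosine-similarity between the two inferred doc-vectors,
# which is roughly what docvecs.similarity_unseen_docs() reports
similarity = np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))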

Which is better is something you should test for your goals. The average-of-word-vectors is a simpler, faster technique for making a text-vector – but still works ok for a lot of purposes. The inferred doc-vectors take longer to calculate, but with a good model, may be better for some tasks.
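
For instance, a small sketch of such a check, assuming test_pairs is a hypothetical list of (tokens_a, tokens_b, human_judgement) tuples you assemble yourself, and using the same calls as in your code:

for tokens_a, tokens_b, judgement in test_pairs:
    wv_sim = model.wv.n_similarity(tokens_a, tokens_b)                        # average-of-word-vectors
    dv_sim = model.docvecs.similarity_unseen_docs(model, tokens_a, tokens_b)  # inferred doc-vectors
    print(judgement, round(float(wv_sim), 3), round(float(dv_sim), 3))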

Other notes on your setup:

  • often, setting min_count as low as 2 is a bad idea: those rare words don't have enough examples to mean much, and actually interfere with the quality of surrounding words

  • 10k documents is on the smallish side for a training corpus, compared to published Doc2Vec results (which usually use tens-of-thousands to millions of documents).

  • published results often use 10-20 training epochs (though more, like your choice of 50, might be helpful especially for smaller corpuses)

  • on typical multi-core machines, workers=1 will be much slower than the default (workers=3); on a machine with 8 or more cores, up to workers=8 is often a good idea. (Though, unless you're using the newer corpus_file input option, raising workers further toward the full count of 16, 32, etc. cores doesn't help.)

  • classic Doc2Vec usage doesn't assign docs just known labels (as in your "10 string type of tags"), but a unique ID for each document. In some cases using, or adding, known labels as tags may help, but beware that if you're only supplying 10 tags, you've essentially turned your 10,000 documents into 10 documents (from the model's perspective, all texts with the same tag are treated as if they were segments of one larger document with that tag). In plain PV-DBOW, training only 10 doc-vectors, of 100 dimensions each, from just 10 distinct examples wouldn't make much sense: it'd be prone to severe overfitting. (In PV-DM, or PV-DBOW with dbow_words, the fact that the model is training both the 10 doc-vectors and many hundreds/thousands of other vocabulary word-vectors would help offset the risk of overfitting.) A sketch of the more typical per-document-ID tagging is shown just below.
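
For example, a minimal sketch of that tagging pattern, reusing the same <list_of_list_of_tokens> placeholder as in your code; label_for() is a hypothetical lookup returning one of your 10 known labels:

from gensim.models.doc2vec import TaggedDocument

tagged_data = [
    # one unique ID tag per document, plus (optionally) its known label as a second tag
    TaggedDocument(words=tokens, tags=['doc_%d' % i, label_for(i)])
    for i, tokens in enumerate(<list_of_list_of_tokens>)
]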

Lovettalovich answered 27/5, 2019 at 18:47 Comment(6)
I've also tried workers=3, but it results in a worse similarity score. Does the workers value impact the model?Schargel
The count of workers should have essentially no impact on model quality. If it seems to, there may be other problems with your comparison/training in those runs that have been misattributed to varying workers.Lovettalovich
I've added the code I use to train the model and to find the similarity score. I tried with both Python 2 and Python 3. The similarity score changes when I change the value of workers, on both Python versions.Schargel
I feel that each time I train the model (without changing parameters), I get different results, even when passing the seed parameter. Why is that?Schargel
These algorithms use random-sampling at a number of steps – initialization, subsampling of training-words, sampling of negative-examples. And, using multiple threads will slightly jumble the training order on each run. So slight differences from run-to-run are to be expected. But the overall quality of results for downstream purposes should be generally the same each run – with just a little jitter in similarities/rankings/etc. If you're not seeing that, you should show an example of where two runs give very different results.Lovettalovich
(A longer explanation of the sources of differences from run-to-run is in the Gensim FAQ: github.com/RaRe-Technologies/gensim/wiki/…)Lovettalovich
