I'm trying to measure the similarity between two documents. I'm using Gensim's Doc2Vec to train on around 10k documents. There are around 10 string tags. Each tag is a unique word and is attached to a group of documents. The model is trained with the distributed memory (DM) method.
Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)
I've tried both DM and DBOW. DM gives better results (similarity scores) than DBOW. I understand the concepts behind DM and DBOW, but I don't know which method is better for measuring the similarity between two documents.
First question: which method performs best for similarity?
model.wv.n_similarity(<words_1>, <words_2>)
gives a similarity score using word vectors.
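For reference, here is a minimal sketch of what a word-vector similarity of this kind is generally understood to compute: average the word vectors of each word list, then take the cosine of the two averages. It is not Gensim's code. The helper names and the 2-dimensional `toy_vectors` are made up for illustration, and a real model would supply the vectors through `model.wv`.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def n_similarity_sketch(word_vectors, words_1, words_2):
    """Cosine between the mean word vectors of two word lists."""
    v1 = mean_vector([word_vectors[w] for w in words_1])
    v2 = mean_vector([word_vectors[w] for w in words_2])
    return cosine(v1, v2)

# Hypothetical 2-d word vectors, for illustration only
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}
score = n_similarity_sketch(toy_vectors, ["cat", "dog"], ["car"])
```

Because only word vectors are averaged, word order and any trained document vectors play no part in this score.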
model.docvecs.similarity_unseen_docs(model, doc1, doc2)
gives a similarity score using doc vectors, where doc1 and doc2 are not tags or indexes of doctags. doc1 and doc2 are each sentence-like texts of 10-20 words.
wv.n_similarity and docvecs.similarity_unseen_docs give different similarity scores for the same kinds of documents.
docvecs.similarity_unseen_docs gives slightly better results than wv.n_similarity, but wv.n_similarity sometimes gives good results too.
Question: what is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen documents (this might be a silly question)?
I ask because docvecs.similarity_unseen_docs seems to compute the similarity score on tags, not on the actual words belonging to those tags. I'm not sure about this, so please correct me if I'm wrong.
How can I convert a cosine similarity score to a probability?
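To make this last question concrete: a cosine similarity lies in [-1, 1] and is not a probability in itself. The simplest mapping is a linear rescale to [0, 1], sketched below. The function name is hypothetical, and the result is only a normalized score, not a calibrated probability; a real probability would presumably need labeled similar/dissimilar pairs to fit a calibration.

```python
def cosine_to_unit_interval(cos_sim):
    """Linearly rescale a cosine similarity from [-1, 1] to [0, 1]."""
    # Clamp first, since floating-point error can push a cosine slightly out of range
    clamped = max(-1.0, min(1.0, cos_sim))
    return (clamped + 1.0) / 2.0

rescaled = cosine_to_unit_interval(0.5)
```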
Thanks.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Model configuration (distributed memory, dm=1)
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
# Training of the model: one TaggedDocument per token list, tagged with its index
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
# Finding similarity score from word vectors
model.wv.n_similarity(<doc_words1>, <doc_words2>)
# Re-seed the model's RNG, then find the similarity score from inferred doc vectors
model.random.seed(25)
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
workers=3, but it results in a bad similarity score. Does the workers value affect the model? – Schargel