Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors
I'm training a Word2Vec model like:

model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)

and Doc2Vec model like:

doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

with the same data and comparable parameters.

After this I use both models for my classification task, and I've found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150), which made no difference.
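For reference, the averaging baseline I'm comparing against looks roughly like this (a minimal sketch; the toy dict stands in for gensim's model.wv, and the function name and fallback behavior are my own choices, not from gensim):

```python
import numpy as np

def document_vector(word_vectors, document, dim=200):
    """Average the vectors of all known words in a tokenized document.

    `word_vectors` is any mapping from word to a 1-D numpy array (for a
    trained gensim model this would be model.wv); `document` is a list
    of tokens. Out-of-vocabulary words are skipped; an all-OOV document
    falls back to a zero vector.
    """
    vectors = [word_vectors[w] for w in document if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# toy illustration with hand-made 3-d "embeddings"
toy = {"a": np.array([1.0, 0.0, 0.0]), "b": np.array([0.0, 1.0, 0.0])}
vec = document_vector(toy, ["a", "b", "oov"], dim=3)  # OOV token skipped
```

Summing instead of averaging is the same sketch with np.sum in place of np.mean.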

Any tips or ideas why and how to improve doc2vec results?

Update: This is how doc2vec_tagged_documents is created:

doc2vec_tagged_documents = [
    TaggedDocument(document, [i]) for i, document in enumerate(documents)
]

Some more facts about my data:

  • My training data contains 4000 documents
  • with 900 words on average.
  • My vocabulary size is about 1000 words.
  • My data for the classification task is much smaller on average (12 words on average), but I also tried to split the training data to lines and train the doc2vec model like this, but it's almost the same result.
  • My data is not about natural language, please keep this in mind.
Supernumerary answered 21/7, 2017 at 9:40
Summing/averaging word2vec vectors is often quite good!

It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)

If your main interest is the doc-vectors – and not the word-vectors that are in some Doc2Vec modes co-trained – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top-performer.

If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collection of longer docs.

Other things that sometimes help improve Doc2Vec vectors for classification purposes:

  • re-inferring all document vectors, at the end of training, perhaps even using parameters different from infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than what's left-over from bulk training

  • where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags

  • rare words are essentially just noise to Word2Vec or Doc2Vec – so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)

Hope this helps.

Machinery answered 21/7, 2017 at 23:58
What is the advantage of doc2vec over averaging word vectors? Does doc2vec account for a word's surroundings in the sentence while building the vector from the test sentence? Because that's one place where word2vec doesn't help. – Compressor
Whether Doc2Vec works better than just averaging word vectors can depend on your corpus and goals. It's using similar inputs (word co-occurrences within context-windows or documents), a similarly-sized predictive model (that generates the word or doc vectors), and similarly-sized text-representations (same number of dimensions), so scores on evaluations are likely to be in the same ballpark. In making each doc-vector predictive of all the words in each text, a doc-vector might model the text better than averages based on all those words' other occurrences. – Machinery
But it can make sense to try/tune both, especially since the simple average (or some sort of weighted average) can be so easy to calculate as a baseline. – Machinery
