I'm training a Word2Vec model like this:
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and a Doc2Vec model like this:
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using both models for my classification task, and I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150 - it makes no difference).
Any tips or ideas why this happens, and how to improve the doc2vec results?
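For clarity, the averaging baseline I compare against looks roughly like the sketch below. It is self-contained: a small random dict stands in for the trained model's word-vector lookup (in the real pipeline this would be `model.wv` from the Word2Vec model above), and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

# Stand-in for model.wv: word -> 200-dim vector lookup
# (in the real pipeline this is the trained Word2Vec model's wv)
rng = np.random.default_rng(0)
wv = {w: rng.normal(size=200) for w in ["alpha", "beta", "gamma"]}

def doc_vector(tokens, wv, dim=200):
    """Average the word vectors of all in-vocabulary tokens.

    Returns a zero vector if no token is in the vocabulary.
    """
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

v = doc_vector(["alpha", "beta", "unknown"], wv)
print(v.shape)  # (200,)
```

Summing instead of averaging is the same sketch with `np.sum` in place of `np.mean`.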
Update: This is how doc2vec_tagged_documents is created:
doc2vec_tagged_documents = []
for counter, document in enumerate(documents):
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
Some more facts about my data:
- My training data contains 4000 documents with 900 words on average.
- My vocabulary size is about 1000 words.
- My data for the classification task is much shorter (12 words on average). I also tried splitting the training data into lines and training the doc2vec model on those, but the result is almost the same.
- My data is not natural language, please keep this in mind.
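To make the line-splitting variant concrete, here is one plausible reading of it as a sketch: each line becomes its own training example but keeps the tag of the document it came from. A namedtuple stands in for gensim's TaggedDocument so the snippet runs standalone; the toy corpus and its nesting (documents as lists of tokenized lines) are assumptions for illustration.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (fields: words, tags)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Toy corpus: each document is a list of lines, each line a list of tokens
documents = [
    [["a", "b"], ["c", "d", "a"]],  # document 0, two lines
    [["e", "f"]],                   # document 1, one line
]

# One TaggedDocument per line, tagged with the parent document's index,
# so all lines of the same document share a tag
line_docs = [
    TaggedDocument(words=line, tags=[doc_id])
    for doc_id, doc in enumerate(documents)
    for line in doc
]
print(len(line_docs))  # 3
```

Sharing the parent tag across lines still trains one vector per original document, just from shorter examples.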