How does doc2vec perform when trained on different-sized datasets? There is no mention of dataset size in the original corpus, so I am wondering what the minimum size is to get good performance out of doc2vec.
A bunch of things have been called 'doc2vec', but the name most often refers to the 'Paragraph Vector' technique from Le and Mikolov.
The original 'Paragraph Vector' paper describes evaluating it on three datasets:
- 'Stanford Sentiment Treebank': 11,825 sentences of movie-reviews (which were further broken into 239,232 fragment-phrases of a few words each)
- 'IMDB Dataset': 100,000 movie-reviews (often of a few hundred words each)
- Search-result 'snippet' paragraphs: 10,000,000 paragraphs, collected from the top-10 Google search results for each of the top 1,000,000 most-common queries
The first two are publicly available, so you can also review their total sizes in words, typical document sizes, and vocabularies. (Note, though, that no one has been able to fully reproduce that paper's sentiment-classification results on either of those first two datasets, implying some missing info or error in their reporting. It's possible to get close on the IMDB dataset.)
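If you want to review those same properties (total size in words, typical document size, vocabulary) for a corpus of your own, a minimal sketch along these lines might help. The function name `corpus_stats` and the lowercase-and-split-on-whitespace tokenization are my own assumptions for illustration, not anything taken from the papers; a real project would probably tokenize more carefully.

```python
from collections import Counter
from statistics import median


def corpus_stats(docs):
    """Summarize a corpus given as an iterable of plain-text documents.

    Tokenization here is a naive lowercase + whitespace split
    (an assumption for illustration only).
    """
    lengths = []
    vocab = Counter()
    for text in docs:
        tokens = text.lower().split()
        lengths.append(len(tokens))
        vocab.update(tokens)
    return {
        "documents": len(lengths),
        "total_words": sum(lengths),
        "median_doc_length": median(lengths) if lengths else 0,
        "min_doc_length": min(lengths, default=0),
        "max_doc_length": max(lengths, default=0),
        "vocabulary_size": len(vocab),
    }


if __name__ == "__main__":
    # Tiny made-up sample; substitute your own documents.
    sample = [
        "a surprisingly good movie",
        "dull plot and dull acting",
        "good",
    ]
    for name, value in corpus_stats(sample).items():
        print(f"{name}: {value}")
```

A large gap between the minimum and maximum document length is a quick hint that very differently sized documents are being mixed.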
A followup paper applied the algorithm to discovering topical-relationships in the datasets:
- Wikipedia: 4,490,000 article body-texts
- Arxiv: 886,000 academic-paper texts extracted from PDFs
So the corpora used in those two early papers ranged from tens of thousands to millions of documents, with document sizes from phrases of a few words to articles of thousands of words. (But those works did not necessarily mix wildly-differently-sized documents.)
In general, word2vec/paragraph-vector techniques benefit from a lot of data and a variety of word-contexts. I wouldn't expect good results without at least tens of thousands of documents. Documents longer than a few words each work much better. Results may be harder to interpret if documents of wildly different sizes or kinds are mixed in the same training, such as mixing tweets and books.
But you really have to evaluate it with your corpus and goals, because what works with some data, for some purposes, may not be generalizable to very-different projects.
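As one possible starting point for such an evaluation, here is a rough, library-agnostic sketch of a self-similarity sanity check: re-infer a vector for each training document and see whether that document's own trained vector comes out as the nearest. The function names (`cosine`, `self_ranks`, `fraction_at_top`) are hypothetical, and the vectors are assumed to be plain lists of floats produced by whatever Doc2Vec implementation you use (in gensim, for instance, inferred vectors come from `Doc2Vec.infer_vector`). It is only a basic consistency check, not a substitute for testing against your actual goal.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_ranks(trained, inferred):
    """For each document i, return the 0-based rank of trained[i] among all
    trained vectors when ordered by similarity to inferred[i].

    Rank 0 means the document's own trained vector is the closest match.
    """
    ranks = []
    for i, query in enumerate(inferred):
        sims = [cosine(query, vec) for vec in trained]
        # Count trained vectors strictly more similar than the document's own.
        ranks.append(sum(1 for s in sims if s > sims[i]))
    return ranks


def fraction_at_top(trained, inferred):
    """Share of documents whose own trained vector ranks first."""
    ranks = self_ranks(trained, inferred)
    return sum(1 for r in ranks if r == 0) / len(ranks) if ranks else 0.0


if __name__ == "__main__":
    # Made-up 2-d vectors purely to show the call shape.
    trained = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    inferred = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
    print(self_ranks(trained, inferred))
    print(fraction_at_top(trained, inferred))
```

If most documents fail to rank themselves near the top, that may suggest too little data, too few training passes, or documents too short to model well, and it is worth finding out before moving on to a task-specific evaluation.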