The question is simple: which of CBOW and skip-gram works better on a large dataset? (The answer for a small dataset then follows from it.)
I am confused because, according to Mikolov himself, [Link]
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
But according to Google's TensorFlow tutorial, [Link]
CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.
However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.
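To make that "one observation vs. many observations" distinction concrete, here is a toy sketch of my own (not code from either source) showing how the same context window yields training examples under each architecture:

```python
# Toy illustration of how CBOW and skip-gram slice the same window.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

for i, target in enumerate(sentence):
    # All words within `window` positions of the target, excluding the target itself.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]

    # CBOW: the entire context is a single training example predicting the target.
    cbow_example = (context, target)

    # Skip-gram: every (target, context word) pair is its own training example.
    skipgram_examples = [(target, c) for c in context]

    print(cbow_example, skipgram_examples)
```

So for the same corpus, skip-gram generates several observations where CBOW generates one, which is presumably what the TensorFlow quote means by "smoothing over" the distributional information.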
Here is a Quora post that supports the first claim [Link], and here is another Quora post that supports the second [Link]; both seem to follow from the credible sources above.
Or does it simply come down to what Mikolov said:
Overall, the best practice is to try few experiments and see what works the best for you, as different applications have different requirements.
But surely there is an empirical or analytical verdict, a final word, on this matter?
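For reference, this is roughly how I have been comparing the two, in case the answer depends on the setup. A minimal sketch using gensim's Word2Vec (the corpus here is just a placeholder, and I am assuming gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

# Placeholder corpus: any iterable of tokenized sentences will do.
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# sg=0 selects CBOW, sg=1 selects skip-gram; all other hyperparameters identical.
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
sg_model   = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Compare on whatever evaluation the task calls for,
# e.g. nearest neighbours of a probe word:
print(cbow_model.wv.most_similar("fox", topn=3))
print(sg_model.wv.most_similar("fox", topn=3))
```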