The question is simple: which of CBOW and skip-gram works better on a large dataset? (The answer for a small dataset then follows from it.)
I am confused because, according to Mikolov himself, [Link]
Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words
But according to Google's TensorFlow tutorial, [Link]
CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.
However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.
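To make that "one observation vs. many observations" distinction concrete, here is a toy sketch of my own (not code from either source) showing how the same context window yields training examples under each architecture:

```python
# Toy illustration of how CBOW and skip-gram slice the same window.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

for i, target in enumerate(sentence):
    # All words within `window` positions of the target, excluding the target itself.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]

    # CBOW: the entire context is a single training example predicting the target.
    cbow_example = (context, target)

    # Skip-gram: every (target, context word) pair is its own training example.
    skipgram_examples = [(target, c) for c in context]

    print(cbow_example, skipgram_examples)
```

So for the same corpus, skip-gram generates several observations where CBOW generates one, which is presumably what the TensorFlow quote means by "smoothing over" the distributional information.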
Here is a Quora post that supports the first claim [Link], and here is another Quora post that supports the second [Link]; both seem to follow from the credible sources above.
Or does it simply come down to what Mikolov said:
Overall, the best practice is to try few experiments and see what works the best for you, as different applications have different requirements.
But surely there is an empirical or analytical verdict, a final word, on this matter?
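For reference, this is roughly how I have been comparing the two, in case the answer depends on the setup. A minimal sketch using gensim's Word2Vec (the corpus here is just a placeholder, and I am assuming gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

# Placeholder corpus: any iterable of tokenized sentences will do.
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# sg=0 selects CBOW, sg=1 selects skip-gram; all other hyperparameters identical.
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
sg_model   = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Compare on whatever evaluation the task calls for,
# e.g. nearest neighbours of a probe word:
print(cbow_model.wv.most_similar("fox", topn=3))
print(sg_model.wv.most_similar("fox", topn=3))
```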