In this page, it is said that:
[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]
However, looking at the training dataset it produces, the content of the X and Y pair seems to be interexchangeable, as those two pairs of (X, Y):
(quick, brown), (brown, quick)
So, why distinguish that much between context and targets if it is the same thing in the end?
Also, doing Udacity's Deep Learning course exercise on word2vec, I wonder why they seem to do the difference between those two approaches that much in this problem:
An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.
Would not this yields the same results?