How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1?
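For concreteness, here is a minimal sketch of the two configurations being asked about (the corpus and sizes are made up purely for illustration; dm, dbow_words, vector_size, window, min_count and epochs are gensim's own parameters):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus, purely for illustration
    docs = [TaggedDocument(words=["the", "cat", "sat"], tags=["doc_0"]),
            TaggedDocument(words=["the", "dog", "ran"], tags=["doc_1"])]

    # Plain PV-DBOW: only doc-vectors (plus the output layer) are trained
    plain_dbow = Doc2Vec(docs, dm=0, dbow_words=0, vector_size=50,
                         window=5, min_count=1, epochs=20)

    # PV-DBOW with interleaved skip-gram word-vector training
    dbow_plus_words = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50,
                              window=5, min_count=1, epochs=20)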

In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.

But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors.

So, how is this done? Any clarification would be much appreciated!

Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors, and there are N*p + M*q parameters plus the classifier parameters (where M is the vocabulary size and q the word-vector dimension).

Chickie answered 9/4, 2019 at 11:46 Comment(0)

If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.

So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then made to the model & doc-vector. Then, each of the surrounding words is used, one at a time in skip-gram fashion, to try to predict that same target word, with the follow-up adjustments made.

Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
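Very roughly, that interleaving could be sketched like this (illustrative pseudocode only, not gensim's actual optimized implementation; train_pair is a hypothetical stand-in for the gradient update against the shared output layer):

    # Illustrative sketch of dbow_words=1 pair scheduling for one document.
    # `train_pair(input_key, target_word)` is a hypothetical stand-in for
    # the real softmax/negative-sampling update; it is NOT a gensim function.
    def train_document_dbow_words(doc_tag, words, window, train_pair):
        for pos, target in enumerate(words):
            # PV-DBOW step: the doc-vector alone predicts the target word
            train_pair(doc_tag, target)
            # Skip-gram steps: each nearby word predicts the same target,
            # one (context_word -> target_word) pair at a time
            start = max(0, pos - window)
            context = words[start:pos] + words[pos + 1:pos + 1 + window]
            for context_word in context:
                train_pair(context_word, target)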

As some logical consequences of this:

  • training takes longer than plain PV-DBOW - by about a factor equal to the window parameter

  • word-vectors overall wind up getting more total training attention than doc-vectors, again by a factor equal to the window parameter

Tiros answered 9/4, 2019 at 16:28 Comment(7)
many thanks for the fast and helpful answer! (1) I understand that in this setting, word and doc vectors are indeed trained at the same time, but they don't interact. Hence, in terms of quality there is probably no improvement vs. training word and doc vectors separately? (2) I conclude that when dm=0 and dbow_words=0, word vectors are still created but never used/trained. Do you know by any chance how to get rid of them to reduce model size on disk and RAM?Chickie
elaborating on (1): I probably misunderstood something, but doesn't your explanation that word and doc vectors are trained simultaneously but without interacting contradict the results presented in this paper (section 5) that pre-training word vectors improves the quality of the DBOW doc vectors? If there is no leak between the two tasks, this shouldn't change anything, no?Chickie
There's no supported way to discard the allocated, untrained word-vectors in the dbow_words=0 case. If you're done with both training and inference (which is also a kind of training), and only need to access trained-up doc-vectors, you could possibly del the associated d2v_model.wv property - but that might prevent other save()/load() operations from working, I'm not sure.Tiros
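Rather than del-ing the .wv attribute, one rough workaround is to save just the doc-vectors' KeyedVectors and discard the whole model; a sketch, assuming gensim 4.x attribute names (doc-vectors live in model.dv; older versions used model.docvecs):

    # `model` is a previously trained Doc2Vec model, with training and
    # inference fully finished. Unsupported shortcut: the full model may
    # no longer save()/load() cleanly afterwards.
    doc_vectors = model.dv               # KeyedVectors holding the doc-vectors
    doc_vectors.save("doc_vectors.kv")   # persist only the doc-vectors
    del model                            # discard the rest, incl. word-vectors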
In dbow_words=1 mode, word-vectors are trained with some (context_word->target_word) pairs, then doc-vectors are trained with (doc_tag->target_word) pairs, then that's repeated in interleaved fashion. So no individual micro-training-example involves both. But that's also the case between many words in normal word training, and the words still wind up in useful relative positions. That's because all training examples share the same hidden->output layer of the neural network. Thus, the contrasting examples are each changing some shared parameters, and indirectly affect each other.Tiros
Whether adding dbow_words helps or hurts will be very specific to your data, goals, and meta-parameters. Whether seeding a Doc2Vec model with pre-trained word-vectors helps – an option for which there is no official gensim support – will depend on how well that pre-trained vocabulary suits your documents, and the model mode. And in dbow_words=0 mode, pre-loaded word-vectors can't have any effect, direct or indirect, on the doc-vectors - to the extent that paper suggests that, it is confused. (I also make this point at: groups.google.com/d/msg/gensim/4-pd0iA_xW4/UzpuvBOPAwAJ )Tiros
You can find more of my concerns about the specific claims/tests/gaps of that paper in some discussion at a project github issue – starting at github.com/RaRe-Technologies/gensim/issues/… – and in other discussion-group links from that issue.Tiros
thank you so much for your time providing such detailed explanations and useful links, it is greatly appreciated. Indeed, you're right about the indirect influence thing. I was not considering the fact that the projection->output matrix is shared both by word and doc vectors. Thanks again!Chickie
