For gensim's Doc2Vec, your text examples must be objects similar to the provided TaggedDocument class, with words and tags properties. The tags property should be a list of 'tags', which serve as keys to the doc-vectors that will be learned from the corresponding text.
In the classic/original case, each document has a single tag – essentially a unique ID for that one document. (Tags can be strings, but for very large corpuses, Doc2Vec will use somewhat less memory if you instead use tags that are plain Python ints, starting from 0, with no skipped values.)
The tags are used to look up the learned vectors after training. If a document during training had the single tag 'mars', you'd look up its learned vector with:

model.docvecs['mars']
If you were to make a model.docvecs.most_similar('mars') call, the results would be reported by their tag keys as well.
The tags are just keys into the doc-vectors collection – they have no semantic meaning, and even if a tag string also appears among the word-tokens of a text, there's no necessary relation between that tag key and the word.
That is, if you have a document whose single ID tag is 'mars', there's no essential relationship between the learned doc-vector accessed via that key (model.docvecs['mars']) and any learned word-vector accessed with the same string key (model.wv['mars']) – they come from separate collections of vectors.