I am using gensim Doc2Vec. I want to know if there is an efficient way to get the vocabulary size from Doc2Vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more) this won't be efficient.
Is there any way to get the vocabulary size from a Doc2Vec model?
If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary (after applying your min_count) is available from:
len(model.wv.vocab)
The number of trained document tags is available from:
len(model.docvecs)
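For instance, here is a minimal sketch against the pre-4.0 gensim API (e.g. 3.8); the two-document corpus is purely illustrative:

# Minimal sketch, pre-4.0 gensim API; the tiny corpus is only illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=["doc0"]),
    TaggedDocument(words=["the", "dog", "ran"], tags=["doc1"]),
]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=5)

print(len(model.wv.vocab))   # unique word tokens that survived min_count
print(len(model.docvecs))    # trained document tags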
There is no such parameter as vocab. – Unpin
Of course there was, at the time of writing! And still is, just in a different place. In more recent versions of gensim, the vocab object has been moved to the constituent wv property, and in 1.0.0, released in February 2017 after this answer was first written, the prior backward-compatible access to vocab via model.vocab was removed. The answer above has been updated to match current gensim. – Delgado
Welp, I deleted my comment as yours and mine were the same. – Unpin
vocab is a dictionary, so you can use keys() to list the words:
model.wv.vocab.keys()
This returns a view of the vocabulary words (wrap it in list() if you need an actual list).
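For example, assuming model is a trained pre-4.0 Doc2Vec model:

words = list(model.wv.vocab.keys())   # materialize the dict view as a list
print(len(words), words[:10])         # vocabulary size and a sample of the words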
An update for gensim version 4: you can get the vocabulary size with:
vocab_len = len(model.wv) # 👍
See the Migrating to Gensim 4.0 page.
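A short sketch of the 4.x equivalents, assuming model is a trained Doc2Vec model:

vocab_len = len(model.wv)         # vocabulary size
words = model.wv.index_to_key     # list of vocabulary words
doc_count = len(model.dv)         # trained document tags (model.dv replaces model.docvecs)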