Is there any way to get the vocabulary size from doc2vec model?

I am using gensim's Doc2Vec. Is there an efficient way to get the vocabulary size from a Doc2Vec model? One crude way is to count the words in the data directly, but if the data is huge (1 GB or more), that won't be efficient.

Hiccup answered 12/1, 2017 at 8:7

If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary after applying your min_count is available from:

len(model.wv.vocab)

The number of trained document tags is available from:

len(model.docvecs)
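
To make "surviving vocabulary after applying your min_count" concrete, here is a minimal pure-Python sketch of the filtering idea. It is not gensim's implementation; the helper name `surviving_vocab` and the toy corpus are made up for illustration, and it assumes a word survives when its total count is at least min_count.

```python
from collections import Counter

def surviving_vocab(corpus, min_count=5):
    """Return the set of tokens whose total frequency is at least min_count.

    Hypothetical helper: it only mimics the effect of min_count,
    it does not call gensim.
    """
    counts = Counter(token for doc in corpus for token in doc)
    return {token for token, n in counts.items() if n >= min_count}

# Toy corpus of pre-tokenized documents (illustrative data only).
corpus = [
    ["human", "machine", "interface"],
    ["machine", "learning", "survey"],
    ["human", "machine", "survey"],
]

vocab = surviving_vocab(corpus, min_count=2)
print(len(vocab))  # 3: "human", "machine" and "survey" occur at least twice
```

On a trained model there is no need to recount like this, since the vocabulary was already built during training and only its length is read.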
Delgado answered 19/1, 2017 at 0:29 Comment(3)
There is no such parameter as vocab. - Unpin
Of course there was, at the time of writing! And there still is, just in a different place! In more recent versions of gensim, the vocab object has been moved to a constituent wv property, and in 1.0.0, released February 2017 after this answer was first written, the earlier backward-compatible access via model.vocab was removed. The answer above has been updated to match current gensim. - Delgado
Welp, I deleted my comment as yours and mine were the same. - Unpin

The return data type of vocab is a dictionary. Use keys() as follows:

model.wv.vocab.keys()

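This returns the words in the vocabulary. On Python 3 the result is a dictionary view rather than a list, so wrap it in list() if you need an actual list.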
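
A minimal sketch of how keys() behaves on a Python 3 dictionary. A plain dict is used here as a hypothetical stand-in for model.wv.vocab, so no gensim install is needed; the words and values are made up.

```python
# A plain dict standing in for model.wv.vocab (an assumption for illustration;
# the real values are gensim vocabulary entries, not integers).
vocab = {"human": 0, "machine": 1, "survey": 2}

keys_view = vocab.keys()   # a dict_keys view, not a list
words = list(keys_view)    # materialize the view into a real list

print(type(keys_view).__name__)  # dict_keys
print(words)                     # ['human', 'machine', 'survey']
print(len(keys_view))            # 3, the view supports len() directly
```

If only the size is needed, len() on the dictionary or on the view is enough and no list has to be built.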

Icecold answered 7/5, 2019 at 11:24

An update for gensim version 4. You can get the vocabulary size with:

vocab_len = len(model.wv)  # 👍

See the "Migrating to Gensim 4.0" page.
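
If the same code has to run against both older and newer gensim releases, one option is a small helper that checks which attribute is present. This is only a sketch: the function name `vocab_size` is made up, it assumes gensim 4 exposes key_to_index on model.wv and earlier releases expose vocab there, and the objects below are stand-ins rather than real models.

```python
from types import SimpleNamespace

def vocab_size(model):
    """Return the vocabulary size for a gensim 3.x-style or 4.x-style model.

    Hypothetical helper, relying only on attribute names.
    """
    wv = model.wv
    if hasattr(wv, "key_to_index"):   # gensim 4.x: word -> index mapping
        return len(wv.key_to_index)
    return len(wv.vocab)              # before 4.0: word -> vocab entry dict

# Stand-in objects that only imitate the attribute layout (not real models).
gensim3_like = SimpleNamespace(wv=SimpleNamespace(vocab={"human": None, "machine": None}))
gensim4_like = SimpleNamespace(
    wv=SimpleNamespace(key_to_index={"human": 0, "machine": 1, "survey": 2})
)

print(vocab_size(gensim3_like), vocab_size(gensim4_like))  # 2 3
```

With a real trained model, `vocab_size(model)` would be called the same way; on gensim 4 alone, len(model.wv) is the shorter route.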

Cysto answered 20/11, 2021 at 15:54
