Gensim Doc2Vec generating huge file for model [closed]
I am trying to use the doc2vec implementation from the gensim package. My problem is that when I train and save the model, the resulting model file is rather large (2.5 GB). I tried using this line:

model.estimate_memory()

But it didn't change anything. I also tried changing max_vocab_size to decrease the size, but with no luck. Can somebody help me with this?
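For reference, a simplified sketch of the kind of code I'm running (toy corpus and placeholder parameter values; I'm using the current gensim parameter names, where vector_size/epochs were size/iter in older releases):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy stand-in for the real corpus.
docs = [TaggedDocument(words=["some", "tokenised", "text"], tags=[i]) for i in range(1000)]

model = Doc2Vec(documents=docs, vector_size=300, min_count=2, epochs=20)
print(model.estimate_memory())  # prints a dict of estimated array sizes, in bytes
model.save("doc2vec.model")     # this saved file is what comes out huge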

Outspan answered 19/7, 2017 at 15:37 Comment(3)
Nothing wrong here, document embeddings are just huge.Buck
I'm voting to close this question as off-topic because it's not a problem that can be resolved, just a misunderstanding of how the library being used works. 2.5G is already on the small side for this.Buck
I don't get the objection. Someone doesn't understand a programming task's resource requirements, which generates a question. Explaining the underlying operation of the algorithm/library can resolve the misunderstanding, and there are other coding options for achieving the underlying goals. These form a useful answer to a sufficiently-specified question.Habilitate

Doc2Vec models can be large. In particular, any word-vectors in use take 4 bytes per dimension, times two layers of the model. So a 300-dimension model with a 200,000-word vocabulary will use, just for the word-vector arrays themselves:

200,000 vectors * 300 dimensions * 4 bytes/float * 2 layers = 480MB

(There will be additional overhead for the dictionary storing vocabulary information.)

Any doc-vectors will also use 4 bytes per dimension. So if you train vectors for a million doc-tags, the model will use, just for the doc-vectors array:

1,000,000 vectors * 300 dimensions * 4 bytes/float = 1.2GB

(If you're using arbitrary string tags to name the doc-vectors, there'll be additional overhead for that.)
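The same arithmetic as a quick Python sketch (raw arrays only; the vocabulary-dict and string-tag overhead come on top of this):

def doc2vec_array_bytes(vocab_size, doc_count, dims, word_layers=2):
    word_bytes = vocab_size * dims * 4 * word_layers  # input vectors + hidden/output layer
    doc_bytes = doc_count * dims * 4                   # doc-vectors array
    return word_bytes, doc_bytes

word_b, doc_b = doc2vec_array_bytes(vocab_size=200_000, doc_count=1_000_000, dims=300)
print("word-vectors: %.2f GB, doc-vectors: %.2f GB" % (word_b / 1e9, doc_b / 1e9))
# word-vectors: 0.48 GB, doc-vectors: 1.20 GB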

To use less memory when loaded (which will also result in a smaller saved file), you can use a smaller vocabulary, train fewer doc-vectors, or use a smaller vector size.
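For example, a sketch with current gensim parameter names (vector_size was size, and epochs was iter, in older releases; the values are only placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["some", "tokenised", "text"], tags=[i]) for i in range(100)]  # toy corpus

model = Doc2Vec(
    vector_size=100,         # smaller dimensionality -> proportionally smaller arrays
    min_count=5,             # discard rare words, shrinking the vocabulary
    max_vocab_size=100_000,  # cap the vocabulary during the initial corpus scan
    epochs=20,
)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("smaller_doc2vec.model")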

If you'll only need the model for certain narrow purposes, there may be other parts you can throw out after training – but that requires knowledge of the model internals/source-code, and your specific needs, and will result in a model that's broken (and likely to throw errors) for many other usual operations.
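As one concrete example of that kind of trimming (a sketch assuming a trained model as above, and gensim 2.x/3.x, since this method was removed in 4.x):

# Drops training-only arrays; the model can no longer be trained further,
# and with keep_inference=False it would also lose infer_vector() for new docs.
model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model.save("trimmed_doc2vec.model")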

Habilitate answered 19/7, 2017 at 17:48 Comment(5)
I am currently trying to generate a doc2vec model for 60k documents with around 200k sentences, projecting it into 50 dimensions. The corresponding word2vec model is 22 MB, while the doc2vec model splits into three files adding up to 12.6 GB. This does not add up with your math at all. Is there something else that I could be doing wrong?Inertia
@daniel-töws Are you providing tags during initial training that include plain-ints, but include high numbers way beyond just 0 to 200000? If you use any plain-ints, the assumption is you want those numbers as direct literal slots in the array, & an array large enough to include the largest int tag will be allocated – possibly wasting an arbitrary amount of space on vectors for not-used tags. Only use plain-int tags if you can give docs IDs from 0 that rise through contiguous numbers. Otherwise, string tags are better - & a dict mapping strings to internal slots is used automatically. (See the short sketch after these comments.)Habilitate
I replaced my string IDs (for the paragraph vectors you mean, right?) with simple contiguous ints. It didn't change anything.Inertia
It's hard to do this in the comments of someone else's question. I suggest posting a new question which shows: (1) the size of your corpus, in doc-count, raw-word-count, unique-word-count; (2) the code you use for training and saving the Doc2Vec & Word2Vec models on this corpus, with all model parameters; (3) the names/sizes of files on disk. But in the end, it's likely to be: if you have an oversized model on disk, it's because you've specified, in parameters & corpus, one with lots of unique words/tags and thus vectors.Habilitate
I went through my code again and it turns out I had a bug during the index phase for the tagged documents. Now it works like a charm. Sorry to bother you...Inertia
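To illustrate the plain-int vs. string tag point from the comments above (a standalone sketch, not tied to the corpus discussed):

from gensim.models.doc2vec import TaggedDocument

corpus = [["first", "doc"], ["second", "doc"], ["third", "doc"]]

# Contiguous int tags 0..N-1: the doc-vectors array gets exactly len(corpus) rows.
compact = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(corpus)]

# A stray large int tag forces an array big enough to index it directly:
# here 10,000,003 rows would be allocated, almost all of them never trained.
wasteful = [TaggedDocument(words=toks, tags=[10_000_000 + i]) for i, toks in enumerate(corpus)]

# String tags avoid the trap: each unique string is mapped to an internal slot.
named = [TaggedDocument(words=toks, tags=["doc_%d" % i]) for i, toks in enumerate(corpus)]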
