What do the epochs parameters in Doc2Vec and train() mean, and do I have to manually run the iteration?
I am trying to understand the epochs parameter of the Doc2Vec constructor and the epochs parameter of the train() function.

In the following code snippet, I manually set up a loop of 4000 iterations. Is this required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how is epochs in Doc2Vec different from epochs in train()?

documents = Documents(train_set)  # Documents: my own iterable wrapping the training corpus

model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000, window=5,
                seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0.025)

model.build_vocab(documents)

for epoch in range(model.epochs):
    print("epoch " + str(epoch))
    model.train(documents, total_examples=total_length, epochs=1)  # total_length: my document count
    ckpnt = model_name + "_epoch_" + str(epoch)  # model_name: my base filename for checkpoints
    model.save(ckpnt)
    print("Saving {}".format(ckpnt))

Also, how and when are the weights updated?

Picaresque answered 9/7, 2018 at 12:32 Comment(1)
@Downvoter It is so frustrating to have a downvote without a comment on how to improve the question. - Picaresque

You don't have to manually run the iteration, and you shouldn't call train() more than once unless you're an expert who needs to do so for very specific reasons. If you've seen this technique in some online example you're copying, that example is likely outdated and misleading.

Call train() once, with your preferred number of passes as the epochs parameter.
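
For example, a minimal corrected version of the question's setup might look like this sketch (it assumes train_set is a list of token lists, as in the question, and uses illustrative integer tags):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Assumed: train_set is a list of token lists, as in the question.
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(train_set)]

model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=20,
                window=5, seed=1337, min_count=5, workers=4)

model.build_vocab(documents)

# One train() call: gensim runs all 20 passes itself and decays the
# learning-rate internally, from the default alpha=0.025 down to the
# default min_alpha=0.0001.
model.train(documents, total_examples=model.corpus_count,
            epochs=model.epochs)
model.save(model_name)  # model_name: your chosen filename, as in the question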

Also, don't use a low starting alpha learning-rate (0.001) that then rises to a min_alpha value 25 times larger (0.025) - the learning-rate is supposed to decay over training, not rise, and most users shouldn't need to adjust the alpha-related defaults at all. (Again, if you're getting this from an online example somewhere - that's a bad example. Let them know they're giving bad advice.)

Also, 4000 training epochs is absurdly large. A value of 10-20 is common in published work dealing with tens-of-thousands to millions of documents. A smaller dataset may not work well with Doc2Vec at all, though sometimes more epochs (or a smaller vector_size) can still learn something generalizable from tiny data - but even then, expect to use closer to dozens of epochs, not thousands.

A good intro (albeit with a tiny dataset that barely works with Doc2Vec) is the doc2vec-lee.ipynb Jupyter notebook that's bundled with gensim, and also viewable online at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

Good luck!

Seaweed answered 10/7, 2018 at 3:37 Comment(4)
Okay. When would one need to call train() more than once? - Picaresque
An advanced user who needed to do some mid-training logging or analysis or adjustment might split the training over multiple train() calls, and very consciously manage the effective alpha parameters for each call. An extremely advanced user experimenting with further training on an already-trained model might also try it, aware of all of the murky quality/balance issues that might involve. But essentially, unless you already know specifically why you'd need to do so, and the benefits and risks, it's a bad idea. - Seaweed
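
For illustration only, a sketch of what such consciously-managed multi-call training might look like, using train()'s start_alpha/end_alpha parameters to reproduce by hand the linear decay a single train() call would apply anyway:

# Expert-only sketch: per-epoch train() calls with manual alpha decay.
alpha, min_alpha, passes = 0.025, 0.0001, 20
step = (alpha - min_alpha) / passes
for epoch in range(passes):
    model.train(documents, total_examples=model.corpus_count, epochs=1,
                start_alpha=alpha, end_alpha=alpha - step)
    # ...mid-training logging, analysis, or checkpointing could go here...
    alpha -= step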
Could you also explain the vector_size parameter? How do I decide its value? - Picaresque
It's the size in dimensions of the word-vectors/doc-vectors that are created, and typical values range from 100 (the gensim default, for speed and memory-compactness) to 1000. Values of 300-400 seem especially common for word-vectors. The only way to know what's best for your data/goal is to search over different values, using a rigorous repeatable evaluation to score each option. Larger values will only make sense if you have a lot of data, and RAM, and time to train. (If working on toy-sized examples of a few hundred or few thousand texts, even 100 dimensions may be too much.) - Seaweed
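
As a sketch of such a search, assuming a task-specific score_model() function that returns a quality score (hypothetical - gensim provides no such function, so you must supply your own evaluation):

# Hypothetical grid search over vector_size values.
results = {}
for size in (50, 100, 200, 300, 400):
    m = Doc2Vec(vector_size=size, dm=0, dbow_words=1, epochs=20,
                min_count=5, workers=4)
    m.build_vocab(documents)
    m.train(documents, total_examples=m.corpus_count, epochs=m.epochs)
    results[size] = score_model(m)  # score_model: your own repeatable evaluation
best_size = max(results, key=results.get)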
