Improving Gensim Doc2vec results
I tried to apply Doc2Vec on 600,000 rows of sentences. Code as below:

from gensim import models
model = models.Doc2Vec(alpha=.025, min_alpha=.025, min_count=1, workers = 5)
model.build_vocab(res)
token_count = sum([len(sentence) for sentence in res])
token_count

%%time
for epoch in range(100):
    #print ('iteration:'+str(epoch+1))
    #model.train(sentences)
    model.train(res, total_examples = token_count,epochs = model.iter)
    model.alpha -= 0.0001  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

I am getting very poor results with the above implementation. The only change I made from the tutorial was to replace the line below:

  model.train(sentences)

As:

token_count = sum([len(sentence) for sentence in res])
model.train(res, total_examples = token_count, epochs = model.iter)
Gisser answered 19/12, 2017 at 15:20 Comment(0)

Unfortunately, your code is a nonsensical mix of misguided practices, so don't follow whatever online example you're following!

Taking the problems in order, top to bottom:

Don't make min_alpha the same as alpha. The stochastic-gradient-descent optimization process needs a gradual decline from a larger to a smaller alpha learning rate over the course of seeing many varied examples, and should generally end with a negligible, near-zero value. (There are other problems with the code's attempt to explicitly decrement alpha in this way, which we'll get to below.) Only expert users who already have a working setup, understand the algorithm well, and are performing experimental tweaks should change the alpha/min_alpha defaults.
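For contrast, a minimal sketch of leaving the learning-rate schedule at its defaults (gensim 3.x-style parameter names assumed):

from gensim.models import Doc2Vec

# defaults: alpha=0.025, min_alpha=0.0001; gensim linearly decays
# alpha toward min_alpha internally across the requested epochs
model = Doc2Vec(workers=5)  # no alpha/min_alpha overrides needed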

Don't set min_count=1. Rare words that appear only once, or a few times, are generally not helpful for Word2Vec/Doc2Vec training. Their few occurrences mean their corresponding model weights don't get much training, and those few occurrences are more likely to be unrepresentative of the words' true meaning (as might be reflected in test data or later production data). So the model's representations of these individual rare words are unlikely to become very good. But in total, all those rare words compete a lot with other words that do have a chance to become meaningful – so the 'rough' rare words are mainly random interference against other words. Worse, those rare words add extra model parameters that can make the model superficially better on training data, by memorizing non-generalizable idiosyncrasies there, but worse on future test/production data. So min_count is another default (5) that should only be changed once you have a working baseline – and if you rigorously meta-optimize this parameter later, on a good-sized dataset (like your 600K docs), you're quite likely to find that a higher min_count, rather than a lower one, improves final results.
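To see the effect concretely, you can build the vocabulary at a few min_count settings and compare how many words survive. A minimal sketch, reusing your res corpus and assuming gensim 3.x attribute names:

from gensim.models import Doc2Vec

for mc in (1, 5, 10):
    m = Doc2Vec(min_count=mc)
    m.build_vocab(res)  # res: your list of TaggedDocuments
    # words below the min_count threshold are discarded from the vocabulary
    print('min_count=%d keeps %d words' % (mc, len(m.wv.vocab)))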

Why make a token_count? There's no later place where a total token count is needed. The total_examples parameter expects a count of text examples – that is, the number of individual documents/sentences – not total words. By supplying the (much larger) word count, train() can't manage alpha correctly or estimate progress in its logged output.

Don't call train() multiple times in a loop with your own explicit alpha management, unless you're positive you know what you're doing. Most people get it wrong. By supplying the default model.iter (which has a value of 5) as the epochs parameter here, you're actually performing 500 total passes over your corpus, which is unlikely to be what you want. And by decrementing the initial 0.025 alpha by 0.0001 over 100 loops, you're winding up with a final alpha of 0.015 – still well above the negligible, near-zero value training should end with. Instead, call train() exactly once, with a correct total_examples and a well-chosen epochs value (10 to 20 are often used in published Doc2Vec work). Then it will do the exact right number of iterations, manage alpha intelligently, and print accurate progress estimates in logging.
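Putting those fixes together, a minimal sketch of the corrected flow (gensim 3.x-style API; the epochs=20 value is illustrative, not taken from your code):

from gensim.models import Doc2Vec

model = Doc2Vec(workers=5)  # default alpha, min_alpha & min_count
model.build_vocab(res)      # res: your corpus of TaggedDocuments
model.train(res,
            total_examples=model.corpus_count,  # document count, set by build_vocab()
            epochs=20)  # a single call; alpha decays automatically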

Finally, this next thing isn't necessarily a problem in your code, since you don't show how your corpus res is constructed, but there's a common error to beware: make sure your corpus can be iterated over multiple times (as if it were an in-memory list, or a restartable iterable object over something coming from IO). Often people supply a single-use iterator, which after one pass (as during build_vocab()) yields nothing more – resulting in instant training and a uselessly still-random, untrained model. (If you've enabled logging, and pay attention to the logged output and the timing of each step, it'll be obvious if this is a problem.)
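If res currently comes from a generator or another single-pass source, wrap it in a restartable iterable instead. A sketch assuming a hypothetical sentences.txt with one whitespace-tokenized document per line:

from gensim.models.doc2vec import TaggedDocument

class RestartableCorpus(object):
    """Starts a fresh pass over the file each time it's iterated."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[i])

res = RestartableCorpus('sentences.txt')  # usable by both build_vocab() and train()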

Reclaim answered 19/12, 2017 at 17:37 Comment(5)
'res' is a list of tagged documents and corresponding tags – Gisser
"and a well-chosen epochs value" -> what about shuffling? I have a loop running 50 times (== 50 epochs?), which just executes shuffling + train (but I don't pass any epochs (== None)). Is shuffling required? – Jerk
Shuffling between training epochs isn't required – though if your corpus starts with some ordering patterns (like all docs of a certain type/topic clumped together), one initial shuffle may help. Calling train() multiple times in your own loop is almost always unnecessary (& usually accompanied by other mistakes like improper alpha management). – Reclaim
@Reclaim That was very helpful. Do you have any additional sources on hyperparameter tuning for Doc2Vec? – Violette
No single source, sorry; my tips are scattered among various SO answers and gensim discussion-list threads (groups.google.com/forum/#!forum/gensim). – Reclaim
