Fine-tuning the pre-trained Google News word2vec model

I am currently using the Word2Vec model trained on the Google News corpus (from here). Since it was trained only on news up to 2013, I need to update the vectors and also add new words to the vocabulary based on news published after 2013.

Suppose I have a new corpus of news after 2013. Can I re-train, fine-tune, or update the Google News Word2Vec model? Can it be done using Gensim? Can it be done using FastText?

Jilli answered 15/9, 2017 at 16:48

You can have a look at this: https://github.com/facebookresearch/fastText/pull/423

It does exactly what you want. Here is what the PR says:

Training the classification model or word vector model incrementally.

./fasttext [supervised | skipgram | cbow] -input train.data -inputModel trained.model.bin -output re-trained [other options] -incr

-incr stands for incremental training.

When training word embeddings, one could either retrain from scratch on all the data each time, or train only on the new data. For classification, one could retrain from scratch with pre-trained word embeddings on all the data, or on only the new data, leaving the word embeddings unchanged.

Incremental training means taking a model already trained on earlier data and continuing to train it on newer data, rather than starting from scratch.
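
For example, continuing to train a skip-gram model on a newer corpus would look like this (assuming you build fastText from that PR's branch, since the -incr flag is not part of mainline fastText; the file names here are illustrative):

./fasttext skipgram -input news-2014.txt -inputModel news-2013.bin -output news-2014-model -incr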

Thiouracil answered 18/6, 2018 at 12:58

Yes, you can. I have been working on this recently as well.
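
One caveat: the GoogleNews download contains only the raw vectors, not the full model state (hidden-layer weights, vocabulary frequencies), so you cannot literally resume its original training. A common workaround in Gensim is to seed a fresh Word2Vec model with those vectors and then train on the new corpus. A minimal sketch, assuming gensim 4.x and the GoogleNews-vectors-negative300.bin file; the corpus, file names, and hyperparameters are illustrative:

from gensim.models import Word2Vec
import numpy as np

# new_corpus: an iterable of tokenized sentences from post-2013 news
new_corpus = [["stocks", "rally", "after", "announcement"],
              ["new", "vaccine", "approved"]]

model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(new_corpus)

# Allocate per-word lock factors so the imported vectors stay trainable
# (gensim 4.x only allocates a scalar lock factor by default).
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)

# Seed words present in both vocabularies with the Google News vectors;
# lockf=1.0 lets them keep updating during training.
model.wv.intersect_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)

model.train(new_corpus, total_examples=model.corpus_count, epochs=5)

Note that words absent from the new corpus's vocabulary are dropped, and this only approximates true fine-tuning, since the original training state is lost.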

Edit: GloVe has the overhead of computing and storing the co-occurrence matrix in memory during training, while training word2vec is comparatively lightweight.

Laris answered 18/4, 2019 at 17:24
