How to use Gensim doc2vec with pre-trained word vectors?

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec?

Or does doc2vec get its word vectors from the same sentences it uses for paragraph-vector training?

Thanks.

Pruchno answered 14/12, 2014 at 15:13 Comment(0)

Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).

(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested would co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which skip-gram trains words simultaneously with the DBOW doc-vectors. Note that this makes training take longer – by a factor related to window. So if you don't need word-vectors, you may still leave this off.)
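
For illustration (not from the original answer), a minimal sketch of the two DBOW variants using the current gensim API, where the 0.12-era size parameter is now vector_size; the toy documents are made up:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(["the cat sat on the mat",
                                  "dogs chase cats in the yard"])]

# Pure DBOW: only doc-vectors are trained; word-vectors are left untrained.
dbow_only = Doc2Vec(docs, dm=0, vector_size=50, window=5, min_count=1, epochs=40)

# DBOW plus interleaved skip-gram word training: slower, but yields usable word-vectors too.
dbow_plus_words = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50, window=5,
                          min_count=1, epochs=40)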

In the "DM" training method (dm=1), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors. It's theoretically possible to pre-initialize the word-vectors from prior data. But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.

One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start – better predictive qualities after the first few passes – but this advantage faded with more passes. Whether you hold the word vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.

(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making pre-loaded vectors resistant-to-new-training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)
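
If you want to repeat that experiment, here is a rough sketch of the setup against the gensim-0.12-era API discussed here (attribute and argument names changed in later releases, and a comment below notes this path eventually broke for Doc2Vec, so treat it as illustrative only):

from gensim.models.doc2vec import Doc2Vec

# Rough sketch only: 'sentences' stands for your own LabeledSentence/TaggedDocument corpus,
# and the .bin file name is a placeholder for whatever pre-trained vectors you have.
model = Doc2Vec(dm=1, size=300, window=5, min_count=5)
model.build_vocab(sentences)

# Overwrite vectors for words present both in the model's vocab and in the file;
# the file's dimensionality must match the model's (here 300).
model.intersect_word2vec_format('pretrained_word_vectors.bin', binary=True)

# syn0_lockf controls how much each word-vector may still move during training:
# 1.0 (the default) keeps it fully trainable, 0.0 freezes it at the pre-loaded values.
model.syn0_lockf[:] = 0.0

model.train(sentences)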

Stellarator answered 19/5, 2015 at 22:19 Comment(4)
I think that pre-trained word vectors could improve results obtained with dm=1, especially on small datasets, as there would already be a strong relation between similar words. So if I have the sentence "I like hot chocolate" in my training data, but the word "warm" never appears there, then my model would not know what vector to assign "I like warm chocolate" if I do not pre-initialize the vectors, right? But it would benefit from word2vec initialization and assign similar vectors to the two sentences. This is just a guess and I do not know if it is true.Ferrel
I also do not know how doc2vec handles unseen words. So assume that I have some words X seen in training and then I infer vectors for documents that also contain unseen words Y. Will these be assigned random vectors or just be ignored? Would the behaviour be different if I pre-initialize with word2vec, as there are many more words in that corpus? Or can I only pre-initialize word vectors that will also be encountered during training of the doc2vec model?Ferrel
Words not present in the original corpus (or that don't survive other vocab trimming, such as by min_count) are ignored during training/inference - so passing them to infer_vector() has no effect. (If you pass a text with all unknown words, it's like passing an empty list.) There's no supported way to pre-initialize Doc2Vec with other word-vectors; and even the experimental intersect_word2vec_format() method I mention in the above 3-year-old answer: (1) has broken in recent versions for Doc2Vec; (2) only left the vocabulary with words in whatever initialization corpus you provided.Stellarator
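
As a small illustration of that out-of-vocabulary behaviour (current gensim API; model stands for any trained Doc2Vec instance, and the word lists are made up):

vec_both    = model.infer_vector(['hot', 'chocolate'])          # both words in vocab: both contribute
vec_partial = model.infer_vector(['warm', 'hot', 'chocolate'])  # 'warm' unseen in training: silently skipped
vec_none    = model.infer_vector(['warm'])                      # no known words: same as inferring from an empty list
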
Hi @Stellarator Despite your hints, I don't understand conceptually how the co-training of word and paragraph vectors is possible in DBOW mode. I've created a question for this. Would you please have a look: #55592642Correspondent

Well, I have recently been using Doc2Vec too. I was thinking of using the LDA result as the word vectors, fixing those word vectors, and then training only the document vectors. The result isn't very interesting, though; maybe my data set just isn't that good. The code is below (note that it targets the old, pre-0.12 gensim API and Python 2). Doc2Vec stores word vectors and document vectors together in the array doc2vecmodel.syn0, and you can change the vector values directly. The only problem is that you need to find out which position in syn0 represents which word or document, since the vectors are stored there in arbitrary order.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
import gensim
from sklearn import svm, metrics
import numpy

#Read in texts into div_texts(for LDA and Doc2Vec)
div_texts = []
f = open("clean_ad_nonad.txt")
lines = f.readlines()
f.close()
for line in lines:
    div_texts.append(line.strip().split(" "))

#Set up dictionary and MMcorpus
dictionary = corpora.Dictionary(div_texts)
dictionary.save("ad_nonad_lda_deeplearning.dict")
#dictionary = corpora.Dictionary.load("ad_nonad_lda_deeplearning.dict")
print dictionary.token2id["junk"]
corpus = [dictionary.doc2bow(text) for text in div_texts]
corpora.MmCorpus.serialize("ad_nonad_lda_deeplearning.mm", corpus)

#LDA training
id2token = {}
token2id = dictionary.token2id
for onemap in dictionary.token2id:
    id2token[token2id[onemap]] = onemap
#ldamodel = models.LdaModel(corpus, num_topics = 100, passes = 1000, id2word = id2token)
#ldamodel.save("ldamodel1000pass.lda")
#ldamodel = models.LdaModel(corpus, num_topics = 100, id2word = id2token)
ldamodel = models.LdaModel.load("ldamodel1000pass.lda")
ldatopics = ldamodel.show_topics(num_topics = 100, num_words = len(dictionary), formatted = False)
print ldatopics[10][1]
print ldatopics[10][1][1]
ldawordindex = {}
for i in range(len(dictionary)):
    ldawordindex[ldatopics[0][i][1]] = i

#Doc2Vec initialize
sentences = []
for i in range(len(div_texts)):
    string = "SENT_" + str(i)
    sentence = models.doc2vec.LabeledSentence(div_texts[i], labels = [string])
    sentences.append(sentence)
doc2vecmodel = models.Doc2Vec(sentences, size = 100, window = 5, min_count = 0, dm = 1)
print "Initial word vector for word junk:"
print doc2vecmodel["junk"]

#Replace the word vector with word vectors from LDA
print len(doc2vecmodel.syn0)
index2wordcollection = doc2vecmodel.index2word
print index2wordcollection
for i in range(len(doc2vecmodel.syn0)):
    if index2wordcollection[i].startswith("SENT_"):
        continue
    wordindex = ldawordindex[index2wordcollection[i]]
    wordvectorfromlda = [ldatopics[j][wordindex][0] for j in range(100)]
    doc2vecmodel.syn0[i] = wordvectorfromlda
#print doc2vecmodel.index2word[26841]
#doc2vecmodel.syn0[0] = [0 for i in range(100)]
print "Changed word vector for word junk:"
print doc2vecmodel["junk"]

#Train Doc2Vec
doc2vecmodel.train_words = False 
print "Initial doc vector for 1st document"
print doc2vecmodel["SENT_0"]
for i in range(50):
    print "Round: " + str(i)
    doc2vecmodel.train(sentences)
print "Trained doc vector for 1st document"
print doc2vecmodel["SENT_0"]

#Using SVM to do classification
resultlist = []
for i in range(4143):
    string = "SENT_" + str(i)
    resultlist.append(doc2vecmodel[string])
svm_x_train = []
for i in range(1000):
    svm_x_train.append(resultlist[i])
for i in range(2210,3210):
    svm_x_train.append(resultlist[i])
print len(svm_x_train)

svm_x_test = []
for i in range(1000,2210):
    svm_x_test.append(resultlist[i])
for i in range(3210,4143):
    svm_x_test.append(resultlist[i])
print len(svm_x_test)

svm_y_train = numpy.array([0 for i in range(2000)])
for i in range(1000,2000):
    svm_y_train[i] = 1
print svm_y_train

svm_y_test = numpy.array([0 for i in range(2143)])
for i in range(1210,2143):
    svm_y_test[i] = 1
print svm_y_test


svc = svm.SVC(kernel='linear')
svc.fit(svm_x_train, svm_y_train)

expected = svm_y_test
predicted = svc.predict(svm_x_test)

print("Classification report for classifier %s:\n%s\n"
      % (svc, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

print doc2vecmodel["junk"]
Ampereturn answered 30/12, 2014 at 5:3 Comment(0)

This forked version of gensim allows loading pre-trained word vectors for training doc2vec. Here is an example of how to use it. The word vectors must be in the C word2vec-tool text format: one line per word vector, where the word comes first as a string, followed by space-separated float values, one for each dimension of the embedding.
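
For illustration, a file in that format might look like this (made-up words and values for a 4-dimensional embedding):

the 0.418 0.250 -0.412 0.122
cat 0.152 0.302 -0.168 0.093
dog 0.087 -0.114 0.251 0.301

(Note that the standard C word2vec text output also puts a header line with the vocabulary size and the dimensionality before the vectors; check which variant the fork expects.)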

This work belongs to a paper in which they claim that using pre-trained word embeddings actually helps to build the document vectors. However, I am getting almost the same results whether or not I load the pre-trained embeddings.

Edit: there is actually one notable difference in my experiments. When I loaded the pre-trained embeddings, I only needed to train doc2vec for half as many iterations to get almost the same results (training longer than that produced worse results on my task).

Dartmouth answered 5/9, 2016 at 20:53 Comment(0)

Radim just posted a tutorial on the doc2vec features of gensim (yesterday, I believe - your question is timely!).

Gensim supports loading pre-trained vectors from the C implementation, as described in the gensim models.word2vec API documentation.
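
For example, a minimal sketch of loading such C-format vectors (in current gensim this lives on KeyedVectors; in the 2014-era API it was Word2Vec.load_word2vec_format; the file name is only an example):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(wv.most_similar('chocolate', topn=3))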

Marleenmarlen answered 16/12, 2014 at 19:46 Comment(3)
Thanks Aaron.. Indeed timely question :) This is written in the tutorial: "...if you only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False".. I know that you can use pre-trained vectors for word2vec.. The question is, how do I call doc2vec with those pre-trained vectors?Pruchno
@Stergios: Maybe I'm misunderstanding the question (and I'm still stumbling through this myself). But it looks like inference is still not really implemented - see groups.google.com/forum/#!topic/gensim/EFy1f0QwkKI. Thankfully, there are at least a couple people actively working on it. I'm guessing the sequence will be something like 1) Load pre-trained vectors; 2) Create a vector for your unseen sentence with a new label ; 3) Call most_similar("NEW_LABEL"). Alternatively, create vectors for multiple unseen sentences and compute distances between those vectors. But that's just a guess.Marleenmarlen
I know this is a bit old, but did you manage to figure out how to get GloVe and doc2vec to work together?Godhead
