I want to construct word embeddings for documents using the word2vec tool. I know how to find a vector embedding corresponding to a single word (unigram). Now, I want to find a vector for a bigram. Is it possible to construct a bigram word embedding using word2vec? If yes, how?
Bigram vector representations using word2vec
The following snippet will get you the vector representation of a bigram. Note that the bigram you want to convert to a vector needs an underscore instead of a space between the words, e.g. bigram2vec(unigrams, "this report") is wrong; it should be bigram2vec(unigrams, "this_report"). For more details on generating the unigrams, please see the gensim.models.word2vec.Word2Vec class here.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

def bigram2vec(unigrams, bigram_to_search):
    # unigrams: the tokenized corpus, a list of lists of tokens
    bigrams = Phrases(unigrams)          # learn which word pairs to merge into bigrams
    model = Word2Vec(bigrams[unigrams])  # train on the phrase-merged corpus
    if bigram_to_search in model.wv.key_to_index:  # model.vocab in gensim < 4
        return model.wv[bigram_to_search]
    return None
What is unigrams here? – Cinematograph
Good question, unigrams is the corpus words represented as a list. More details with an example here: radimrehurek.com/gensim/models/phrases.html – Kissel
Note that unigrams must be a list of lists. Further, model.vocab.keys() no longer works; it's replaced with model.wv.key_to_index. – Linebreeding
from gensim.models import Word2Vec, Phrases – Ethnography