How can I get vectors for words that were not present in the word2vec vocabulary?

I have checked the previous post link, but it doesn't seem to work for my case:

I have a pre-trained word2vec model:

from gensim.models import Word2Vec

model = Word2Vec.load('w2v_model')

Now I have a pandas dataframe with keywords:

keyword
corruption
people
budget
cambodia
.......
......

All I want is to add each keyword's vector components as corresponding columns, but when I use model['cambodia'] it throws KeyError: "word 'cambodia' not in vocabulary".

So I tried to update the model with the new keyword:

model.train(['cambodia'])
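# note: in gensim, train() only updates weights for words already in the
# vocabulary; it does not add new words (and it expects tokenized
# sentences, not a bare word list)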

But this doesn't work for me either: when I use model['cambodia'], it still raises KeyError: "word 'cambodia' not in vocabulary". How do I add new words to the word2vec vocabulary so that I can get their vectors? The expected output is:

keyword    V1         V2          V3         V4            V5         V6   
corruption 0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
people      ..............................................................
budget      ...........................................................
Analcite asked 4/7, 2018 at 7:49
Possible duplicate of Update gensim word2vec model - "The word2vec algorithm doesn't support adding new words dynamically." So, no, it isn't possible unless you retrain the entire model with the new vocab. – Diphenylhydantoin
@ukemi The post you mentioned doesn't seem to work for me; I have already checked it. – Analcite
@ukemi isn't linking to a post, but to a comment on that post. That comment says that what you want isn't possible: you have to train the model with all the words you want vectorised. I think FastText can handle out-of-vocabulary words if you configure it right; see the sketch after these comments. – Skeens
You can also look into Hash Embeddings, which can be trained similarly to word2vec and later be updated with new words. – Methedrine
@Methedrine Is it possible to add new words to the vocabulary of the pretrained Google word2vec model, so that I can use their vectors later? – Analcite
@Analcite I'm not familiar with the details of the various word2vec implementations. Generally, if you load pretrained embedding weights into a trainable model and allow those pretrained weights to change, the model will learn which embedding vectors to assign to new input words. However, this is not trivial, and you will likely arrive at a task-specific embedding (usually that's fine) which does not generalize well. – Methedrine
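
To illustrate the FastText suggestion from the comments: a minimal sketch, assuming gensim 4.x (in gensim < 4.0 the vector_size parameter is called size) and a made-up toy corpus. FastText builds word vectors from character n-grams, so it can compose a vector even for a word it never saw during training:

from gensim.models import FastText

# Toy tokenized corpus, purely for illustration.
sentences = [['corruption', 'people', 'budget'],
             ['people', 'budget', 'cambodia']]

# min_n/max_n set the character n-gram range the vectors are composed from.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6)

ft.wv['kampuchea']  # out-of-vocabulary word still yields a composed vector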

You can initialize the first vector (id 0) as [0, 0, ..., 0], and map every word that is not in the vocabulary to id 0.

id         V1         V2          V3         V4            V5         V6  
0          0          0           0           0           0           0
1       0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
2      ..............................................................
3      ...........................................................

You can use two dicts to solve the problem.

word2id['corruption'] = 1
vec['corruption'] = [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846]
 ...
word2id['cambodia'] = 0
vec['cambodia'] = [0, 0, 0, 0, 0, 0]
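
A runnable sketch of this idea, assuming the gensim model from the question and an illustrative keyword list; words found in the vocabulary get ids 1, 2, ... and their model vectors, while out-of-vocabulary words all share id 0 and the zero vector:

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('w2v_model')
dim = model.wv.vector_size           # length of the zero vector for OOV words

keywords = ['corruption', 'people', 'budget', 'cambodia']  # illustrative

word2id = {}
vec = {}
for i, word in enumerate(keywords, start=1):
    if word in model.wv:             # vocabulary membership test
        word2id[word] = i
        vec[word] = model.wv[word]
    else:                            # OOV words map to id 0 and all zeros
        word2id[word] = 0
        vec[word] = np.zeros(dim)

From there, vec can be expanded into the V1...V6-style columns of the expected output with pandas.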
Daumier answered 26/7, 2018 at 8:38
