How can I get vectors for words that were not present in the word2vec vocabulary?

I have checked the previous post link, but it doesn't seem to work for my case:

I have a pre-trained word2vec model:

from gensim.models import Word2Vec

model = Word2Vec.load('w2v_model')

Now I have a pandas dataframe with keywords:

keyword
corruption
people
budget
cambodia
.......
......

All I want is to add each keyword's vector components as corresponding columns, but when I use model['cambodia'] it throws KeyError: "word 'cambodia' not in vocabulary".

So I tried to update the model with the new keyword:

model.train(['cambodia'])
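# note: in gensim, train() only updates weights for words already in the
# vocabulary; it does not add new words (and it expects tokenized
# sentences, not a bare word list)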

But this doesn't work for me either: when I use model['cambodia'], it still raises KeyError: "word 'cambodia' not in vocabulary". How do I add new words to the word2vec vocabulary so that I can get their vectors? The expected output is:

keyword    V1         V2          V3         V4            V5         V6   
corruption 0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
people      ..............................................................
budget      ...........................................................
Analcite asked 4/7, 2018 at 7:49
Possible duplicate of Update gensim word2vec model - "The word2vec algorithm doesn't support adding new words dynamically." So, no, it isn't possible unless you retrain the entire model with the new vocab. – Diphenylhydantoin
@ukemi The post you mentioned doesn't seem to work for me; I have already checked it. – Analcite
@ukemi isn't linking to a post, but to a comment on that post. That comment says that what you want isn't possible: you have to train the model with all the words you want vectorised. I think FastText can handle out-of-vocabulary words if you configure it right; see the sketch after these comments. – Skeens
You can also look into Hash Embeddings, which can be trained similarly to word2vec and later be updated with new words. – Methedrine
@Methedrine Is it possible to add new words to the vocabulary of the pretrained Google word2vec model, so that I can use their vectors later? – Analcite
@Analcite I'm not familiar with the details of the various word2vec implementations. Generally, if you load pretrained embedding weights into a trainable model and allow those pretrained weights to change, the model will learn which embedding vectors to assign to new input words. However, this is not trivial, and you will likely arrive at a task-specific embedding (usually that's fine) which does not generalize well. – Methedrine
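
To illustrate the FastText suggestion from the comments: a minimal sketch, assuming gensim 4.x (in gensim < 4.0 the vector_size parameter is called size) and a made-up toy corpus. FastText builds word vectors from character n-grams, so it can compose a vector even for a word it never saw during training:

from gensim.models import FastText

# Toy tokenized corpus, purely for illustration.
sentences = [['corruption', 'people', 'budget'],
             ['people', 'budget', 'cambodia']]

# min_n/max_n set the character n-gram range the vectors are composed from.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6)

ft.wv['kampuchea']  # out-of-vocabulary word still yields a composed vector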

You can initialize the first vector (id 0) as [0, 0, ..., 0], and map every word that is not in the vocabulary to id 0.

id         V1         V2          V3         V4            V5         V6  
0          0          0           0           0           0           0
1       0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
2      ..............................................................
3      ...........................................................

You can use two dicts to solve the problem.

word2id['corruption'] = 1
vec['corruption'] = [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846]
 ...
word2id['cambodia'] = 0
vec['cambodia'] = [0, 0, 0, 0, 0, 0]
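
A runnable sketch of this idea, assuming the gensim model from the question and an illustrative keyword list; words found in the vocabulary get ids 1, 2, ... and their model vectors, while out-of-vocabulary words all share id 0 and the zero vector:

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('w2v_model')
dim = model.wv.vector_size           # length of the zero vector for OOV words

keywords = ['corruption', 'people', 'budget', 'cambodia']  # illustrative

word2id = {}
vec = {}
for i, word in enumerate(keywords, start=1):
    if word in model.wv:             # vocabulary membership test
        word2id[word] = i
        vec[word] = model.wv[word]
    else:                            # OOV words map to id 0 and all zeros
        word2id[word] = 0
        vec[word] = np.zeros(dim)

From there, vec can be expanded into the V1...V6-style columns of the expected output with pandas.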
Daumier answered 26/7, 2018 at 8:38
