What is the best way to handle missing words when using word embeddings?

I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?

I've heard several suggestions (a rough code sketch of each is at the end of this question):

  1. use a vector of zeros for every missing word

  2. use a vector of random numbers for every missing word (with a bunch of suggestions on how to bound those randoms)

  3. an idea I had: use a vector whose values are the element-wise mean of all the pre-trained vectors

Does anyone with experience with this problem have thoughts on how best to handle it?
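For concreteness, here is a rough sketch of what those three options could look like in code. This assumes gensim's KeyedVectors and a hypothetical vectors.bin file, and the random bounds are just one of the suggestions I've seen, not a fixed rule:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name; any word2vec-format file would work here.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
dim = kv.vector_size

# Option 1: a single all-zeros vector shared by every missing word.
zero_vec = np.zeros(dim, dtype=np.float32)

# Option 2: a fresh random vector per missing word, bounded to roughly the
# scale of the trained vectors (the exact bounds vary between suggestions).
rng = np.random.default_rng(0)
def random_vec():
    return rng.uniform(-0.25, 0.25, dim).astype(np.float32)

# Option 3: the element-wise mean of all pre-trained vectors.
mean_vec = kv.vectors.mean(axis=0)

def lookup(word, strategy="mean"):
    if word in kv:
        return kv[word]
    if strategy == "zeros":
        return zero_vec
    if strategy == "random":
        return random_vec()
    return mean_vec
```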

Indirection asked 9/2, 2018 at 1:51

FastText from Facebook assembles word vectors from subword n-grams, which allows it to handle out-of-vocabulary words. See more about this approach at: Out of Vocab Word Embedding
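As a rough illustration (using gensim, which is just one common way to work with fastText from Python, and a toy corpus), a trained FastText model can return a vector for a word it never saw, because the vector is assembled from the word's character n-grams:

```python
from gensim.models import FastText

# Toy corpus; in practice you would train on your own corpus or load
# pre-trained vectors (e.g. gensim.models.fasttext.load_facebook_vectors).
sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]
model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

oov_word = "foxes"  # never appears in the training sentences
print(oov_word in model.wv.key_to_index)  # False: the word itself is unknown
print(model.wv[oov_word].shape)           # (50,): a vector is still built from
                                          # the word's subword n-grams
```

To fill an embeddings matrix from this, you would simply query model.wv[word] for every word in your own vocabulary, whether it was seen during training or not.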

Wholesome answered 9/2, 2018 at 2:13
Comment from Rigamarole: Are you aware of any tutorial with Python code on how to create vectors for missing vocabulary with fastText while passing the weights to your embeddings matrix? I can't find anything on it.

In a pre-trained word2vec embedding matrix, there is usually a special token such as unk that you can use as the index to look up a predesignated vector for unknown words; using that vector is often the best choice.
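A minimal sketch of that idea, assuming gensim's KeyedVectors, a hypothetical pretrained.bin file, and that the file actually ships a vector under the key unk (the exact unknown token, if any, varies by model):

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("pretrained.bin", binary=True)

UNK_TOKEN = "unk"  # assumption: check which unknown token (if any) your file uses

def vector_for(word):
    """Return the word's vector, falling back to the unk vector,
    then to zeros if no unk vector exists."""
    if word in kv:
        return kv[word]
    if UNK_TOKEN in kv:
        return kv[UNK_TOKEN]
    return np.zeros(kv.vector_size, dtype=np.float32)
```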

Liminal answered 27/11, 2018 at 10:38
