How to add new embeddings for unknown words in Tensorflow (training & pre-set for testing)

I am curious as to how I can add a randomly initialized (normal distribution) 300-dimensional vector (element type tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases I encounter unknown words, and I want to create a normally-distributed random word vector for each newly found unknown word.

The problem is that with my current set up, I use tf.contrib.lookup.index_table_from_tensor to convert from words to integers based on the known vocabulary. This function can create new tokens and hash them for some predefined number of out of vocabulary words, but my embed will not contain an embedding for this new unknown hash value. I am uncertain if I can simply append a randomized embedding to the end of the embed list.

I also would like to do this in an efficient way, so a pre-built TensorFlow function or a method involving TensorFlow functions would probably be the most efficient. I define pre-known special tokens, such as an end-of-sentence token and a default unknown as the empty string ("", at index 0), but this is limited in its power to learn representations for the various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.

I would like to be able to add a new random 300d vector for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that are possibly encountered during testing. What is the most efficient way of doing this?

import numpy as np
import tensorflow as tf

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vectors
    vocab, embed = load_pretrained_glove()

    # Set up a TensorFlow lookup from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    embedding_init = tf.Variable(tf.constant(np.asarray(embed), dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word-embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
Aestivate answered 15/7, 2017 at 0:3

The code example below adapts your embed_tensor function such that words are embedded as follows:

  • For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
  • For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
  • For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300
def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
  """
  Convert List of strings into list of indices then into 300d vectors
  """
  # ordered lists of vocab and corresponding (by index) 300d vector
  pretrained_vocab, pretrained_embs = load_pretrained_glove()
  train_vocab = get_train_vocab()
  only_in_train = list(set(train_vocab) - set(pretrained_vocab))
  vocab = pretrained_vocab + only_in_train

  # Set up tensorflow look up from string word to unique integer
  vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    default_value=len(vocab))
  string_tensor = vocab_lookup.lookup(string_tensor)

  # define the word embedding
  pretrained_embs = tf.get_variable(
      name="embs_pretrained",
      initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
      shape=pretrained_embs.shape,
      trainable=trainable)
  train_embeddings = tf.get_variable(
      name="embs_only_in_train",
      shape=[len(only_in_train), EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=trainable)
  unk_embedding = tf.get_variable(
      name="unk_embedding",
      shape=[1, EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=False)

  embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

  return tf.nn.embedding_lookup(embeddings, string_tensor)
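
To actually evaluate this in a TF 1.x graph, the lookup table created by index_table_from_tensor has to be initialized explicitly. A minimal usage sketch (the example sentence is mine, not from the question):

sentence = tf.constant(["a", "dog", "sat", "on", "the", "unicorn"])
embedded = embed_tensor(sentence)  # "unicorn" falls back to unk_embedding

with tf.Session() as sess:
    # both the variables and the string-to-id lookup table need initialization
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embedded).shape)  # (6, 300)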

FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and making the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
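
A minimal sketch of that preprocessing step (the min_count threshold and the <unk> string are my own choices, not part of the answer): replace rare training words with the unk token before building the vocabulary, and pass trainable=True when creating unk_embedding.

from collections import Counter

def replace_rare_words(train_tokens, min_count=2, unk_token="<unk>"):
    # Map every training token that appears fewer than min_count times to the
    # unk token; since <unk> is kept out of the vocabulary, the lookup's
    # default_value routes it to the unk_embedding row, which can now be trained.
    counts = Counter(train_tokens)
    return [tok if counts[tok] >= min_count else unk_token for tok in train_tokens]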

Kerrikerrie answered 21/8, 2017 at 15:29 Comment(1)
I tried following the above approach, but despite setting the trainable parameter to False for the pretrained embeddings, they still changed during training. https://mcmap.net/q/586491/-ground-pretrained-embedding-while-learning-embedding-for-new-words-in-tensorflow/2061991 – Offenbach

I have never tried it, but I can suggest a possible approach using the same machinery as your code.

The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your OOV words into a predefined number of buckets.

If you set this parameter to a sufficiently large value, you will see your data spread among these buckets (each bucket gets an ID greater than the ID of the last in-vocabulary word).

So,

  • if (at each lookup) you set (i.e. assign) the last rows of your embedding_init Variable (those corresponding to the buckets) to random values,
  • if you make num_oov_buckets large enough that collisions are minimized,

you can obtain behavior that is (an approximation of) what you are asking for in a very efficient way.

The random behavior can be justified by reasoning similar to that behind hash tables: if the number of buckets is large enough, the string hashing will assign each OOV word to a different bucket with high probability (i.e. minimizing collisions into the same buckets). Since you are assigning a different random vector to each bucket, you obtain an (almost) distinct mapping for each OOV word.
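
A rough sketch of this idea (untested; the bucket count and names are mine, and instead of re-assigning random values at each lookup it simply gives each bucket its own fixed, randomly initialized row):

NUM_OOV_BUCKETS = 10000  # "large enough" so that hash collisions stay rare

def embed_tensor_oov(string_tensor, vocab, pretrained_embs, trainable=True):
    # Known words map to ids [0, len(vocab)); OOV words are hashed into
    # ids [len(vocab), len(vocab) + NUM_OOV_BUCKETS).
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        num_oov_buckets=NUM_OOV_BUCKETS)
    ids = vocab_lookup.lookup(string_tensor)

    pretrained = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    # one random row per OOV bucket, kept fixed during training
    oov_embs = tf.get_variable(
        name="embs_oov_buckets",
        shape=[NUM_OOV_BUCKETS, pretrained_embs.shape[1]],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.concat([pretrained, oov_embs], axis=0)
    return tf.nn.embedding_lookup(embeddings, ids)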

Huertas answered 19/8, 2017 at 9:32

An idea I had for this was to capture the new words in the pre-trained embedding by adding a new dimension for each new word (basically maintaining their one-hot nature).

Assuming the number of new words is small but they're important, you could, for instance, increase the dimensionality of your embedded results from 300 to 300 + the number of new words, where each new word gets all zeros except a 1 in its own dimension.
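
A small NumPy sketch of that idea (function and variable names are hypothetical): pad the existing 300d vectors with zeros in the new dimensions, and give each new word a row that is zero everywhere except for a 1 in its own extra dimension.

import numpy as np

def extend_with_one_hot(pretrained_embs, num_new_words):
    # pretrained_embs: [V, 300]  ->  result: [V + num_new_words, 300 + num_new_words]
    old_dim = pretrained_embs.shape[1]
    padded = np.hstack([pretrained_embs,
                        np.zeros((pretrained_embs.shape[0], num_new_words))])
    # each new word: zeros in the original 300 dims, a 1 in its own new dimension
    new_rows = np.hstack([np.zeros((num_new_words, old_dim)),
                          np.eye(num_new_words)])
    return np.vstack([padded, new_rows])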

Malayan answered 7/3, 2018 at 12:45
