Using a pre-trained word embedding (word2vec or Glove) in TensorFlow

Asked 28/2, 2016 at 20:11 Answered 21/2, 2020 at 1:29

Solved python numpy tensorflow deep-learning

101

I've recently reviewed an interesting implementation for convolutional text classification. However, all the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors, like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?

Helmick answered 28/2, 2016 at 20:11 Comment(0)

132

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

1. Simply create W as a tf.constant() that takes embedding as its value:
```
W = tf.constant(embedding, name="W")
```
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.
2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
```
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

# ...
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
```
This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.
3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
```
W = tf.Variable(...)

embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})

# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
```

Trometer answered 28/2, 2016 at 20:59 Comment(11)

I create W as follows: W = np.loadtxt("/media/w2vTest.txt",dtype ='string',delimiter = ' '), which produces rows like ['in' '0.070312......'-0.0625']. There are problems here! Should I treat this as my W after removing 'in' and converting the numbers from string to float32? If so, how do I connect 'in' to its respective vector? Or do I need to convert the figures to float32 and leave 'in' as it is, expecting that TensorFlow will do all the required processing? Thanks! – Helmick 29/2, 2016 at 0:44

Ah, you have a couple of options here. You could use the TensorFlow tf.decode_csv() op to convert the text file into a tensor, but this might be expensive (in particular, it requires you to create one Tensor per column, and then concatenate the numeric ones together). Perhaps an easier alternative would be to use pandas.read_csv() and pandas.DataFrame.as_matrix() to get the input as a NumPy array. – Trometer 29/2, 2016 at 20:57
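
For the text-file case discussed in these comments, one way to keep each word tied to its vector is to build a word-to-row index alongside the matrix. The following is only a minimal sketch using the Python standard library. It assumes one whitespace-separated `word v1 v2 ...` entry per line with no header line (as in the row quoted above); `load_text_embeddings` and the sample data are hypothetical, made up for illustration.

```python
import io


def load_text_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into (word_to_row, matrix).

    Hypothetical helper: assumes whitespace-separated fields, no header
    line, and the same number of values on every line.
    """
    word_to_row = {}
    matrix = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word = fields[0]
        values = [float(v) for v in fields[1:]]
        if matrix and len(values) != len(matrix[0]):
            raise ValueError("inconsistent vector length for %r" % word)
        word_to_row[word] = len(matrix)  # row index of this word
        matrix.append(values)
    return word_to_row, matrix


# Tiny made-up sample standing in for a real embedding file.
sample = io.StringIO("in 0.070312 -0.0625\nthe 0.125 0.25\n")
word_to_row, matrix = load_text_embeddings(sample)
```

The nested list can be converted to the array used in the answer with `np.asarray(matrix, dtype=np.float32)`, and `word_to_row` maps each token to the integer id that is later passed to tf.nn.embedding_lookup().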

Using option 2, is there a way to throw away the NumPy array and save some memory? – Legere 7/3, 2016 at 2:55

The NumPy array should be garbage collected after the call to sess.run(embedding_init, ...) returns (assuming you don't keep a reference to it in your program). Depending on the structure of your program, you might want to del embedding (where embedding is the NumPy array) to release the array earlier. – Trometer 7/3, 2016 at 5:42

@mrry: can you talk more about option 1, and more specifically "it is not memory efficient because the value of a tf.constant() is stored multiple times in memory"? Memory inefficient for the GPU or the CPU? More generally, why does tf.constant() have to have multiple copies in memory, while the tf.Variable() + feeding placeholder of option 2 does not have this problem? – Slavey 13/12, 2016 at 0:34

@Trometer I want to use pretrained vectors from glove and bio_nlp, then train some other words that are not in these using new resources. How can I set specific words to trainable? Is there a way to do this? – Murrumbidgee 12/8, 2017 at 18:57

If you also wonder why "the value of a tf.constant() is stored multiple times in memory" take a look at this answer: https://mcmap.net/q/212448/-why-is-the-value-of-a-tf-constant-stored-multiple-times-in-memory-in-tensorflow – Ballata 31/10, 2017 at 13:23

Hi, I am using the method you describe above. My embedding is a NumPy array unpickled from a local path. Although I call tf.get_variable_scope().reuse_variables() after first initializing the embedding, the NumPy array appears to be loaded 4 times (there are 4 initializations overall, and the reuse call comes between the first and the second). I don't know why it initializes 4 times. Is what I have done right? – Almazan 2/1, 2018 at 6:16

@Trometer passing a huge matrix each run seems very expensive. Is it faster to just do the lookup outside TF and then pass the embedded data directly? – Dismissal 23/3, 2018 at 23:13

@Dismissal None of these approaches involves passing the whole matrix on each run. The aim is to initialize a constant/variable in at most one run call, and then tf.nn.embedding_lookup() will perform sparse lookups on the same matrix. – Trometer 27/3, 2018 at 15:55

I tried using the 2nd approach, but when I passed my embedding matrix in the feed and ran the session it gave an error stating "NameError: name 'embedding_init' is not defined". Please help me figure it out, as I have tried most things (I know I am missing something). – Authorization 1/12, 2018 at 6:43

I use this method to load and share embedding.

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

Offenseless answered 27/4, 2016 at 3:58 Comment(1)

Should the embedding be columns or rows in the numpy matrix? – Faraday 13/7, 2018 at 5:13
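
Going by the accepted answer's description (vocab_size rows and embedding_dim columns), each row holds one word's vector, and a lookup selects rows by integer id. Below is a pure-Python sketch of that row selection, mimicking the shape of what tf.nn.embedding_lookup() returns for a batch of ids. `lookup_rows` and the toy matrix are made up for illustration and are not part of TensorFlow.

```python
def lookup_rows(matrix, ids):
    """Return matrix rows for a (possibly nested) list of integer ids."""
    if isinstance(ids, int):
        return list(matrix[ids])  # one id -> one row (a word vector)
    return [lookup_rows(matrix, i) for i in ids]


# Toy matrix: 4 words (rows), 2 dimensions (columns).
toy_matrix = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]

# A batch of 2 sequences with 3 word ids each; the result is 2 x 3 x 2.
batch = lookup_rows(toy_matrix, [[0, 2, 2], [3, 1, 0]])
```

If the vectors were stored as columns instead, the same lookup would return slices across words rather than word vectors, so a column-oriented matrix would need to be transposed first.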

TensorFlow 2.0 compatible answer: there are many pre-trained embeddings that Google has developed and open-sourced.

Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.

Code to reuse one of these pre-trained embeddings, the Universal Sentence Encoder, is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  # (3, 512)

For more information on the pre-trained embeddings developed and open-sourced by Google, refer to the TF Hub link.

Hindman answered 8/1, 2020 at 12:23 Comment(0)

The answer of @mrry is not right because it causes the embedding weights to be overwritten each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, in my view, the right way to use pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))

Fleece answered 24/10, 2016 at 9:25 Comment(5)

Exact duplicate of LiuJia's answer. – Dillard 2/12, 2016 at 10:21

@Dillard .. In fact, he is missing the trainable=False argument, and thus will end up fine-tuning his embeddings in the process. – Paradox 9/12, 2016 at 22:40

Also, I think Eugenio's reasoning is incorrect. You just don't have to run the "embedding_init" op with every mini-batch, and everything will be fine. That is, just run the embedding initialization only once at the start of training. – Paradox 3/3, 2017 at 7:49

@Paradox how do I ensure that the embedding initialisation is run only at the beginning of training? – Hibbert 1/9, 2017 at 8:21

@dust0x .. If the size of the embeddings is small enough, you can just specify them as the initial value. If they are quite large, you can pass them in the feed_dict when you run the initializer for all variables. Do let me know if it is not clear enough, and I'll try to post some sample code for both approaches. – Paradox 14/9, 2017 at 8:48

With TensorFlow version 2 it's quite easy if you use the Embedding layer:

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=tf.keras.initializers.Constant(matrix_of_pretrained_weights)
                            )(ur_inp)

Lorrianelorrie answered 21/2, 2020 at 1:29 Comment(0)

I was also facing this embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried; you can also try this method:

import numpy as np
import tensorflow as tf

tf.reset_default_graph()

input_x = tf.placeholder(tf.int32, shape=[None, None])

# You have to edit the shape according to your embedding size.
# word_embedding is assumed to be the pre-trained matrix, loaded beforehand.
Word_embedding = tf.get_variable(name="W", shape=[400000, 100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # final_ is assumed to be a list of word-id sequences to look up.
    for ii in final_:
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))

Here is a detailed working tutorial (IPython example) if you want to understand it from scratch; take a look.

Tumbling answered 11/4, 2018 at 15:59 Comment(0)
