How does Fine-tuning Word Embeddings work? [closed]

I've been reading some NLP with Deep Learning papers and found that fine-tuning seems to be a simple yet confusing concept. The same question has been asked here before, but it's still not quite clear to me.

Fine-tuning pre-trained word embeddings into task-specific word embeddings is mentioned in papers like Y. Kim, “Convolutional Neural Networks for Sentence Classification,” and K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” but only briefly, without going into any details.

My question is:

Word embeddings generated with word2vec or GloVe are used as pre-trained input features (X) for downstream tasks like parsing or sentiment analysis. That is, these input vectors are plugged into a new neural network model for some specific task, and while training this new model we somehow obtain updated, task-specific word embeddings.

But as far as I know, during training back-propagation updates the weights (W) of the model; it does not change the input features (X). So how exactly do the original word embeddings get fine-tuned, and where do these fine-tuned vectors come from?

Meany answered 31/10, 2016 at 15:41 Comment(0)

Yes, if you feed the embedding vectors themselves as your input, you can't fine-tune the embeddings (at least not easily). However, all the frameworks provide some sort of EmbeddingLayer that takes as input an integer, the class ordinal of the word/character/other input token, and performs an embedding lookup. Such an embedding layer is very similar to a fully connected layer fed a one-hot encoded class, but it is far more efficient, because it only needs to fetch/change one row of the matrix on both the forward and backward passes. More importantly, it allows the weights of the embedding to be learned.
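
A minimal sketch of that equivalence (plain numpy, with made-up sizes): looking up row i of the embedding matrix gives the same result as multiplying a one-hot vector by it, just without paying for the full matrix product.

import numpy as np

vocab_size, embed_dim = 5, 3
E = np.random.rand(vocab_size, embed_dim)   # embedding matrix (the trainable weights)

token_id = 2
one_hot = np.eye(vocab_size)[token_id]      # one-hot encoding of the token

fc_result = one_hot @ E                     # "fully connected" view: one-hot times weight matrix
lookup_result = E[token_id]                 # embedding-layer view: just fetch one row

assert np.allclose(fc_result, lookup_result)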

So the classic way is to feed the actual token indices to the network instead of embeddings, and to prepend the entire network with an embedding layer that is initialized with word2vec / GloVe vectors and then keeps learning those weights. It can also be reasonable to freeze them for several iterations at the beginning, until the rest of the network starts doing something reasonable with them, before you start fine-tuning them.
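
For example, a rough sketch of that setup in Keras; the embedding_matrix here is a random stand-in for real word2vec/GloVe vectors, and the layers on top are just placeholders for whatever task model you use:

import numpy as np
import tensorflow as tf

# Stand-in for a real word2vec/GloVe matrix; in practice load it from the pre-trained files.
vocab_size, embed_dim = 10000, 100
embedding_matrix = np.random.rand(vocab_size, embed_dim).astype('float32')

embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)  # keep the pre-trained vectors frozen at first

model = tf.keras.Sequential([
    embedding_layer,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(...) for a few epochs with the embeddings frozen, then:

embedding_layer.trainable = True  # unfreeze to fine-tune
model.compile(optimizer='adam', loss='binary_crossentropy')  # recompile so the change takes effect
# model.fit(...) again; back-propagation now updates the embedding matrix too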

Gratify answered 31/10, 2016 at 18:13 Comment(4)
Thanks for your answer. So the EmbeddingLayer takes inputs such as a one-hot encoding and connects to the "real" hidden layers. Is that correct? By initializing the EmbeddingLayer with word2vec/GloVe, do you mean using them as the parameters of the EmbeddingLayer? I would appreciate it if you could give a simple example. Thanks - Meany
Example for Keras: blog.keras.io/…, and some discussion for TF: #35688178. The input to the embedding is usually not one-hot encoded, but rather just integer indices. Yes, initializing with GloVe means setting the weight matrix of the embedding layer to the GloVe vectors. - Gratify
This answer doesn't seem correct to me. First of all, you can indeed fine-tune the embeddings. For example, Keras allows you to load an embedding matrix into an embedding layer and have it be updated through backpropagation. Second, your two statements contradict each other: "you can't fine-tune the embeddings" and "it allows the weights of the embedding to be learned". - Trying
There's no contradiction: if the embedding is passed as the input, you can't fine-tune the embeddings; if the embedding lookup is done through an embedding layer, it allows the weights of the embedding to be learned. The two statements quoted belong to two logically separate branches. - Gratify

One-hot encoding is the basis for constructing the initial embedding layer. Once you train the network, the one-hot encoding essentially serves as a table lookup. In the fine-tuning step you can select data for the specific task and specify which variables should be fine-tuned when you define the optimizer, using something like this:

# Collect only the embedding weights, so the fine-tuning optimizer updates nothing else
embedding_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="embedding_variables/kernel")
# Separate optimizer (and learning rate) used only for the fine-tuning phase
ft_optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name='FineTune')
# Passing var_list restricts the gradient updates to the embedding variables
ft_op = ft_optimizer.minimize(mean_loss, var_list=embedding_variables)

where "embedding_variables/kernel" is the name of the next layer after one-hot encoding.
Caracul answered 20/4, 2020 at 1:6 Comment(0)
