How to conceptually think about relationship between tokenized words and word embeddings?

I have been using JJ Allaire's guide to using word embeddings in a neural network model for text processing (https://jjallaire.github.io/deep-learning-with-r-notebooks/notebooks/6.1-using-word-embeddings.nb.html). I am confused about how the model relates the tokenized sequences of words (x_train) back to the word embeddings, which are defined using the whole dataset (instead of just the training data). Is there a way to conceptualize how the word tokens are mapped to word embeddings? Put differently, how does a word like 'king' map to its word embedding (obtained using GloVe, for example)? I am asking about the relationship between these chunks of code:

# fitting the model
history <- model %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_data = list(x_val, y_val)
)

# defining the model and loading the pretrained word embeddings
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim,
                  input_length = maxlen) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

get_layer(model, index = 1) %>%
  set_weights(list(embedding_matrix)) %>%
  freeze_weights()

How is a tokenized word from x_train linked back to a word in the embedding_matrix (especially if the embedding layer is trained on all the data)?

Watcher answered 4/5, 2018 at 23:32

In short

Conceptually, keras::layer_embedding() takes a 2D matrix [samples, word_sequences], whose values are integer word identifiers (word indices), and replaces each identifier with its word vector, so that the data becomes a 3D array [samples, word_sequences, embeddings] -- in other words, the values are now word vectors, not word identifiers. The word vectors that are glued on can come from somewhere else, as in your example above, or they can be randomly initialized and updated during training.
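
If it helps to see the shapes, here is a minimal sketch using a throwaway model that contains nothing but an embedding layer (the vocabulary size, vector length and indices below are made up for illustration):

library(keras)

# toy model: only an embedding layer, so we can inspect its output directly
toy_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10, output_dim = 8, input_length = 3)

# 2 samples x 3 word positions; the values are (made-up) word indices
toy_input <- matrix(c(1, 5, 2,
                      3, 1, 4), nrow = 2, byrow = TRUE)

dim(predict(toy_model, toy_input))
# 2 3 8 -- i.e. [samples, word_sequences, embeddings]

The 2D matrix of indices goes in, and a 3D array of (in this case randomly initialized) word vectors comes out.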


In less short

You pass keras::layer_embedding() word sequences. train_x (x_train in your code) is a 2D matrix where the rows are samples (documents) and the columns are word sequences. The values in train_x are the integer identifiers (word indices) of each word, corresponding to their positions in your separately stored list of words (the vocabulary). We can stylize train_x as:

[figure: a stylized train_x matrix -- rows are samples, columns are word positions, and each cell holds a word index]
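
A tiny made-up version of that picture, just to fix ideas (the indices are arbitrary):

# purely illustrative indices; a real train_x comes from the tokenizer
train_x <- rbind(
  c(12,  75,   3, 108,   9),   # sample (document) 1
  c( 7,   2, 513,  75,  41)    # sample (document) 2
)
dim(train_x)   # 2 samples x 5 word positions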

Here, the value 75 corresponds to the word in the 75th position of your vocabulary.

The embedding_matrix you are passing to keras::set_weights() is a 2D matrix whose rows line up with the positions in your vocabulary. For instance, the values in the 75th row of embedding_matrix are the word vector for the word in the 75th position of your vocabulary.
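
That alignment is something you create yourself when you build embedding_matrix, before handing it to set_weights(). A rough sketch of that step, assuming (as in the guide you linked) that word_index comes from the tokenizer and embeddings_index is a named list of pre-trained GloVe vectors:

embedding_matrix <- array(0, dim = c(max_words, embedding_dim))

for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]          # e.g. the GloVe vector for "king"
    if (!is.null(embedding_vector))
      embedding_matrix[index + 1, ] <- embedding_vector   # +1: Keras looks rows up 0-based, R stores them 1-based
  }
}

So 'king' lands in the row that corresponds to its index in word_index, and that same index is what appears in train_x wherever 'king' occurs -- which is the link you were asking about.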

So if you are gluing on pre-trained embeddings, as in your example above, then keras::layer_embedding() simply replaces each word index with the word vector stored in the corresponding row of embedding_matrix. We can stylize the operation as:

# stylized only -- the layer performs this lookup internally and efficiently
output <- array(0, dim = c(nrow(train_x), ncol(train_x), ncol(embedding_matrix)))

for (x in 1:nrow(train_x)) {
  for (y in 1:ncol(train_x)) {
    output[x, y, ] <- embedding_matrix[train_x[x, y], ]
  }
}

We therefore end up with a 3D array (a cube), which we can stylize as:

[figure: a stylized 3D array -- samples x word_sequences x embedding dimensions]
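
You can confirm this shape on your own model (a quick check, assuming the model from your question has been defined):

summary(model)
# the embedding layer's output shape is (None, maxlen, embedding_dim): batch, word_sequences, embeddings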

Hinckley answered 15/5, 2018 at 0:23
Is tokenization the same as feature extraction? – Lubberly
@Lubberly In a nutshell, yes. Tokenization is breaking your sentences/documents into words and then converting them to their respective tokens/word indices (after removing stop words, etc.). – Cynara

The tokenizer contains two dictionaries: one maps words to indices, the other maps indices back to words. The index reflects word frequency: the tokenizer counts how many times each word appears across the whole dataset, and the more often a word appears, the smaller its index.
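
A minimal sketch of that in R Keras (the example texts are made up; word_index is the words-to-indices mapping described above, and the reverse mapping can be recovered by inverting it):

library(keras)

texts <- c("the cat sat on the mat", "the dog sat")   # made-up example documents

tokenizer <- text_tokenizer(num_words = 1000) %>%
  fit_text_tokenizer(texts)

tokenizer$word_index                   # word -> index; "the" is most frequent, so it gets index 1
texts_to_sequences(tokenizer, texts)   # each document becomes a sequence of those indices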

A word embedding is something like a dictionary: it maps a word (or its index) to a vector, say a 128-dimensional vector per word. It can be trained on a huge dataset; you can use GloVe or Word2Vec (the skip-gram model). In Keras you can easily add an embedding layer, and embedding layers learn how to represent an index as a vector.
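
As a sketch of that last point (the sizes below are made up; the layer's weights start out random and are updated during training):

library(keras)

example_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 128, input_length = 100) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")

# the embedding layer's single weight matrix: row i + 1 holds the 128-dim vector for word index i
learned <- get_weights(get_layer(example_model, index = 1))[[1]]
dim(learned)   # 10000 x 128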

I think your training data and test data come from the same distribution, so both the word indices and the embedding vectors should be the same for each; that is the reason why we train the embedding on the whole dataset.

Chronological answered 4/5, 2018 at 23:53
Thanks for the response. I think my question is more about how a sequence of tokens from the training data is mapped to the word embedding of each token. The embedding layer is represented as a matrix (without any word notation). How would a word map back to that matrix of numbers? If the training data has a word like 'king', how does it map to its embedding representation [0.066585518 0.025939867 -0.001646227 -0.07097735 -0.008546143 -0.04104300 0.0001149562] (especially when there is no sequential order to the mappings)? – Watcher
