I am trying to build a translation network using embeddings and an RNN. I have trained a Gensim Word2Vec model and it is learning word associations pretty well. However, I can’t get my head around how to properly add the layer to a Keras model. (And how to do an ‘inverse embedding’ for the output. But that’s another question that has already been answered: by default you can’t.)
In Word2Vec, when you input a string, e.g. model['hello'], you get a vector representation of the word. However, I believe that the keras.layers.Embedding layer returned by Word2Vec's get_keras_embedding() takes a one-hot/tokenized input instead of a string input. The documentation does not explain what the appropriate input is, and I cannot figure out how to obtain the one-hot/tokenized vector of the vocabulary that has a 1-to-1 correspondence with the Embedding layer’s input.
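To make the question concrete, this is the kind of mapping I think the layer needs. It is only a sketch under the assumption that get_keras_embedding() reuses the model's vector matrix as weights, so that row i belongs to the i-th vocabulary word. The names index_to_word and sentences_to_indices are hypothetical; index_to_word stands in for the model's vocabulary list (word2vec_model.wv.index2word in Gensim 3.x).

```python
# Hypothetical stand-in for the trained model's vocabulary order.
# Assumption: row i of the embedding matrix belongs to index_to_word[i].
index_to_word = ["the", "hello", "world", "cat"]
word_to_index = {word: i for i, word in enumerate(index_to_word)}


def sentences_to_indices(sentences, word_to_index, max_len, pad_index=0):
    """Convert lists of words into fixed-length lists of integer indices."""
    rows = []
    for sentence in sentences:
        # Drop out-of-vocabulary words; a dedicated unknown-word row is another option.
        ids = [word_to_index[w] for w in sentence if w in word_to_index]
        ids = ids[:max_len]                          # truncate long sentences
        ids += [pad_index] * (max_len - len(ids))    # pad short sentences
        rows.append(ids)
    return rows


batch = sentences_to_indices(
    [["hello", "world"], ["the", "unknown", "cat"]],
    word_to_index,
    max_len=4,
)
print(batch)
```

One caveat with this sketch: padding with index 0 collides with whichever real word sits at row 0, so the padding is indistinguishable from that word unless a separate padding row is reserved.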
More elaboration below:
Currently my workaround is to apply the embedding outside Keras before feeding it to the network. Is there any detriment in doing this? I will set the embedding to non-trainable anyway. So far I have noticed that memory use is extremely inefficient (around 50GB even before declaring the Keras model, for a collection of 64-word-long sentences), because the padded inputs and the weights have to be loaded outside the model. Maybe a generator can help.
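For reference, the generator idea I have in mind looks roughly like the sketch below. It assumes the sentences are already stored as integer word indices and that the dense vectors are only looked up one batch at a time. The name batch_generator and the toy arrays are hypothetical; in my setup vectors would be the Word2Vec matrix.

```python
import numpy as np


def batch_generator(index_rows, vectors, batch_size):
    """Yield (inputs, targets) batches forever.

    index_rows: int array of shape (n_sentences, max_len) holding word indices.
    vectors:    float array of shape (vocab_size, dim), e.g. the Word2Vec matrix.
    """
    n = len(index_rows)
    while True:
        order = np.random.permutation(n)             # reshuffle on every pass
        for start in range(0, n, batch_size):
            rows = index_rows[order[start:start + batch_size]]
            # Dense lookup for this batch only: shape (batch, max_len, dim).
            targets = vectors[rows]
            yield rows, targets


# Toy data standing in for the real corpus and embedding matrix.
vectors = np.random.rand(10, 3).astype("float32")
index_rows = np.random.randint(0, 10, size=(5, 4))
gen = batch_generator(index_rows, vectors, batch_size=2)
x_batch, y_batch = next(gen)
print(x_batch.shape, y_batch.shape)
```

With 64 words and 300 dimensions, one embedded sentence takes 64 × 300 × 4 bytes (about 77 KB) as float32, so keeping only the integer indices in memory and embedding per batch should be far cheaper than embedding everything up front. A generator like this could be passed to fit_generator() (or to fit() in newer Keras versions) together with steps_per_epoch.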
The following is my code. Inputs are padded to 64 words. The Word2Vec embedding has 300 dimensions. There are probably a lot of mistakes here due to repeated experimentation while trying to make the embedding work. Suggestions are welcome.
import gensim
word2vec_model = gensim.models.Word2Vec.load("word2vec.model")
from keras.models import Sequential
from keras.layers import Embedding, GRU, Input, Flatten, Dense, TimeDistributed, Activation, PReLU, RepeatVector, Bidirectional, Dropout
from keras.optimizers import Adam, Adadelta
from keras.callbacks import ModelCheckpoint
from keras.losses import sparse_categorical_crossentropy, mean_squared_error, cosine_proximity
keras_model = Sequential()
keras_model.add(word2vec_model.get_keras_embedding(train_embeddings=False))
keras_model.add(Bidirectional(GRU(300, return_sequences=True, dropout=0.1, recurrent_dropout=0.1, activation='tanh')))
keras_model.add(TimeDistributed(Dense(600, activation='tanh')))
# keras_model.add(PReLU())
# ^ For some reason I get error when I add Activation ‘outside’:
# int() argument must be a string, a bytes-like object or a number, not 'NoneType'
# But keras_model.add(Activation('relu')) works.
keras_model.add(Dense(source_arr.shape[1] * source_arr.shape[2]))
# size = max-output-sentence-length * embedding-dimensions to learn the embedding vector and find the nearest word in word2vec_model.similar_by_vector() afterwards.
# Alternatively one can use Dense(vocab_size) and train the network to output one-hot categorical words instead.
# Remember to change Keras loss to sparse_categorical_crossentropy.
# But this won’t benefit from Word2Vec.
keras_model.compile(loss=mean_squared_error,
                    optimizer=Adadelta(),
                    metrics=['mean_absolute_error'])
keras_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_19 (Embedding) (None, None, 300) 8219700
_________________________________________________________________
bidirectional_17 (Bidirectio (None, None, 600) 1081800
_________________________________________________________________
activation_4 (Activation) (None, None, 600) 0
_________________________________________________________________
time_distributed_17 (TimeDis (None, None, 600) 360600
_________________________________________________________________
dense_24 (Dense) (None, None, 19200) 11539200
=================================================================
Total params: 21,201,300
Trainable params: 12,981,600
Non-trainable params: 8,219,700
_________________________________________________________________
filepath="best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)
This throws an error when I try to fit the model with text:
Train on 8000 samples, validate on 2000 samples
Epoch 1/100
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-865f8b75fbc3> in <module>()
2 checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
3 callbacks_list = [checkpoint]
----> 4 keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)
~/virtualenv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1040 initial_epoch=initial_epoch,
1041 steps_per_epoch=steps_per_epoch,
-> 1042 validation_steps=validation_steps)
1043
1044 def evaluate(self, x=None, y=None,
~/virtualenv/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 if not isinstance(outs, list):
201 outs = [outs]
~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2659 return self._legacy_call(inputs)
2660
-> 2661 return self._call(inputs)
2662 else:
2663 if py_any(is_tensor(x) for x in inputs):
~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
2612 array_vals.append(
2613 np.asarray(value,
-> 2614 dtype=tensor.dtype.base_dtype.name))
2615 if self.feed_dict:
2616 for key in sorted(self.feed_dict.keys()):
~/virtualenv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
ValueError: could not convert string to float: 'hello'
The following is an excerpt from Rajmak demonstrating how to use a tokenizer to convert words into the input of a Keras Embedding.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
……
indices = np.arange(data.shape[0]) # get sequence of row index
np.random.shuffle(indices) # shuffle the row indexes
data = data[indices] # shuffle data/product-titles/x-axis
……
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
……
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
A Keras embedding layer can be obtained with Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed as shown below. The null word embeddings count is the number of words not found in our pre-trained vectors (in this case Google News). These could be unique words for brands in this context.
from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
embedding_layer = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
                            embedding_matrix.shape[1], # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model.summary()
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Here the embedding_layer is explicitly created using:
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
However, if we use get_keras_embedding(), the embedding matrix is already constructed and fixed. I do not know how each word_index in the Tokenizer can be coerced to match the corresponding word in the input of the Keras embedding returned by get_keras_embedding().
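One alternative I can imagine is to go the other way: derive word_index from the Word2Vec vocabulary order instead of from the Tokenizer, and reserve row 0 for padding the way the Tokenizer does. The sketch below uses toy data; index_to_word and vectors are hypothetical stand-ins for the model's vocabulary list and vector matrix, and I have not confirmed that this is the intended usage.

```python
import numpy as np

# Hypothetical stand-ins for the model's vocabulary list and vector matrix.
index_to_word = ["the", "hello", "world"]
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype="float32")

# Reserve row 0 for padding and shift every word index up by one.
word_index = {word: i + 1 for i, word in enumerate(index_to_word)}
padding_row = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
embedding_matrix = np.vstack([padding_row, vectors])

# Row word_index[w] of the new matrix now holds the Word2Vec vector of w.
hello_vector = embedding_matrix[word_index["hello"]]
print(embedding_matrix.shape, hello_vector)
```

A matrix built this way could be handed to Embedding(..., weights=[embedding_matrix], trainable=False) as in the excerpt above, in place of get_keras_embedding(), so that the indices and the rows are aligned by construction.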
So, what is the proper way to use Word2Vec's get_keras_embedding() in Keras?