Reloading Keras Tokenizer during Testing
I followed the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

However, I modified the code to save the trained model through h5py. Thus, after running the training script, I have a model.h5 in my directory.

Now, when I want to load it, I'm confused about how to re-create the Tokenizer. The tutorial has the following lines of code:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)  # renamed to num_words in Keras 2
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

But hypothetically, if I reload the model.h5 in a different module, I'll need to create another Tokenizer to tokenize the test set. The new Tokenizer, however, would be fit on the test data, producing a completely different word index.

Therefore, my question is: How do I reload the Tokenizer that was fit on the training dataset? Or am I misunderstanding the functionality of the Embedding layer in Keras? Right now, I'm assuming that since we mapped word indices to their corresponding vectors from the pre-trained word embeddings, the word indices need to stay consistent between training and testing. That is not possible if we run another fit_on_texts on the test dataset.
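To illustrate the concern (train_texts and test_texts here are placeholder lists of strings), fitting a second Tokenizer on the test data produces an incompatible word index:

from keras.preprocessing.text import Tokenizer

# Two independently fitted tokenizers build two different vocabularies
tok_train = Tokenizer(num_words=MAX_NB_WORDS)
tok_train.fit_on_texts(train_texts)

tok_test = Tokenizer(num_words=MAX_NB_WORDS)
tok_test.fit_on_texts(test_texts)

# The two word indices generally disagree, so the same word would point
# at different rows of the embedding matrix at train and test time.
print(tok_train.word_index == tok_test.word_index)  # almost always False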

Thank you and looking forward to your answers!

Dunkle asked 26/6, 2017 at 13:31 Comment(1)
Possible duplicate of Keras Text Preprocessing - Saving Tokenizer object to file for scoring – Jerry

Check out this question. The commenter there recommends using pickle to save the Tokenizer object and its state, though the question remains open as to why this kind of functionality is not built into Keras.
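A minimal sketch of that pickle approach (the tokenizer.pickle filename and the test_texts variable are illustrative, assuming tokenizer was fit on the training texts as in the tutorial):

import pickle

# In the training script, right after fit_on_texts:
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)

# In the testing module, reload the very same Tokenizer so the word
# indices match the ones the Embedding layer was trained with:
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

# Only transform the test set; do not call fit_on_texts again,
# since that would rebuild the word index from scratch.
test_sequences = tokenizer.texts_to_sequences(test_texts)

Newer Keras releases also provide tokenizer.to_json() together with keras.preprocessing.text.tokenizer_from_json() as a pickle-free way to persist the same state.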

Concordant answered 5/7, 2017 at 18:15 Comment(0)
