Reloading Keras Tokenizer during Testing
I followed the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

However, I modified the code to save the trained model through h5py. Thus, after running the training script, I have a model.h5 in my directory.

Now, when I want to load it, I'm confused about how to re-create the Tokenizer. The tutorial has the following lines of code:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)  # renamed to num_words in Keras 2
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

But hypothetically, if I reload the model.h5 in a different module, I'll need to create another Tokenizer to tokenize the test set. The new Tokenizer, however, would be fit on the test data, producing a completely different word index.

Therefore, my question is: How do I reload the Tokenizer that was fit on the training dataset? Or am I misunderstanding the functionality of the Embedding layer in Keras? Right now, I'm assuming that since we mapped word indices to their corresponding vectors from the pre-trained word embeddings, the word indices need to stay consistent between training and testing. That is not possible if we run another fit_on_texts on the test dataset.
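To illustrate the concern (train_texts and test_texts here are placeholder lists of strings), fitting a second Tokenizer on the test data produces an incompatible word index:

from keras.preprocessing.text import Tokenizer

# Two independently fitted tokenizers build two different vocabularies
tok_train = Tokenizer(num_words=MAX_NB_WORDS)
tok_train.fit_on_texts(train_texts)

tok_test = Tokenizer(num_words=MAX_NB_WORDS)
tok_test.fit_on_texts(test_texts)

# The two word indices generally disagree, so the same word would point
# at different rows of the embedding matrix at train and test time.
print(tok_train.word_index == tok_test.word_index)  # almost always False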

Thank you and looking forward to your answers!

Dunkle asked 26/6, 2017 at 13:31 Comment(1)
Possible duplicate of Keras Text Preprocessing - Saving Tokenizer object to file for scoring – Jerry

Check out this question. The commenter there recommends using pickle to save the Tokenizer object and its state, though the question remains open as to why this kind of functionality is not built into Keras.
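A minimal sketch of that pickle approach (the tokenizer.pickle filename and the test_texts variable are illustrative, assuming tokenizer was fit on the training texts as in the tutorial):

import pickle

# In the training script, right after fit_on_texts:
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)

# In the testing module, reload the very same Tokenizer so the word
# indices match the ones the Embedding layer was trained with:
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

# Only transform the test set; do not call fit_on_texts again,
# since that would rebuild the word index from scratch.
test_sequences = tokenizer.texts_to_sequences(test_texts)

Newer Keras releases also provide tokenizer.to_json() together with keras.preprocessing.text.tokenizer_from_json() as a pickle-free way to persist the same state.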

Concordant answered 5/7, 2017 at 18:15 Comment(0)
