Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using the Keras library by broadly following the steps below.

  1. Convert the text corpus into sequences using the Tokenizer class
  2. Build and train a model using the model.fit() method
  3. Evaluate the model

Now, for scoring with this model, I was able to save the model to a file and load it back. However, I haven't found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?

Kavita answered 17/8, 2017 at 12:25 Comment(0)

The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save a Tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
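Since a fitted Tokenizer is an ordinary picklable Python object, the same round-trip works for anything holding the fitted vocabulary. A minimal sketch of the pattern, using a plain dict as a stand-in for a fitted Tokenizer so it runs without Keras installed:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted Tokenizer: any picklable object follows the same pattern.
word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")

# saving
with open(path, "wb") as handle:
    pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading: the restored object is ready to use, with no re-fitting needed
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == word_index)  # True
```

The restored object is byte-for-byte equivalent to what was saved, which is exactly why the loaded tokenizer needs no further fit_on_texts call.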
Pashto answered 17/8, 2017 at 14:15 Comment(6)
Do you call tokenizer.fit_on_texts again on the test set?Vigue
No. If you call fit* again it could change the index. The pickle-loaded tokenizer is ready to use.Sapro
Wait. You have to save both a model and a tokenizer in order to run a model in the future?Carnelian
Of course! They have two different roles: the tokenizer transforms text into vectors, and it's important to have the same vector space between training & testing.Crying
I'm downvoting in favor of the built in solution within the object itself.Farcy
@David: But it wasn't available at the time when the answer was given. And it is still valid. You can write your own answer with appropriate usage of a built-in function.Glandular

The Tokenizer class has a function to save its data in JSON format:

import io
import json

tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:

import json
from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
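Since to_json() returns a JSON string, the file round-trip above is plain JSON I/O. A minimal sketch of the same save/load pattern using a stand-in config string (the field values are hypothetical; no Keras required):

```python
import io
import json
import os
import tempfile

# Stand-in for tokenizer.to_json(): a JSON string holding config fields.
tokenizer_json = json.dumps({"num_words": None, "lower": True,
                             "word_index": json.dumps({"the": 1, "cat": 2})})

path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")

# save: write the JSON string, as in the answer above
with io.open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

# load: json.load recovers the original JSON string, which
# tokenizer_from_json would then parse back into a Tokenizer
with open(path) as f:
    data = json.load(f)

print(data == tokenizer_json)  # True
```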
Newberry answered 1/2, 2019 at 10:58 Comment(4)
tokenizer_from_json doesn't seem to be available in Keras anymore, or rather it's not listed in their docs or available in the package on conda. @Newberry do you still do it this way?Dior
@Dior I use the Keras-Preprocessing==1.0.9 package from PyPI and the function is available.Newberry
tokenizer_to_json should be available in tensorflow > 2.0.0 at some point soon; see this pr. In the meantime, from keras_preprocessing.text import tokenizer_from_json can be used.Amphisbaena
This worked for me. Thank youYehudi

The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts is comprised of two lists, Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) as compared with first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).

Concrete Example:

from keras.preprocessing.text import Tokenizer

docs = ["A heart that",
        "full up like",
        "a landfill",
        "no surprises",
        "and no alarms",
        "a job that slowly",
        "Bruises that",
        "You look so",
        "tired happy",
        "no alarms",
        "and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]

# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train)  # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" % (encoded_test_1,))

# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs)  # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" % (encoded_test_2,))

Results:

result for test 1:
[[], [], [3, 11], [10, 3, 9]]
result for test 2:
[[15, 16, 17], [18, 19], [1, 6], [5, 1, 4]]

Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty lists [].
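The effect does not depend on Keras itself; any frequency-ranked vocabulary behaves the same way. A minimal stand-in, with hypothetical fit/encode helpers mimicking fit_on_texts and texts_to_sequences (unknown words are silently dropped, as when no oov_token is set):

```python
from collections import Counter

def fit(texts):
    # Rank words by frequency, ties broken by first occurrence.
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return {w: i + 1 for i, (w, _) in
            enumerate(sorted(counts.items(), key=lambda x: -x[1]))}

def encode(texts, word_index):
    # Words absent from the vocabulary are silently dropped.
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

train = ["no surprises", "and no alarms"]
test = ["no alarms", "tired happy"]

vocab_train = fit(train)       # fit on train only
vocab_all = fit(train + test)  # fit on train + test

print(encode(test, vocab_train))  # unseen test words vanish entirely
print(encode(test, vocab_all))    # same sentences, different indices
```

Fitting on train only maps "no alarms" to one set of indices and drops "tired happy" altogether, while fitting on the full corpus shifts every index, which is exactly the mismatch the experiment above demonstrates.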

Hippocrene answered 6/7, 2018 at 6:4 Comment(5)
moral of the story: if using word embeddings and keras's Tokenizer, use fit_on_texts only once on a very large corpus; or use character n-grams instead.Hippocrene
I don't understand what's the message you're trying to communicate: why would one fit on test docs in the first place? By definition, whatever it is that you're doing, the test must be kept in a vault as if you didn't know you had it in the first place.Clementclementas
@gented: you may be confusing unsupervised text parsing with supervised ML. Correct me if I'm wrong, but keras's Tokenizer does not have a loss function attached to it that is meant for generalization; hence, this is not a (supervised) machine learning problem -- which appears to be your assumption. The message I was trying to communicate is summarized in my first comment above ("moral of the story..."), which may be worth re-reading.Hippocrene
@Clementclementas good points. sorry if the nomenclature confused you; I was keeping some consistency with the comments in the accepted answer.Hippocrene
I agree with @Clementclementas in that you do not want to fit your tokenizer in the test set because then you remove the possibility of oov tokens at test time, defeating the purpose of a test set. It's not about the tokenizer having a loss, but rather about the data from the test set leaking into your training data.Tarsier

I've created the issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue has a link to a gist with code that demonstrates how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (for various reasons, but mainly a mixed JS/Python environment), and this allows for that, even with sort_keys=True.

Neoplatonism answered 2/2, 2018 at 16:58 Comment(3)
the linked gist looks like a good way to "reload" a trained tokenizer. However, the original question potentially relates to "extending" a previously saved tokenizer to new (test) texts; this part still seems open (otherwise, why "save" a model if it won't be used to "score" new data?)Hippocrene
I think their intent is clear: "Without this I'll have to process the corpus every time I need to score even a single sentence". From this, I gather that they want to skip the tokenizing step and evaluate the trained model on other data. They don't ask anything else; that is what you are anticipating. They, like most people, only want to apply a previously fitted tokenizer to a different data set, which is skipped in most tutorials. Therefore, I think my answer 1) answers what was asked, and 2) provides working code.Neoplatonism
fair points. the question is "Saving Tokenizer object to file for scoring" so one might assume they're asking about scoring (potentially new data), too.Hippocrene

I found the following snippet, provided by @thusv89 at the following link.

Save objects:

import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor, 
         'target_tensor': target_tensor, 
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)

Load objects:

with open('data_objects.pickle', 'rb') as f:
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']
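Bundling several objects into one dict before pickling keeps everything a model needs for scoring in a single file. A minimal runnable sketch of the same pattern with stand-in values (the tensor/language objects above come from the linked example):

```python
import os
import pickle
import tempfile

# Stand-ins for the objects above; any picklable values work the same way.
objects = {'input_tensor': [[1, 2], [3, 4]],
           'target_tensor': [[5, 6]],
           'inp_lang': {'hello': 1},
           'targ_lang': {'bonjour': 1}}

path = os.path.join(tempfile.mkdtemp(), 'data_objects.pickle')

# Save everything in one dict, one file.
with open(path, 'wb') as handle:
    pickle.dump(objects, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load and unpack each object by its key.
with open(path, 'rb') as f:
    data = pickle.load(f)

print(data['inp_lang'])  # each object is restored under its key
```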
Ge answered 10/1, 2021 at 15:53 Comment(0)

Quite easy, because the Tokenizer class provides two functions for save and load:

save —— Tokenizer.to_json()

load —— keras.preprocessing.text.tokenizer_from_json

The to_json() method calls the get_config method, which handles this:

    json_word_counts = json.dumps(self.word_counts)
    json_word_docs = json.dumps(self.word_docs)
    json_index_docs = json.dumps(self.index_docs)
    json_word_index = json.dumps(self.word_index)
    json_index_word = json.dumps(self.index_word)

    return {
        'num_words': self.num_words,
        'filters': self.filters,
        'lower': self.lower,
        'split': self.split,
        'char_level': self.char_level,
        'oov_token': self.oov_token,
        'document_count': self.document_count,
        'word_counts': json_word_counts,
        'word_docs': json_word_docs,
        'index_docs': json_index_docs,
        'index_word': json_index_word,
        'word_index': json_word_index
    }
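Because the nested mappings in this config are themselves JSON-encoded strings, restoring requires a second json.loads per field. A sketch of the round-trip this config enables, with hypothetical field values (not a real Tokenizer):

```python
import json

# Inner mappings are stored as JSON strings, mirroring get_config above.
config = {
    'num_words': None,
    'lower': True,
    'word_counts': json.dumps({'the': 2, 'cat': 1}),
    'word_index': json.dumps({'the': 1, 'cat': 2}),
}

# to_json() effectively dumps this config to a single string...
blob = json.dumps(config)

# ...and tokenizer_from_json reverses it: an outer loads,
# then an inner loads for each JSON-encoded field.
restored = json.loads(blob)
word_index = json.loads(restored['word_index'])

print(word_index['the'])  # 1
```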
Gladiatorial answered 16/9, 2021 at 2:40 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Bartolemo

© 2022 - 2024 — McMap. All rights reserved.