How to find "num_words" or vocabulary size of Keras tokenizer when one is not assigned?
So if I don't pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after the tokenizer has been fitted on the training dataset?

Why this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I then need to pass this vocabulary size as an argument in the model's first layer definition.

Gehlbach answered 28/11, 2018 at 18:37 Comment(0)

All the words and their indices are stored in a dictionary that you can access using tokenizer.word_index. Therefore, you can find the number of unique words from the number of entries in this dictionary:

num_words = len(tokenizer.word_index) + 1

The + 1 is there to reserve the padding index (i.e. index zero).

Note: This solution is (obviously) applicable when you have not set the num_words argument (i.e. you don't know, or don't want to limit, the number of words). Keep in mind that word_index contains all the words (not only the most frequent ones), whether you set num_words or not.
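For completeness, here is a minimal sketch of the full flow the question describes, assuming tf.keras and the standard Tokenizer API (the toy texts and layer sizes are placeholders, not part of the original answer):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import layers, models

texts = ["the cat sat on the mat", "the dog ate my homework"]

tokenizer = Tokenizer()                        # no num_words: keep the full vocabulary
tokenizer.fit_on_texts(texts)

vocab_size = len(tokenizer.word_index) + 1     # +1 for the reserved padding index 0

model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=16),   # first layer needs the vocabulary size
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])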

Unlicensed answered 28/11, 2018 at 18:44 Comment(7)
Doesn't seem right, because when I initialize the tokenizer as Tokenizer(num_words=50000) and execute len(tokenizer.word_index) + 1, I see a number like 75000, way more than the limit I had defined. How is this possible?Gehlbach
@Gehlbach You mentioned you don't want to set num_words. The word_index contains all the words, whether you set num_words or not. Therefore, this solution works when you have not limited the number of words (i.e. have not set the num_words argument). Otherwise, if you have set num_words, then you already know the number of words and don't need this solution in the first place! :) I added a note to my answer to clarify this.Unlicensed
I was pointing out that the assumption vocabulary_size = len(tokenizer.word_index) + 1 fails validation.Gehlbach
I think the +1 is for the "Out of Vocabulary" word.Partisan
@Partisan If you print out word_index, there is an OOV token in it. So what does "reserving padding" mean?Molech
Could you explain a bit more about why the + 1 is for reserving padding?Molech
@EnXie Usually (though not always) you will use a padding token to pad inputs so they all have the same length. By default it is mapped to index zero, and it is not included in word_index. If you don't want to count it, then don't use the + 1.Unlicensed
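
To make the padding point concrete, a small sketch (again assuming tf.keras; the OOV token and example texts are illustrative) showing that word indices start at 1 and that index 0 only appears as padding:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(["a short sentence", "a much longer example sentence here"])

print(tokenizer.word_index)                    # indices start at 1; nothing maps to 0
seqs = tokenizer.texts_to_sequences(["a short sentence"])
print(pad_sequences(seqs, maxlen=7))           # shorter sequences are padded with 0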
