How to find "num_words" or vocabulary size of Keras tokenizer when one is not assigned?
So if I don't pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after the tokenizer has been fitted on the training dataset?

Why this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I then need to pass this vocabulary size as an argument in the model's first layer definition.

Gehlbach answered 28/11, 2018 at 18:37 Comment(0)

All the words and their indices are stored in a dictionary that you can access using tokenizer.word_index. Therefore, you can find the number of unique words from the number of entries in this dictionary:

num_words = len(tokenizer.word_index) + 1

The + 1 is there to reserve the padding index (i.e. index zero).

Note: This solution is (obviously) applicable when you have not set the num_words argument (i.e. you don't know, or don't want to limit, the number of words). Keep in mind that word_index contains all the words (not only the most frequent ones), whether you set num_words or not.
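For completeness, here is a minimal sketch of the full flow the question describes, assuming tf.keras and the standard Tokenizer API (the toy texts and layer sizes are placeholders, not part of the original answer):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import layers, models

texts = ["the cat sat on the mat", "the dog ate my homework"]

tokenizer = Tokenizer()                        # no num_words: keep the full vocabulary
tokenizer.fit_on_texts(texts)

vocab_size = len(tokenizer.word_index) + 1     # +1 for the reserved padding index 0

model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=16),   # first layer needs the vocabulary size
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),
])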

Unlicensed answered 28/11, 2018 at 18:44 Comment(7)
Doesn't seem right, because when I initialize the tokenizer as Tokenizer(num_words=50000) and execute len(tokenizer.word_index) + 1, I see a number like 75000, way more than the limit I had defined. How is this possible?Gehlbach
@Gehlbach You mentioned you don't want to set num_words. The word_index contains all the words, whether you set num_words or not. Therefore, this solution works when you have not limited the number of words (i.e. have not set the num_words argument). Otherwise, if you have set num_words, then you already know the number of words and don't need this solution in the first place! :) I added a note to my answer to clarify this.Unlicensed
I was pointing out that the assumption vocabulary_size = len(tokenizer.word_index) + 1 fails validation.Gehlbach
I think the +1 is for the "Out of Vocabulary" word.Partisan
@Partisan If you print out word_index, there is an OOV token in it. So what does "reserving padding" mean?Molech
Could you explain a bit more about why the + 1 is for reserving padding?Molech
@EnXie Usually (though not always) you will use a padding token to pad inputs so they all have the same length. By default it is mapped to index zero, and it is not included in word_index. If you don't want to count it, then don't use the + 1.Unlicensed
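
To make the padding point concrete, a small sketch (again assuming tf.keras; the OOV token and example texts are illustrative) showing that word indices start at 1 and that index 0 only appears as padding:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(["a short sentence", "a much longer example sentence here"])

print(tokenizer.word_index)                    # indices start at 1; nothing maps to 0
seqs = tokenizer.texts_to_sequences(["a short sentence"])
print(pad_sequences(seqs, maxlen=7))           # shorter sequences are padded with 0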
