If I don't pass the num_words
argument when initializing Tokenizer()
, how do I find the vocabulary size after the tokenizer has been fitted on the training dataset?
Why do it this way? I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without a limit. But I still need to pass that vocabulary size as an argument when defining the model's first layer.
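For reference, one common way to read the full vocabulary size off a fitted tokenizer is `len(tokenizer.word_index) + 1` (the `+ 1` reserves index 0, which Keras uses for padding). A minimal sketch, assuming TensorFlow 2.x with the legacy `tf.keras.preprocessing.text.Tokenizer`; the example texts are made up:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog sat on the log"]

tokenizer = Tokenizer()  # no num_words: vocabulary size is unbounded
tokenizer.fit_on_texts(texts)

# word_index maps every word seen during fitting to a 1-based integer index
vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0
print(vocab_size)  # 7 distinct words + 1 -> 8
```

`vocab_size` is then what you would pass as `input_dim` to the model's first (embedding) layer.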
If I initialize Tokenizer(num_words=50000)
and then execute len(tokenizer.word_index) + 1
, I see a number like 75000, way more than the limit I defined. How is this possible? – Gehlbach
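The behavior the comment describes can be reproduced in a few lines: `word_index` always records every word seen during fitting, while `num_words` is only applied later, when texts are converted to sequences. A small sketch, assuming TensorFlow 2.x; the toy corpus is made up:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3)  # limit applies at conversion time, not fitting time
tokenizer.fit_on_texts(["a b c d e f"])

# word_index still contains all six words, regardless of num_words
print(len(tokenizer.word_index))  # 6

# texts_to_sequences keeps only indices below num_words, so "c" through "f" are dropped
print(tokenizer.texts_to_sequences(["a b c d e f"]))  # [[1, 2]]
```

So `len(word_index) + 1` reports the full fitted vocabulary (e.g. 75000), not the capped one; the 50000 cap only takes effect in `texts_to_sequences` and related methods.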