Keras Tokenizer num_words doesn't seem to work

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?

Reconstitute answered 13/9, 2017 at 16:24

There is nothing wrong with what you are doing. word_index is computed the same way regardless of how many most-frequent words you will use later (as you may see here). So when you call any transformative method, the Tokenizer will use only the three most common words, while still keeping a counter of all words, even though it obviously won't use them later.
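
You can see this by comparing word_index with the output of a transformative method. A minimal sketch, assuming the standard keras.preprocessing.text.Tokenizer and the sentences from the question:

from keras.preprocessing.text import Tokenizer

t = Tokenizer(num_words=3)
l = ["Hello, World! This is so&#$ fantastic!",
     "There is no other world like this one"]
t.fit_on_texts(l)

# word_index still contains every word...
print(len(t.word_index))        # 11

# ...but transformations keep only words with index < num_words,
# i.e. 'world' (1) and 'this' (2); everything else is dropped.
print(t.texts_to_sequences(l))  # [[1, 2], [1, 2]]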

Washtub answered 13/9, 2017 at 19:13
So num_words has no bearing on fit_on_texts() either? – Honolulu

Just an addition to Marcin's answer ("it keeps a counter of all words, even though it obviously won't use them later").

The reason it keeps a counter of all words is that fit_on_texts can be called multiple times. Each call updates the internal counters, and when transformations are called, they use the top words based on the updated counters.
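
For example, a small sketch of incremental fitting (assuming the standard Keras Tokenizer; counts accumulate across calls):

from keras.preprocessing.text import Tokenizer

t = Tokenizer(num_words=3)
t.fit_on_texts(["Hello, World! This is so fantastic!"])
t.fit_on_texts(["There is no other world like this one"])  # updates, does not reset

# Counters have accumulated across both calls:
print(t.word_counts['world'])  # 2
print(t.document_count)        # 2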

Hope it helps.

Shulins answered 24/10, 2019 at 6:33

Limiting num_words to a small number (e.g., 3) has no effect on fit_on_texts outputs such as word_index, word_counts, and word_docs. It does have an effect on texts_to_matrix: the resulting matrix will have num_words (3) columns.

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> print(t.word_index)
{'world': 1, 'this': 2, 'is': 3, 'hello': 4, 'so': 5, 'fantastic': 6, 'there': 7, 'no': 8, 'other': 9, 'like': 10, 'one': 11}

>>> t.texts_to_matrix(l, mode='count')
array([[0., 1., 1.],
       [0., 1., 1.]])
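
Note that the first column corresponds to index 0, which the Tokenizer reserves and never assigns to any word, so it is always zero; the other two columns count 'world' (index 1) and 'this' (index 2).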
Rabbitfish answered 30/11, 2019 at 1:3

Just to add a little to farid khafizov's answer: words whose index is num_words or greater are removed from the results of texts_to_sequences (index 4, 'dog', disappeared from the 1st sentence; 5, 'cat', from the 2nd; and both 6, 'you', and 4, 'dog', from the 3rd).

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

print(tf.__version__) # 2.4.1, in my case
sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=4)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
seq = tokenizer.texts_to_sequences(sentences)
print(word_index)  # {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
print(seq)         # [[3, 1, 2], [3, 1, 2], [1, 2]]
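
As a side note, if you'd rather keep a placeholder for the dropped words instead of removing them, the Tokenizer's oov_token argument substitutes an out-of-vocabulary index for any word whose index is num_words or greater. A small sketch with the same sentences (note that the OOV token takes index 1, shifting every other word up by one):

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog', 'I, love my cat', 'You love my dog!']
tokenizer = Tokenizer(num_words=4, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}

# 'i', 'dog', 'cat' and 'you' now map to the OOV index (1) instead of vanishing:
print(tokenizer.texts_to_sequences(sentences))
# [[1, 2, 3, 1], [1, 2, 3, 1], [1, 2, 3, 1]]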
Mokpo answered 12/5, 2021 at 3:30