tokenizer.texts_to_sequences Keras Tokenizer gives almost all zeros

I am working on a text classification model, but I am having problems encoding the documents with the tokenizer.

1) I started by fitting a tokenizer on my documents, like this:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size, filters='')
tokenizer.fit_on_texts(df['data'])

2) Then I wanted to check whether my data was fitted correctly, so I converted it into sequences, like this:

sequences = tokenizer.texts_to_sequences(df['data'])
data = pad_sequences(sequences, maxlen=num_words)
print(data) 

which gave me the expected output, i.e. the words encoded as numbers:

[[ 9628  1743    29 ...   161    52   250]
 [14948     1    70 ...    31   108    78]
 [ 2207  1071   155 ... 37607 37608   215]
 ...
 [  145    74   947 ...     1    76    21]
 [   95 11045  1244 ...   693   693   144]
 [   11   133    61 ...    87    57    24]]

Now I wanted to convert a single text into a sequence using the same method, like this:

sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=num_words)
print(text)

but it gave me this weird output:

[[   0    0    0    0    0    0    0    0    0  394]
 [   0    0    0    0    0    0    0    0    0 3136]
 [   0    0    0    0    0    0    0    0    0 1383]
 [   0    0    0    0    0    0    0    0    0  507]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0    0    0 1114]
 [   0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0    0    0 1261]
 [   0    0    0    0    0    0    0    0    0  753]]

According to the Keras documentation:

texts_to_sequences(texts)

Arguments: texts: list of texts to turn to sequences.

Return: list of sequences (one per text input).

Isn't it supposed to encode each word to its corresponding number, and then pad the text to 50 if it is shorter than 50? Where is the mistake?

Bazaar answered 5/8, 2018 at 23:28 Comment(0)

I guess you should call it like this, passing a list of texts rather than a single string:

sequences = tokenizer.texts_to_sequences(["physics is nice "])
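
A minimal sketch of why the list matters, using a made-up toy corpus: texts_to_sequences iterates over whatever you pass it, so a bare string is split into single characters and each character is treated as a separate text (the lone numbers in the question's output are presumably single characters that happen to exist in the fitted vocabulary).

from keras.preprocessing.text import Tokenizer

# Toy corpus, made up purely for illustration.
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(["physics is nice", "maths is nice too"])

# Bare string: iterated character by character, one sequence per character
# (all empty here, because single characters are not in this toy vocabulary).
print(tokenizer.texts_to_sequences("physics is nice"))

# List of texts: one sequence of word indices per text.
print(tokenizer.texts_to_sequences(["physics is nice"]))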
Coltoncoltsfoot answered 22/3, 2019 at 7:37 Comment(0)

The error is where you pad the sequences. The value of maxlen should be the maximum number of tokens you want, e.g. 50. So change the lines to:

maxlen = 50
data = pad_sequences(sequences, maxlen=maxlen)
sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=maxlen)

This will truncate the sequences to 50 tokens and pad the shorter ones with zeros. Watch out for the padding option: the default is 'pre', which means that if a sentence is shorter than maxlen, the padded sequence starts with zeros. If you want the zeros at the end of the sequence instead, pass padding='post' to pad_sequences.
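
For example, a quick sketch with a made-up sequence of word indices:

from keras.preprocessing.sequence import pad_sequences

seq = [[4, 87, 21]]  # made-up word indices, just for illustration

print(pad_sequences(seq, maxlen=10))
# default padding='pre':  [[ 0  0  0  0  0  0  0  4 87 21]]

print(pad_sequences(seq, maxlen=10, padding='post'))
# padding='post':         [[ 4 87 21  0  0  0  0  0  0  0]]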

Vehicular answered 6/8, 2018 at 18:13 Comment(0)

You should call the method like this:

new_sample = ['A new sample to be classified']
seq = tokenizer.texts_to_sequences(new_sample)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)  # same maxlen used for the training data
pred = model.predict(padded)
Crocus answered 14/2, 2020 at 15:38 Comment(0)

You should try calling it like this:

sequences = tokenizer.texts_to_sequences(["physics is nice"])

Defector answered 13/6, 2019 at 16:26 Comment(0)

When you use pad_sequences, it pads the sequences to the same length, which in your case is num_words=vocabulary_size; that is why you are getting that output. Just try tokenizer.texts_to_sequences on its own, and it will give you the sequence of the words. Read more about padding: it is only used to make every row of your data the same length. Take an extreme case of two sentences, where sentence 1 has a length of 5 and sentence 2 has a length of 8. If we don't pad sentence 1 with 3 zeros, we cannot perform batch-wise training. Hope it helps.
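
A quick sketch of that idea, with two made-up index sequences of length 5 and 8:

from keras.preprocessing.sequence import pad_sequences

sentence1 = [3, 14, 15, 9, 2]         # 5 tokens
sentence2 = [6, 5, 3, 5, 8, 9, 7, 9]  # 8 tokens

batch = pad_sequences([sentence1, sentence2], maxlen=8)
print(batch)
# [[ 0  0  0  3 14 15  9  2]
#  [ 6  5  3  5  8  9  7  9]]
# Both rows now have the same length, so they can go into one training batch.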

Craquelure answered 31/12, 2019 at 14:18 Comment(0)

You can pass the text like below to get the output:

twt = ['He is a lazy person.']
twt = tokenizer.texts_to_sequences(twt)
print (twt)

or

twt = tokenizer.texts_to_sequences(['He is a lazy person.'])
print (twt)
Niles answered 23/5, 2020 at 15:46 Comment(0)
