How should we pad text sequences in Keras using pad_sequences?

3

12

I have coded a sequence-to-sequence learning LSTM in Keras myself, using knowledge gained from web tutorials and my own intuition. I converted my sample text to sequences and then padded them using the pad_sequences function in Keras.

from keras.preprocessing.text import Tokenizer,base_filter
from keras.preprocessing.sequence import pad_sequences

def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]

txt="abcdefghijklmn"*100

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt)
x = tk.texts_to_sequences(txt)
# shifting to the left
y = shift(x,1)

#padding sequence
max_len = 100
max_features=len(tk.word_counts)
X = pad_sequences(x, maxlen=max_len)
Y = pad_sequences(y, maxlen=max_len)

After a careful inspection, I found that my padded sequences look like this:

>>> X[0:6]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]], dtype=int32)
>>> X
array([[ 0,  0,  0, ...,  0,  0,  1],
       [ 0,  0,  0, ...,  0,  0,  3],
       [ 0,  0,  0, ...,  0,  0,  2],
       ..., 
       [ 0,  0,  0, ...,  0,  0, 13],
       [ 0,  0,  0, ...,  0,  0, 12],
       [ 0,  0,  0, ...,  0,  0, 14]], dtype=int32)

Is the padded sequence supposed to look like this? Except for the last column, the array is all zeros. I think I made a mistake in padding the text sequences; if so, can you tell me where the error is?

Edette answered 2/2, 2017 at 12:49 Comment(0)
10

If you want to tokenize by character, you can do it manually; it's not too complex.

First, build a vocabulary for your characters:

txt="abcdefghijklmn"*100
vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))}
vocab_char['<PAD>'] = 0

This associates a distinct number with every character in your txt. Index 0 is reserved for padding.

Having the reverse vocabulary will be useful for decoding the output.

rvocab = {v: k for k, v in vocab_char.items()}

Once you have this, you can split your text into sequences. Say you want sequences of length seq_len = 13:

seq_len = 13
[[vocab_char[char] for char in txt[i:(i+seq_len)]] for i in range(0,len(txt),seq_len)]

Your output will look something like this (the exact numbers depend on how set(txt) happens to order the characters):

[[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3], 
 [14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
 ...,
 [2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8], 
 [7, 2, 1, 5, 13, 11, 4, 3, 14]]

Note that the last sequence is shorter than the others. You can either discard it or pad it with pad_sequences using maxlen=13, which fills the missing positions with 0s (at the front by default, or at the end with padding='post').

You can build your targets Y the same way, by shifting everything by 1. :-)
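To make the steps above concrete, here is a minimal end-to-end sketch in plain Python. A few things are my own assumptions and not part of the original answer: sorted() is used so the character-to-index mapping is reproducible, the targets are rotated by one position like the shift function in the question, and chunk_and_pad is a hypothetical helper that pads the short final chunk at the end.

```python
txt = "abcdefghijklmn" * 100
seq_len = 13

# sorted() makes the mapping reproducible; index 0 is reserved for padding.
vocab_char = {ch: i + 1 for i, ch in enumerate(sorted(set(txt)))}
vocab_char["<PAD>"] = 0
rvocab = {i: ch for ch, i in vocab_char.items()}

# Encode the whole text as one stream of integer ids.
encoded = [vocab_char[ch] for ch in txt]
# Targets: the same stream rotated left by one, so each target is the next character.
shifted = encoded[1:] + encoded[:1]

def chunk_and_pad(ids, size, pad=0):
    """Hypothetical helper: cut ids into chunks of `size`, zero-padding the last one at the end."""
    chunks = [ids[i:i + size] for i in range(0, len(ids), size)]
    return [c + [pad] * (size - len(c)) for c in chunks]

X = chunk_and_pad(encoded, seq_len)
Y = chunk_and_pad(shifted, seq_len)

# Decode the first input row back to text with the reverse vocabulary.
decoded = "".join(rvocab[i] for i in X[0])
print(decoded)  # abcdefghijklm
```

Every row of X and Y has length 13, and only the last row contains padding zeros, so the rows can be stacked into a rectangular array before being fed to a model.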

I hope this helps.

Capitalization answered 7/2, 2017 at 7:55 Comment(0)
6

The problem is in this line:

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")

With split set to " " and data like yours, every sequence ends up consisting of a single word. That is why your padded sequences have only one non-zero element. To change that, try:

txt="a b c d e f g h i j k l m n "*100
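The x reported by the asker ([[1], [3], [2], [5], ...]) suggests that each character ended up as its own one-token text. Here is a plain-Python sketch of that effect, assuming the tokenizer simply loops over whatever it is given. It only mimics the idea and is not the Keras implementation; texts_seen, tokens_per_text and spaced are illustrative names.

```python
txt = "abcdefghijklmn" * 100

# Looping over a bare string yields one character per step, so anything that
# expects a list of texts sees 1400 one-character "texts" here.
texts_seen = list(txt)

# Splitting each one-character text on " " gives exactly one token per text,
# hence one id per sequence and a single non-zero value after padding.
tokens_per_text = [t.split(" ") for t in texts_seen]
print(len(texts_seen), tokens_per_text[:3])  # 1400 [['a'], ['b'], ['c']]

# Separating the characters with spaces AND wrapping the string in a list
# gives a single text that splits into many tokens.
spaced = [" ".join(txt)]
tokens = spaced[0].split(" ")
print(len(spaced), len(tokens))  # 1 1400
```

If that assumption holds for your Keras version, passing a list such as [txt] rather than the bare string to fit_on_texts and texts_to_sequences is worth trying.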
Footle answered 2/2, 2017 at 17:52 Comment(8)
Thank you for pointing out the error, but what is the best way to solve this? The Keras docs are very vague. - Edette
What are your sequences separated with? - Venom
My sequence looks something like this: abcdefghijklmnabcdefghijklmn.....mn. I want to separate it into individual letters, 'a b c d e f g h i j k l m n...', that is, as characters (char sequence-to-sequence learning). - Edette
Try "" as the split. - Venom
I already did that, but it gives an error: ValueError: maketrans arguments must have same length. I believe the problem is with pad_sequences, because with my previous parameters the Tokenizer split the characters and converted them into sequences: >>> x #result [[1], [3], [2], [5],... - Edette
I am still confused. I am trying to code a char-rnn; that's why I am splitting words into individual characters. For example, take "A Youtube user has uploaded a video showcasing the differences between The Evil Within running with boost mode on PS4 Pro and the base PS4." and split this text into its individual characters, not words. - Edette
I don't understand - so do you want to split your text into chars or words? - Venom
I want to write a char-rnn like github.com/fchollet/keras/blob/master/examples/… with the fewest lines of code, but it seems difficult. I have no idea what else to do now. - Edette
0

The padding argument controls whether the zeros are added before or after each sequence; the default is 'pre'. To pad at the end instead, use:

X = pad_sequences(x, maxlen=max_len, padding='post')
Y = pad_sequences(y, maxlen=max_len, padding='post')
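To see what the two settings do, here is a small plain-Python sketch. pad_sketch is a hypothetical helper written only to illustrate the idea; it is not the Keras function, and it assumes over-long sequences keep their last maxlen items, which matches the Keras default truncating='pre'.

```python
def pad_sketch(seqs, maxlen, padding="pre", value=0):
    """Hypothetical stand-in that zero-fills each sequence to maxlen."""
    out = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]             # keep the last maxlen items if too long
        fill = [value] * (maxlen - len(seq))  # zeros needed to reach maxlen
        out.append(fill + seq if padding == "pre" else seq + fill)
    return out

sample = [[1, 2, 3], [4]]
pre = pad_sketch(sample, 5)                   # zeros in front (the default)
post = pad_sketch(sample, 5, padding="post")  # zeros at the end
print(pre)   # [[0, 0, 1, 2, 3], [0, 0, 0, 0, 4]]
print(post)  # [[1, 2, 3, 0, 0], [4, 0, 0, 0, 0]]
```

Note that padding='post' only moves the zeros. If every sequence holds a single id, as in the question, each row still has just one non-zero value; it simply appears in the first column instead of the last.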
Senn answered 6/1, 2020 at 1:51 Comment(0)
