How to convert predicted sequence back to text in keras?

I have a sequence-to-sequence learning model which works fine and is able to predict some outputs. The problem is I have no idea how to convert the output back to a text sequence.

This is my code.

from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense

txt1="""What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model 
to learn the long term context or dependencies between symbols in the input sequence."""

#txt1 is used for fitting 
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt1)

#convert text to sequence
t= tk.texts_to_sequences(txt1)

#padding to feed the sequence to keras model
t=pad_sequences(t, maxlen=10)

model = Sequential()
model.add(Dense(10,input_dim=10))
model.add(Dense(10,activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

#predicting a new sequence
pred=model.predict(t)

#Convert predicted sequence to text
pred=??
Concertino answered 1/2, 2017 at 3:51 Comment(4)
still no answer?Mephitis
@BenUsman Have you found a solution for this issue? I'm experiencing the same.Delorenzo
@Delorenzo see posted answerMephitis
@Concertino Maybe you should accept one of the answers to get the post closed.Firstly

You can directly use the inverse function tokenizer.sequences_to_texts.

    text = tokenizer.sequences_to_texts(<list_of_integer_equivalent_encodings>)

I have tested the above and it works as expected.

P.S.: Take extra care that the argument is the list of integer encodings and not the one-hot ones.
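For reference, here is a minimal, self-contained sketch of the round trip (the texts, num_words and maxlen below are illustrative assumptions, not values from the question):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ['the cat sat on the mat', 'the dog ate my homework']

tokenizer = Tokenizer(num_words=2000, lower=True)
tokenizer.fit_on_texts(texts)

seqs = tokenizer.texts_to_sequences(texts)   # lists of integer word indices
padded = pad_sequences(seqs, maxlen=10)      # zero-padded on the left by default

# Drop the padding zeros, then decode back to strings
unpadded = [[i for i in row if i != 0] for row in padded]
print(tokenizer.sequences_to_texts(unpadded))
# ['the cat sat on the mat', 'the dog ate my homework']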

Scarface answered 23/11, 2018 at 18:30 Comment(2)
Seems to be the most straightforward answer, and if you need to see what it does, try the following line: print(tokenizer.sequences_to_texts([[1]]))Africa
Be sure to remove the padding (i.e. the encoding used for padding) and the boolean encoding from <list-of-integer-equivalent-encodings> before running sequences_to_texts over it.Triform

Here is a solution I found:

reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
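For example, assuming pred_seq is a stand-in for one predicted sequence of word indices (hypothetical, not the question's model output), the map can be applied like this:

# pred_seq is a hypothetical predicted sequence of word indices
pred_seq = [3, 1, 7, 0, 0]

# Index 0 is reserved for padding and never appears in word_index,
# so padded positions are simply skipped
decoded = ' '.join(reverse_word_map[i] for i in pred_seq if i in reverse_word_map)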
Mephitis answered 12/5, 2017 at 1:28 Comment(0)

I had to resolve the same problem, so here is how I ended up doing it (inspired by @Ben Usman's reversed dictionary).

# Importing library
from keras.preprocessing.text import Tokenizer

# My texts
texts = ['These are two crazy sentences', 'that I want to convert back and forth']

# Creating a tokenizer
tokenizer = Tokenizer(lower=True)

# Building word indices
tokenizer.fit_on_texts(texts)

# Tokenizing sentences
sentences = tokenizer.texts_to_sequences(texts)

>sentences
>[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11, 12, 13]]

# Creating a reverse dictionary
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

# Function takes a tokenized sentence and returns the words
def sequence_to_text(list_of_indices):
    # Looking up words in dictionary
    words = [reverse_word_map.get(letter) for letter in list_of_indices]
    return(words)

# Creating texts 
my_texts = list(map(sequence_to_text, sentences))

>my_texts
>[['these', 'are', 'two', 'crazy', 'sentences'], ['that', 'i', 'want', 'to', 'convert', 'back', 'and', 'forth']]
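If you prefer plain strings and your sequences may be zero-padded, a small variation on the same idea (a sketch, assuming the same reverse_word_map and sentences as above): reverse_word_map.get returns None for the padding index 0, so filter those out before joining.

# Turn a (possibly zero-padded) sequence of indices into a single string
def sequence_to_string(list_of_indices):
    words = [reverse_word_map.get(i) for i in list_of_indices]
    return ' '.join(w for w in words if w is not None)

my_strings = list(map(sequence_to_string, sentences))
# ['these are two crazy sentences', 'that i want to convert back and forth']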
Firstly answered 20/11, 2018 at 11:10 Comment(1)
Just an alternative piece of code for reversing the word_index order: reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])Epitasis

You can make a dictionary that maps the index back to the character.

index_word = {v: k for k, v in tk.word_index.items()} # map back
seqs = tk.texts_to_sequences(txt1)
words = []
for seq in seqs:
    if len(seq):
        words.append(index_word.get(seq[0]))
    else:
        words.append(' ')
print(''.join(words)) # output

>>> 'what makes this problem difficult is that the sequences can vary in length  
>>> be comprised of a very large vocabulary of input symbols and may require the model  
>>> to learn the long term context or dependencies between symbols in the input sequence '

However, in the question, you're trying to use a sequence of characters to predict an output of 10 classes, which is not a sequence-to-sequence model. In that case, you cannot simply turn the prediction (or pred.argmax(axis=1)) back into a sequence of characters.
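For contrast, if the model really did emit a softmax over the vocabulary at every timestep, decoding would look roughly like this sketch (the shapes and the random stand-in array are assumptions, not the question's model):

import numpy as np

# Stand-in for a per-timestep softmax output of shape (batch, timesteps, vocab_size)
pred = np.random.rand(2, 10, 50)

index_word = {v: k for k, v in tk.word_index.items()}  # same reverse map as above

best = pred.argmax(axis=-1)                  # most likely index at each timestep
for seq in best:
    words = [index_word.get(i) for i in seq]
    print(' '.join(w for w in words if w))   # indices not in the vocabulary are skipped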

Duster answered 12/5, 2017 at 2:47 Comment(0)
import numpy as np

# Class with the highest score for each test example
p_test = model.predict(data_test).argmax(axis=1)

# Show some misclassified examples
misclassified_idx = np.where(p_test != Ytest)[0]
print(len(misclassified_idx))

i = np.random.choice(misclassified_idx)
print(i)
print(df_test[i])
print('True label %s Predicted label %s' % (Ytest[i], p_test[i]))

Here df_test is the original text and data_test is the corresponding sequence of integers.
Heteronomy answered 13/5, 2020 at 3:40 Comment(1)
Please make sure to describe the code you are postingZn
