Tensorflow: Can't understand ctc_beam_search_decoder() output sequence

I am using Tensorflow's tf.nn.ctc_beam_search_decoder() to decode the output of a RNN doing some many-to-many mapping (i.e., multiple softmax outputs for each network cell).

A simplified version of the network's output and the Beam search decoder is:

import numpy as np
import tensorflow as tf

batch_size = 4
sequence_max_len = 5
num_classes = 3

y_pred = tf.placeholder(tf.float32, shape=(batch_size, sequence_max_len, num_classes))
y_pred_transposed = tf.transpose(y_pred,
                                 perm=[1, 0, 2])  # TF expects dimensions [max_time, batch_size, num_classes]
logits = tf.log(y_pred_transposed)
sequence_lengths = tf.to_int32(tf.fill([batch_size], sequence_max_len))
decoded, log_probabilities = tf.nn.ctc_beam_search_decoder(logits,
                                                           sequence_length=sequence_lengths,
                                                           beam_width=3,
                                                           merge_repeated=False, top_paths=1)

decoded = decoded[0]
decoded_paths = tf.sparse_tensor_to_dense(decoded)  # Shape: [batch_size, max_sequence_len]

with tf.Session() as session:
    tf.global_variables_initializer().run()

    softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])

    decoded_paths = session.run(decoded_paths, feed_dict = {y_pred: softmax_outputs})
    print(decoded_paths)

The output in this case is:

[[0]
 [1]
 [1]
 [1]]

My understanding is that the output tensor should be of dimensions [batch_size, max_sequence_len], with each row containing the indices of the relevant classes in the found path.

In this case I would expect the output to be similar to:

[[2, 0, 0, 0, 0],
 [2, 2, 2, 2, 2],
 [1, 2, 2, 2, 2],
 [2, 2, 2, 2, 2]]

What am I not understanding about how ctc_beam_search_decoder works?

As indicated in tf.nn.ctc_beam_search_decoder documentation, the shape of the output is not [batch_size, max_sequence_len]. Instead, it is

[batch_size, max_decoded_length[j]]

(with j=0 in your case).

Based on the beginning of section 2 of this paper (which is cited in the github repository), max_decoded_length[0] is bounded from above by max_sequence_len, but they are not necessarily equal. The relevant citation is:

Let S be a set of training examples drawn from a fixed distribution D_{XxZ}. The input space X = (R^m) is the set of all sequences of m dimensional real valued vectors. The target space Z = L* is the set of all sequences over the (finite) alphabet L of labels. In general, we refer to elements of L* as label sequences or labellings. Each example in S consists of a pair of sequences (x, z). The target sequence z = (z1, z2, ..., zU) is at most as long as the input sequence x = (x1, x2, ..., xT ), i.e. U<=T. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them.

In fact, max_decoded_length[0] depends on the specific matrix softmax_outputs. In particular, two such matrices with exactly the same dimensions can result in different max_decoded_length[0].

For example, if you replace the row

softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])

with the rows

np.random.seed(7)
r=np.random.randint(0,100,size=(4,5,3))
softmax_outputs=r/np.sum(r,2).reshape(4,5,1)

you'll get the output

[[1 0 1]
 [1 0 1]
 [1 0 0]
 [1 0 0]]

(in the above examples, softmax_outputs consists of logits and it is exactly of the same dimensions as the matrix you provided).

On the other hand, changing the seed to np.random.seed(50) gives the output

[[1 0]
 [1 0]
 [1 0]
 [0 1]]

P.S.

Regarding the last part of your question:

In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0],
 [2, 2, 2, 2, 2],
 [1, 2, 2, 2, 2],
 [2, 2, 2, 2, 2]]

Note that, based on the documentation, num_classes actually represents num_labels + 1. Specifically:

The inputs Tensor's innermost dimension size, num_classes, represents num_labels + 1 classes, where num_labels is the number of true labels, and the largest value (num_classes - 1) is reserved for the blank label.

For example, for a vocabulary containing 3 labels [a, b, c], num_classes = 4 and the labels indexing is {a: 0, b: 1, c: 2, blank: 3}.

So the true labels in your case are 0 and 1, and 2 is reserved for the blank label. The blank label represents the situation of observing no label (section 3.1 here):

A CTC network has a softmax output layer (Bridle, 1990) with one more unit than there are labels in L. The activations of the first |L| units are interpreted as the probabilities of observing the corresponding labels at particular times. The activation of the extra unit is the probability of observing a ‘blank’, or no label. Together, these outputs define the probabilities of all possible ways of aligning all possible label sequences with the input sequence.

Recommended topics

Hot tags