How can LSTM attention have variable-length input?

As I understand it, the attention mechanism of an LSTM encoder/decoder is a straight softmax feed-forward network that takes in the hidden state of each encoder time step and the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined, and 2) the number of hidden states of the encoder is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as if I were training a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in advance.

Coplanar answered 8/6, 2017 at 18:48 Comment(1)
Here's a nice visualization of attention that I came across: towardsdatascience.com/… – Michaeline

I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T:

c_i = α_{i1} h_1 + ... + α_{iT} h_T

The number of time steps T may be different for each sample because the coefficients α_{ij} do not form a vector of fixed size. In fact, they are computed as softmax(e_{i1}, ..., e_{iT}), where each e_{ij} is the output of a neural network whose inputs are the encoder hidden state h_j and the previous decoder hidden state s_{i-1}:

e_{ij} = f(s_{i-1}, h_j)

Thus, before y_i is computed, this neural network must be evaluated T times, producing the T weights α_{i1}, ..., α_{iT}. Also, this tensorflow implementation might be useful.
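
To see why a variable T is no problem, here is a minimal NumPy sketch of the idea (the weight names W_s, W_h, v and all sizes are made up for illustration; f is modeled as a one-hidden-layer tanh network, similar in spirit to the alignment model in the paper). The same fixed-size weights are reused once per encoder step, so nothing in the parameters depends on T:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden = 4                        # size of each hidden state (arbitrary here)
W_s = np.random.randn(8, hidden)  # projects the decoder state s_{i-1}
W_h = np.random.randn(8, hidden)  # projects one encoder state h_j
v = np.random.randn(8)            # reduces the combined projection to a scalar e_ij

def context_vector(s_prev, encoder_states):
    # One score e_ij per encoder step; T = len(encoder_states) may vary.
    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in encoder_states])
    alpha = softmax(e)             # T weights alpha_i1..alpha_iT, summing to 1
    return alpha @ encoder_states  # c_i = alpha_i1 h_1 + ... + alpha_iT h_T

s_prev = np.random.randn(hidden)
for T in (5, 9):                  # two different sequence lengths, same weights
    H = np.random.randn(T, hidden)
    print(T, context_vector(s_prev, H).shape)  # (4,) both times

The only thing that grows with T is the number of times f is evaluated, not the number of parameters.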

Caulescent answered 22/7, 2017 at 2:49 Comment(4)
Congratulations on your first answer, which demonstrates research and is very well formatted! – Torrell
I'm still a little confused, given that T is a variable number of inputs. After looking through the paper and the implementation you provided (thanks for that, great answer too by the way!), it seems like the solution is to simply fix an upper limit on the number of time steps T. In order to compute the alpha values, which requires a standard neural network layer transformation, we need to decide on a fixed number of alpha values to output from that transformation. I'd love to get a solid confirmation about this point, though. It's been really hard to extrapolate from this paper and others. – Michaeline
The output of the neural network f is a single coefficient e_ij. This NN is evaluated T times, and T can be arbitrary. The alpha values are the softmax of these T numbers. The softmax operation takes N numbers and produces N numbers, and N doesn't have to be fixed. Therefore, there's no need for an upper bound on T. I hope I'm getting things right, because I've recently used a Keras attention layer (gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2) which required a fixed T, so I had to pad the dataset. – Caulescent
@DavidParks Here I've written a slightly different explanation; hope it complements this answer. – Rogers
import tensorflow as tf                    # TensorFlow 1.x API
from tensorflow.contrib import layers

L2_REG = 1e-4  # regularization strength; the original snippet left this constant undefined

def attention(inputs, size, scope):
    # inputs: [batch, time, size] LSTM outputs; returns one [batch, size] vector per sample.
    with tf.variable_scope(scope or 'attention'):
        # A learned "query" vector that scores each time step.
        attention_context_vector = tf.get_variable(name='attention_context_vector',
                                                   shape=[size],
                                                   regularizer=layers.l2_regularizer(scale=L2_REG),
                                                   dtype=tf.float32)
        # Project every time step through a tanh layer.
        input_projection = layers.fully_connected(inputs, size,
                                                  activation_fn=tf.tanh,
                                                  weights_regularizer=layers.l2_regularizer(scale=L2_REG))
        # Dot product with the context vector gives one score per time step: [batch, time, 1].
        vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector),
                                    axis=2, keep_dims=True)
        # Softmax over the time axis; works for any sequence length.
        attention_weights = tf.nn.softmax(vector_attn, dim=1)
        # Weighted sum of the original inputs collapses the time dimension.
        weighted_projection = tf.multiply(inputs, attention_weights)
        outputs = tf.reduce_sum(weighted_projection, axis=1)
    return outputs

Hope this piece of code helps you understand how attention works. I use this function in my document classification work, which is an LSTM-attention model, different from your encoder/decoder model.
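
For reference, here's a hypothetical call site (TF 1.x); embedded_docs and num_classes are placeholders I've introduced for illustration, not part of the original answer:

cell = tf.nn.rnn_cell.LSTMCell(128)
outputs, _ = tf.nn.dynamic_rnn(cell, embedded_docs, dtype=tf.float32)  # [batch, time, 128]
doc_vector = attention(outputs, size=128, scope='doc_attention')       # [batch, 128]
logits = layers.fully_connected(doc_vector, num_classes, activation_fn=None)

Because the softmax inside attention runs over the time axis, sequences of any length flow through the same graph (padding steps would ideally be masked before the softmax, which this sketch doesn't do).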

Somber answered 8/3, 2018 at 3:18 Comment(0)
