As I understand it, the attention mechanism in an LSTM encoder/decoder is a plain feed-forward network followed by a softmax, which takes in the encoder's hidden states from every time step along with the decoder's current state.
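To make my mental picture concrete, here is a minimal sketch of what I think that network looks like, assuming Bahdanau-style additive attention in PyTorch (the class, method, and variable names are my own, not taken from any particular implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Sketch of my understanding of the attention 'feed forward network'."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder states
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)  # projects decoder state
        self.v = nn.Linear(attn_dim, 1, bias=False)            # reduces to one score

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, src_len, enc_dim) -- src_len differs between sequences
        # dec_state:  (batch, dec_dim)          -- decoder's current hidden state
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                       # (batch, src_len)
        weights = F.softmax(scores, dim=-1)  # softmax over the source time steps
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights              # context: (batch, enc_dim)
```

The shapes are exactly what confuse me, since src_len is not fixed.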
These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined, but 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).
Am I misunderstanding something? Also, would training work the same way as for a regular encoder/decoder network, or would I have to train the attention mechanism separately?
Thanks in advance.