machine-learning, neural-network, lstm, recurrent-neural-network, attention-model

How can LSTM attention have variable length input


The attention mechanism of an LSTM encoder/decoder is a plain softmax feed-forward network that takes in the hidden state of each encoder time step together with the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined; 2) the number of hidden states of the encoder is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as if I were to train a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in Advance


Solution

  • I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output $y_i$ of the decoder network, a context vector $c_i$ is computed as a weighted sum of the encoder hidden states $h_1, \dots, h_T$:

    $$c_i = \alpha_{i1} h_1 + \dots + \alpha_{iT} h_T$$

    The number of time steps $T$ may be different for each sample because the coefficients $\alpha_{ij}$ are not a vector of fixed size. In fact, they are computed as $\mathrm{softmax}(e_{i1}, \dots, e_{iT})$, where each $e_{ij}$ is the output of a neural network whose inputs are the encoder hidden state $h_j$ and the decoder hidden state $s_{i-1}$:

    $$e_{ij} = f(s_{i-1}, h_j)$$

    Thus, before $y_i$ is computed, this neural network must be evaluated $T$ times, producing the $T$ weights $\alpha_{i1}, \dots, \alpha_{iT}$; a minimal sketch is given below. Also, this tensorflow implementation might be useful.
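
Here is a minimal NumPy sketch of one decoder step of this additive attention, assuming a tanh scoring network for $f$ with illustrative parameters W_h, W_s and v (these names and dimensions are my own choice, not taken from the paper or any particular library). The point it illustrates is that $f$ only ever sees one pair $(s_{i-1}, h_j)$ at a time, so its input size is fixed even though $T$ varies from sample to sample.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attention_context(h, s_prev, W_h, W_s, v):
    """Compute the context vector c_i for one decoder step.

    h      : (T, d_h) encoder hidden states h_1..h_T (T varies per sample)
    s_prev : (d_s,)   previous decoder hidden state s_{i-1}
    W_h, W_s, v : parameters of the small scoring network f (assumed tanh form)
    """
    # e_{ij} = f(s_{i-1}, h_j): the same scoring network is applied to every
    # encoder state in turn, which is why T does not need to be fixed.
    scores = np.array([v @ np.tanh(W_h @ h_j + W_s @ s_prev) for h_j in h])
    alpha = softmax(scores)                 # alpha_{i1}, ..., alpha_{iT}
    c = (alpha[:, None] * h).sum(axis=0)    # c_i = sum_j alpha_{ij} h_j
    return c, alpha

# Example: sequences of T=5 and T=9 encoder states work with the same
# parameters, because the scoring network only sees one (s_{i-1}, h_j) pair
# at a time; the context vector always has the fixed size d_h.
rng = np.random.default_rng(0)
d_h, d_s, d_a = 4, 3, 6
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)
for T in (5, 9):
    h = rng.normal(size=(T, d_h))
    s_prev = rng.normal(size=d_s)
    c, alpha = attention_context(h, s_prev, W_h, W_s, v)
    print(T, c.shape, alpha.shape)  # context is (d_h,); alpha has length T
```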