machine-learning, neural-network, lstm, recurrent-neural-network, attention-model

How can LSTM attention have variable length input


The attention mechanism of an LSTM encoder/decoder is a plain softmax feed-forward network that takes in the hidden state of each encoder time step together with the decoder's current state.

These two points seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network needs to be predefined; 2) the number of hidden states of the encoder is variable (it depends on the number of time steps during encoding).

Am I misunderstanding something? Also, would training be the same as if I were to train a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in Advance


Solution

  • I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output $y_i$ of the decoder network, a context vector $c_i$ is computed as a weighted sum of the encoder hidden states $h_1, \dots, h_T$:

    $$c_i = \alpha_{i1} h_1 + \dots + \alpha_{iT} h_T$$

    The number of time steps $T$ may be different for each sample because the coefficients $\alpha_{ij}$ are not a vector of fixed size. In fact, they are computed as $\mathrm{softmax}(e_{i1}, \dots, e_{iT})$, where each $e_{ij}$ is the output of a neural network whose inputs are the encoder hidden state $h_j$ and the decoder hidden state $s_{i-1}$:

    $$e_{ij} = f(s_{i-1}, h_j)$$

    Thus, before $y_i$ is computed, this neural network must be evaluated $T$ times, producing the $T$ weights $\alpha_{i1}, \dots, \alpha_{iT}$; a minimal sketch is given below. Also, this tensorflow implementation might be useful.
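
Here is a minimal NumPy sketch of one decoder step of this additive attention, assuming a tanh scoring network for $f$ with illustrative parameters W_h, W_s and v (these names and dimensions are my own choice, not taken from the paper or any particular library). The point it illustrates is that $f$ only ever sees one pair $(s_{i-1}, h_j)$ at a time, so its input size is fixed even though $T$ varies from sample to sample.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attention_context(h, s_prev, W_h, W_s, v):
    """Compute the context vector c_i for one decoder step.

    h      : (T, d_h) encoder hidden states h_1..h_T (T varies per sample)
    s_prev : (d_s,)   previous decoder hidden state s_{i-1}
    W_h, W_s, v : parameters of the small scoring network f (assumed tanh form)
    """
    # e_{ij} = f(s_{i-1}, h_j): the same scoring network is applied to every
    # encoder state in turn, which is why T does not need to be fixed.
    scores = np.array([v @ np.tanh(W_h @ h_j + W_s @ s_prev) for h_j in h])
    alpha = softmax(scores)                 # alpha_{i1}, ..., alpha_{iT}
    c = (alpha[:, None] * h).sum(axis=0)    # c_i = sum_j alpha_{ij} h_j
    return c, alpha

# Example: sequences of T=5 and T=9 encoder states work with the same
# parameters, because the scoring network only sees one (s_{i-1}, h_j) pair
# at a time; the context vector always has the fixed size d_h.
rng = np.random.default_rng(0)
d_h, d_s, d_a = 4, 3, 6
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)
for T in (5, 9):
    h = rng.normal(size=(T, d_h))
    s_prev = rng.normal(size=d_s)
    c, alpha = attention_context(h, s_prev, W_h, W_s, v)
    print(T, c.shape, alpha.shape)  # context is (d_h,); alpha has length T
```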