I am currently learning about seq2seq translation and following the PyTorch tutorial at https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder.
The tutorial uses an attention technique. Which of the two do they use, Luong or Bahdanau? Another question: why do they apply a ReLU layer before the GRU cell? Finally, the red box in the figure is called a context vector, right?
Which technique do they use, Luong or Bahdanau?
Luong attention is multiplicative, so the tutorial must be using Bahdanau (additive) attention: it concatenates the inputs and then applies a linear layer. See http://ruder.io/deep-learning-nlp-best-practices/index.html#attention for more about attention types.
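To make the distinction concrete, here is a minimal sketch of the two score functions in their generic forms (not the tutorial's exact code; the tensor shapes and variable names are my own assumptions):

```python
import torch
import torch.nn as nn

hidden_size = 256
dec_hidden = torch.randn(1, hidden_size)      # current decoder hidden state
enc_outputs = torch.randn(10, hidden_size)    # encoder outputs for a 10-token source

# Luong (multiplicative): score is a (bi)linear product between states
W = nn.Linear(hidden_size, hidden_size, bias=False)
luong_scores = enc_outputs @ W(dec_hidden).squeeze(0)                 # shape (10,)

# Bahdanau (additive): concatenate, linear layer + tanh, then project to a scalar
W_a = nn.Linear(2 * hidden_size, hidden_size)
v = nn.Linear(hidden_size, 1, bias=False)
concat = torch.cat((dec_hidden.expand(10, -1), enc_outputs), dim=1)   # (10, 2*hidden)
bahdanau_scores = v(torch.tanh(W_a(concat))).squeeze(1)               # shape (10,)

# either set of scores is softmaxed into attention weights, and the
# weighted sum of encoder outputs is the context vector
attn_weights = torch.softmax(bahdanau_scores, dim=0)
context = attn_weights @ enc_outputs                                  # shape (hidden_size,)
```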
Why do they apply a ReLU layer before the GRU cell?
That ReLU is the activation after the Linear layer that combines the embedded input with the attention output. I think tanh was used originally, but ReLU has since become the preferred choice.
I think the other ReLU, the one applied right after the embeddings in the plain (non-attention) decoder, is there by mistake though:
https://github.com/spro/practical-pytorch/issues/4
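For context, here is a minimal sketch of one decoding step in the attention decoder, showing where that ReLU sits (between the combining Linear layer and the GRU). Layer names, sizes, and the fixed max_length are paraphrased from memory of the tutorial, so treat them as assumptions rather than the tutorial's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, MAX_LENGTH, VOCAB = 256, 10, 1000   # dummy sizes, not the tutorial's

class AttnDecoderSketch(nn.Module):
    def __init__(self, hidden_size=HIDDEN, output_size=VOCAB, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)           # attention scores
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)  # mixes embedding + context
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input_token, hidden, encoder_outputs):
        embedded = self.embedding(input_token).view(1, 1, -1)

        # attention weights from the embedded input and the previous hidden state
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # weighted sum of encoder outputs -> the context vector (the "red box")
        context = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

        # Linear layer combining embedding and context, ReLU as its activation,
        # and only then the GRU cell
        output = self.attn_combine(torch.cat((embedded[0], context[0]), 1)).unsqueeze(0)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        return F.log_softmax(self.out(output[0]), dim=1), hidden, attn_weights

# one decoding step with dummy tensors
decoder = AttnDecoderSketch()
step_out, hidden, attn = decoder(torch.tensor([[0]]),
                                 torch.zeros(1, 1, HIDDEN),
                                 torch.zeros(MAX_LENGTH, HIDDEN))
```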
the red box in the figure is called a context vector, right?
Yes, the red box is the context vector: the attention-weighted sum of the encoder outputs.