machine-learning, deep-learning, nlp, transformer-model

Why do Transformers in Natural Language Processing need a stack of encoders?


I am following this blog on transformers

http://jalammar.github.io/illustrated-transformer/

The only thing I don't understand is why there needs to be a stack of encoders or decoders. I understand that the multi-headed attention layers capture different representation spaces of the problem, but I don't see why a vertical stack of encoders and decoders is necessary. Wouldn't a single encoder/decoder layer work?


Solution

  • Stacking layers is what makes any deep learning architecture powerful. A single encoder/decoder with attention would not be able to capture the complexity needed to model an entire language or achieve high accuracy on tasks as complex as language translation. Stacking encoders and decoders allows the network to extract hierarchical features and model complex problems: each layer refines the representation produced by the layer below it, so the upper layers can build more abstract features on top of the lower ones (see the sketch after this answer for how such a stack is typically wired up).
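
For concreteness, here is a minimal sketch of what "a stack of encoders" means in practice. It uses PyTorch's built-in Transformer modules, which the linked blog does not use, so treat the specific sizes (d_model=512, nhead=8, num_layers=6, taken from the original Transformer paper) as illustrative assumptions: the same encoder-layer architecture is instantiated N times, and the output of one layer is fed as the input to the next.

    import torch
    import torch.nn as nn

    # One encoder layer: multi-head self-attention followed by a feed-forward network.
    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

    # The "stack": num_layers copies of that layer applied one after another,
    # each one refining the representation produced by the layer below it.
    encoder_stack = nn.TransformerEncoder(encoder_layer, num_layers=6)

    # Dummy input: sequence length 10, batch size 32, embedding size 512
    # (the default layout is (seq_len, batch, d_model)).
    src = torch.rand(10, 32, 512)

    out = encoder_stack(src)
    print(out.shape)  # torch.Size([10, 32, 512]) -- same shape, deeper representation

Note that the shape of the representation never changes as it moves up the stack; what changes is how much context and abstraction each position's vector encodes, which is exactly what a single layer cannot provide on its own.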