python, pytorch, transformer-model, encoder

Annotated Transformer - Why x + DropOut(Sublayer(LayerNorm(x)))?


Please clarify whether the LayerNorm placement in the Annotated Transformer encoder is correct.


The Transformer paper says the output of each sub-layer is LayerNorm(x + Dropout(SubLayer(x))).


LayerNorm should be applied after Dropout(SubLayer(x)), as per the paper.


However, the Annotated Transformer implementation computes x + Dropout(SubLayer(LayerNorm(x))), i.e. LayerNorm is applied before the sub-layer, which is the other way around:

import torch.nn as nn

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)   # custom LayerNorm defined earlier in the Annotated Transformer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))   # <--- LayerNorm before SubLayer
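
For comparison, here is a minimal sketch of what I would expect from the paper's post-norm formula, written in the same style. PostNormSublayerConnection is just an illustrative name, and it uses the built-in nn.LayerNorm rather than the custom class:

import torch.nn as nn

class PostNormSublayerConnection(nn.Module):
    """Post-norm ordering from the paper: LayerNorm(x + Dropout(SubLayer(x)))."""

    def __init__(self, size, dropout):
        super(PostNormSublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)       # built-in LayerNorm, for a self-contained example
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # The sub-layer sees the raw input; the norm is applied after the residual addition.
        return self.norm(x + self.dropout(sublayer(x)))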

Solution

  • The original paper applies Dropout to the output of the sub-layer (e.g. Multi-Head Attention) before the residual connection and layer normalization. This is called post-normalization (Post-LN):

    We apply dropout to the output of each sub-layer, before it is added to the sub-layer input (x) and (layer) normalized.

    However, the more recent approach is pre-normalization (Pre-LN), where LayerNorm is applied to the input x before it enters the sub-layer, as explained in Let's build GPT: from scratch, in code, spelled out:

    Very few details about the Transformer have changed in the last five years, but there is something that slightly departs from the original paper. You see that Add & Norm is applied after the transformation (Multi-Head Attention). But now it is more common to apply the LayerNorm before the transformation, so there is a reshuffling of the layer norms. This is called the pre-norm formulation, and that is the one we are going to implement as well.

    This is proposed in On Layer Normalization in the Transformer Architecture.


    The Annotated Transformer also follows this pre-norm approach (see the PyTorch sketch below).
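
As a side note, recent PyTorch versions expose both orderings in the built-in nn.TransformerEncoderLayer via its norm_first flag, so the two formulations can be compared without any custom module. A minimal sketch, with arbitrary shapes:

import torch
import torch.nn as nn

d_model, nhead = 512, 8
x = torch.randn(10, 32, d_model)   # (seq_len, batch, d_model)

# Post-norm, as in the original paper (the default): LayerNorm after the residual addition.
post_ln = nn.TransformerEncoderLayer(d_model, nhead, norm_first=False)

# Pre-norm, as in the Annotated Transformer and the Pre-LN paper: LayerNorm before each sub-layer.
pre_ln = nn.TransformerEncoderLayer(d_model, nhead, norm_first=True)

print(post_ln(x).shape)   # torch.Size([10, 32, 512])
print(pre_ln(x).shape)    # torch.Size([10, 32, 512])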