Following the multi-headed attention layer in a BERT encoder block, is layer normalization done separately on the embedding of each token (i.e., one mean and variance per token embedding), or on the concatenated vector of all token embeddings (the same mean and variance for all embeddings)?
I tracked down the full details of layer normalization (LN) in BERT here.
Mean and variance are computed per token, i.e., over the hidden (embedding) dimension of each token separately. The learned weight (scale) and bias parameters in LN, however, are not per token - they are per embedding dimension, shared across all tokens.
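To make this concrete, here is a minimal sketch using PyTorch's `nn.LayerNorm` (not BERT's exact code); the batch size, sequence length, and hidden size are illustrative values I chose, not anything from the question:

```python
import torch
import torch.nn as nn

# Illustrative shapes (not from BERT's config): batch of 2, 4 tokens, hidden size 768.
batch, seq_len, hidden_size = 2, 4, 768
x = torch.randn(batch, seq_len, hidden_size)

ln = nn.LayerNorm(hidden_size)  # normalized_shape = the last (hidden) dimension only
out = ln(x)

# Statistics are taken over the hidden dimension, so each token gets its own
# mean and variance: every (batch, position) slice is ~zero-mean, ~unit-variance
# (before the learned affine transform, which is the identity at initialization).
print(out.mean(dim=-1).abs().max())      # ~0 for every token
print(out.std(dim=-1, unbiased=False))   # ~1 for every token

# The learned parameters are per embedding dimension, shared across all tokens.
print(ln.weight.shape, ln.bias.shape)    # torch.Size([768]) torch.Size([768])
```

So the answer to the question is the first option: one mean and variance per token embedding, with a single learned scale and bias vector of length `hidden_size` reused for every token.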