The following code prints 768 weight and 768 bias parameters for each LayerNorm layer.
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# Print the parameter count of every LayerNorm weight and bias tensor
for name, param in model.named_parameters():
    if 'LayerNorm' in name:
        print(f"Layer: {name}, Parameters: {param.numel()}")
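Running this against bert-base-uncased should produce output along these lines (parameter names taken from the standard Hugging Face checkpoint; only the first few lines shown):

Layer: embeddings.LayerNorm.weight, Parameters: 768
Layer: embeddings.LayerNorm.bias, Parameters: 768
Layer: encoder.layer.0.attention.output.LayerNorm.weight, Parameters: 768
Layer: encoder.layer.0.attention.output.LayerNorm.bias, Parameters: 768
...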
As per this video, mean and std values are computed for each token in the input, and each mean, std pair has its own learned weight and bias in layer normalization. Since BERT takes in a maximum of 512 tokens, I'd expect a total of 512 weight and 512 bias parameters in each LayerNorm layer.
So why is it 768? Is the video incorrect? Is normalization instead performed for each of the 768 embedding dimensions across all tokens, meaning each mean and std is computed across a maximum of 512 values?
After quite a bit of research and code review, I was able to isolate the details of layer normalization (LN), an aspect of transformers that seems to confuse a lot of people.
TL;DR: The assumption in my original question, that each mean, std pair has its own weight, bias pair, is incorrect. In LN, the mean and std statistics are computed across the embedding dimensions of each token, so there are as many mean, std pairs as there are tokens. The weight and bias values, however, are learned per embedding dimension, i.e., all tokens share the same weight and bias during LN. For BERT base, this means there are at most 512 mean and std values per sequence, but exactly 768 weight and 768 bias values per LayerNorm layer.
For complete details, see my answer to this question.