Consider the following example:
import torch

batch, sentence_length, embedding_dim = 2, 3, 4
embedding = torch.randn(batch, sentence_length, embedding_dim)
print(embedding)
# Output:
tensor([[[-2.1918,  1.2574, -0.3838,  1.3870],
         [-0.4043,  1.2972, -1.7326,  0.4047],
         [ 0.4560,  0.6482,  1.0858,  2.2086]],

        [[-1.4964,  0.3722, -0.7766,  0.3062],
         [ 0.9812,  0.1709, -0.9177, -1.2558],
         [-1.1560, -0.0367,  0.5496, -1.1142]]])
Applying LayerNorm, which normalizes across the embedding dimension, I get:
layer_norm = torch.nn.LayerNorm(embedding_dim)
layer_norm(embedding)
# Output:
tensor([[[-1.5194,  0.8530, -0.2758,  0.9422],
         [-0.2653,  1.2620, -1.4576,  0.4609],
         [-0.9470, -0.6641, -0.0204,  1.6315]],

        [[-1.4058,  0.9872, -0.4840,  0.9026],
         [ 1.3933,  0.4803, -0.7463, -1.1273],
         [-0.9869,  0.5545,  1.3619, -0.9294]]],
       grad_fn=<NativeLayerNormBackward0>)
Now, when I normalize the first vector of the above embedding tensor with a naive Python implementation, I get:
import math
import statistics

a = [-2.1918, 1.2574, -0.3838, 1.3870]
mean_a = statistics.mean(a)
var_a = statistics.stdev(a)
eps = 1e-5
d = [(i - mean_a) / math.sqrt(var_a + eps) for i in a]
print(d)
# Output:
[-1.7048934056508998, 0.9571791768620398, -0.3094894774404756, 1.0572037062293356]
The normalized values are not the same as what I get from PyTorch's LayerNorm. Is there something wrong with the way I calculated it?
What you want is the variance, not the standard deviation (the standard deviation is the square root of the variance, and you are already taking a square root in your calculation of d, so you end up dividing by the square root of the standard deviation).
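To see the size of the mix-up on the example vector (values recomputed here for illustration, rounded):
import statistics

a = [-2.1918, 1.2574, -0.3838, 1.3870]
print(statistics.stdev(a))      # ~1.6788, the sample standard deviation (what the question passes to sqrt)
print(statistics.pvariance(a))  # ~2.1137, the population variance (what LayerNorm passes to sqrt)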
Also, LayerNorm uses the biased (population) variance, i.e. statistics.pvariance rather than statistics.variance. To reproduce PyTorch's results with the statistics module, use:
import math
import statistics

a = [-2.1918, 1.2574, -0.3838, 1.3870]
mean_a = statistics.mean(a)
var_a = statistics.pvariance(a)
eps = 1e-5
d = [(i - mean_a) / math.sqrt(var_a + eps) for i in a]
print(d)
# Output:
[-1.519391435327454, 0.8530327107709863, -0.2758152854532861, 0.942174010009754]
Another way to verify that the result is correct is to check the mean and (biased) variance of each normalized batch element:
[[torch.mean(i).item(), torch.var(i, unbiased=False).item()] for i in layer_norm(embedding)]
# Output:
[[1.9868215517249155e-08, 0.9999885559082031],
 [-1.9868215517249155e-08, 0.9999839663505554]]
This shows that the mean and variance of the normalized embeddings are (very close to) 0 and 1, as expected.
Relevant documentation (torch.nn.LayerNorm): "The standard-deviation is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False)."
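For reference, here is a minimal sketch (not part of the original answer) that reproduces the same normalization entirely in PyTorch, assuming the LayerNorm's elementwise affine parameters are still at their initial values (weight = 1, bias = 0):
import torch

batch, sentence_length, embedding_dim = 2, 3, 4
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = torch.nn.LayerNorm(embedding_dim)

# Normalize over the last dimension using the biased variance, as the docs describe.
mean = embedding.mean(dim=-1, keepdim=True)
var = embedding.var(dim=-1, unbiased=False, keepdim=True)
manual = (embedding - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(layer_norm(embedding), manual, atol=1e-6))  # True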