After reading the batchnorm research paper and various descriptions of it in forums, I am still not clear on how the basic computations are performed. The core of my question is this: a vector is normalized with respect to the set to which it belongs, so the vectors input to layer 1 can be normalized using the batch selected from the training set. Each input vector to the next layer must likewise be normalized with respect to the set it belongs to, but how do we get hold of that set?
More precisely, let B1 = {X1j : j = 1..n} be the mini-batch of input vectors selected from the training set, and let BN(x, B) denote the normalization of a vector x with respect to a set B.
BN(X1j, B1), j = 1..n, can be calculated because we know B1; these are the inputs to layer 1.
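For concreteness, here is a minimal NumPy sketch of what I understand BN(x, B) to compute for each vector in a batch; the function name and the eps value are my own choices for illustration:

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    """Normalize each feature of X over the batch dimension.

    X has shape (n, d): n vectors in the batch, d features each.
    The scale and shift step (gamma, beta) is omitted, as in the question.
    """
    mu = X.mean(axis=0)    # per-feature mean over the batch
    var = X.var(axis=0)    # per-feature variance over the batch
    return (X - mu) / np.sqrt(var + eps)
```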
We need BN(X2j, B2), j = 1..n, as inputs to layer 2, but B2 is not readily available. My question is how to obtain B2, B3, and so on.
We could pass each BN(X1j, B1), j = 1..n, through layer 1 and record the outputs as X2j (that collection would be B2). We would then calculate BN(X2j, B2) for each j by normalizing with respect to B2 and feed the results to layer 2, and so on, so the forward pass would consist of many such steps; a sketch of this is given below. For simplicity, I have ignored the scale and shift step, as it is not relevant to my question.
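A sketch of that forward pass, reusing the batch_norm helper from above; the layer functions here are hypothetical stand-ins for whatever affine map and nonlinearity each layer applies:

```python
def forward_pass(B1, layers):
    """Forward pass over a whole mini-batch, normalizing at every layer.

    B1     : array of shape (n, d), the mini-batch of input vectors.
    layers : list of functions; each maps a batch of activations to the
             next batch of activations (affine map plus nonlinearity).
    """
    B = B1
    for layer in layers:
        X = batch_norm(B)   # normalize with respect to the current batch
        B = layer(X)        # the outputs form the next batch (B2, B3, ...)
    return B

# Example with two made-up ReLU layers and a batch of 32 input vectors
rng = np.random.default_rng(0)
W1 = rng.normal(size=(10, 20))
W2 = rng.normal(size=(20, 5))
layers = [lambda X: np.maximum(X @ W1, 0),
          lambda X: np.maximum(X @ W2, 0)]
out = forward_pass(rng.normal(size=(32, 10)), layers)
print(out.shape)   # (32, 5)
```

This follows the ordering described in the question (normalize the batch, then apply the layer); it is meant only to pin down how B2, B3, ... arise, not the paper's exact placement of BN relative to the activation.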
Being new to this topic, I would appreciate an expert opinion on it.
Going by your notation, the collection BN(X1j, B1), j = 1..n, has size n, because there is one output for each j. The activation function is then applied to these, giving n new values that are passed to the next layer. This is for a single neuron in the case of a feed-forward network.
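To illustrate that point, a tiny sketch for one neuron, with made-up weights and a sigmoid chosen only as an example activation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# n = 4 normalized input vectors with d = 3 features each
Xn = np.array([[ 0.5, -1.0,  0.2],
               [-0.3,  0.8, -0.6],
               [ 1.1, -0.2,  0.4],
               [-1.3,  0.4,  0.0]])
w = np.array([0.1, -0.4, 0.7])   # hypothetical weights of one neuron
b = 0.05                          # hypothetical bias
a = sigmoid(Xn @ w + b)           # one activation per input vector
print(a.shape)                    # (4,) -- n values passed to the next layer
```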