I'm learning about regularization in neural networks from the deeplearning.ai course. In the dropout regularization lecture, the professor says that if dropout is applied, the calculated activation values will be smaller than when dropout is not applied (as at test time). So we need to scale the activations during training in order to keep the testing phase simpler.
I understand this fact, but I don't understand how the scaling is done. Here is a code sample that implements inverted dropout.
import numpy as np

keep_prob = 0.8  # 0 <= keep_prob <= 1
# this code is only for layer 3, whose activations are stored in a3
# entries of the random matrix below keep_prob become True (kept): ~80% stay, ~20% dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units
# scale a3 up so the expected value of the output is not reduced - to solve the scaling problem
a3 = a3 / keep_prob
In the above code, why are the activations divided by 0.8, i.e. the probability of keeping a node in the layer (keep_prob)? A numerical example would help.
I found the answer myself after spending some time understanding inverted dropout. Here is the intuition:
We keep the neurons in a layer with probability keep_prob. Say keep_prob = 0.6. That means roughly 40% of the neurons in the layer are shut down. If the original output of the layer, before shutting down 40% of the neurons, was x, then after applying 40% dropout the expected output is reduced by 0.4 * x, so it becomes x - 0.4x = 0.6x.
To maintain the original output (expected value), we need to divide the output by keep_prob (0.6 here). For example, if the original output was x = 100, dropout leaves an expected 0.6 * 100 = 60, and dividing by 0.6 brings it back to 100.
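To see this numerically, here is a small sketch (not from the course; the layer shape and the value 100 are made up for illustration) that applies the mask many times to a fixed activation matrix and compares the average output with and without the division by keep_prob:

import numpy as np

np.random.seed(0)
keep_prob = 0.6
a3 = np.ones((4, 5)) * 100.0  # hypothetical layer-3 activations, all equal to 100

dropped_mean, scaled_mean = [], []
for _ in range(10000):
    d3 = np.random.rand(*a3.shape) < keep_prob       # keep each unit with probability 0.6
    masked = a3 * d3                                 # ~40% of units zeroed out
    dropped_mean.append(masked.mean())               # without the correction
    scaled_mean.append((masked / keep_prob).mean())  # inverted dropout correction

print(np.mean(dropped_mean))  # ~60: the expected value shrank to keep_prob * 100
print(np.mean(scaled_mean))   # ~100: dividing by keep_prob restores it

Scaling at training time (rather than at test time) is what makes this "inverted" dropout: the forward pass at test time then needs no extra change.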