tensorflow, machine-learning, neural-network, gradient-descent, mini-batch

Confused about the usage of dropout in mini-batch gradient descent


My question is at the end.

An example CNN is trained with mini-batch GD and uses dropout in the last fully-connected layer (line 60) as

fc1 = tf.layers.dropout(fc1, rate=dropout, training=is_training)

At first I thought tf.layers.dropout or tf.nn.dropout randomly set neurons to zero in whole columns, i.e. the same neurons for every sample. But I recently found that this is not the case. The piece of code below shows what the dropout does. I use fc0 as a 4-sample x 10-feature matrix, and fc as its dropped-out version.

import tensorflow as tf
import numpy as np

# fc0: a 4-sample x 10-feature matrix; fc: its dropped-out version
fc0 = tf.random_normal([4, 10])
# note: the second argument of tf.nn.dropout is keep_prob, not the drop rate
fc = tf.nn.dropout(fc0, 0.5)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

a, b = sess.run([fc0, fc])
# rows 1-4 of oo.txt hold the original matrix, rows 5-8 the dropped-out one
np.savetxt("oo.txt", np.vstack((a, b)), fmt="%.2f", delimiter=",")

And in the output oo.txt (original matrix: lines 1-4, dropped-out matrix: lines 5-8):

0.10,1.69,0.36,-0.53,0.89,0.71,-0.84,0.24,-0.72,-0.44
0.88,0.32,0.58,-0.18,1.57,0.04,0.58,-0.56,-0.66,0.59
-1.65,-1.68,-0.26,-0.09,-1.35,-0.21,1.78,-1.69,-0.47,1.26
-1.52,0.52,-0.99,0.35,0.90,1.17,-0.92,-0.68,-0.27,0.68
0.20,0.00,0.71,-0.00,0.00,0.00,-0.00,0.47,-0.00,-0.87
0.00,0.00,0.00,-0.00,3.15,0.07,1.16,-0.00,-1.32,0.00
-0.00,-3.36,-0.00,-0.17,-0.00,-0.42,3.57,-3.37,-0.00,2.53
-0.00,1.05,-1.99,0.00,1.80,0.00,-0.00,-0.00,-0.55,1.35

My understanding of "proper" dropout is that it knocks out the same p% of units for every sample in a mini-batch or batch gradient descent step, and that back-propagation then updates the weights and biases of that "thinned network". However, in the example's implementation, the neurons are dropped independently for each sample in the batch, as illustrated by lines 5 to 8 of oo.txt, so each sample sees a different "thinned network".
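
To double-check this, the saved oo.txt can be inspected with plain numpy (this is just a sanity check on the file written above; the values are only accurate to two decimals because of the "%.2f" format):

import numpy as np

# rows 0-3: the original fc0, rows 4-7: the dropped-out fc, as saved above
data = np.loadtxt("oo.txt", delimiter=",")
orig, dropped = data[:4], data[4:]

# each row has its own zero pattern, i.e. a different "thinned network" per sample
print((dropped != 0).astype(int))

# surviving entries equal 2 * original, i.e. scaled by 1/keep_prob for keep_prob = 0.5
kept = dropped != 0
print(np.allclose(dropped[kept], 2 * orig[kept], atol=0.02))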

For comparison, in the stochastic gradient descent case, samples are fed into the neural network one by one, and in each iteration the weights of the "thinned network" introduced by tf.layers.dropout are updated.

My question is: in mini-batch or batch training, shouldn't dropout be implemented so that the same neurons are knocked out for all samples in one batch? Maybe by applying one mask to all samples of the input batch at each iteration? Something like:

ones = tf.ones_like(fc1[:1, :])   # a 1 x N all-ones tensor (N = number of neurons)
# mask: a 1 x N tensor of 0s and 1/(1 - dropout)s; multiplying fc1 by it
# broadcasts along the sample axis, so the same neurons are dropped for every sample
mask = tf.layers.dropout(ones, rate=dropout, training=is_training)
fc1 = tf.multiply(fc1, mask)

Now I'm thinking the dropout strategy in the example may be a weighted way of updating the weights of a given neuron: if a neuron is kept in only 1 out of 10 samples in a mini-batch, its weights are updated by alpha * 1/10 * (y_k_hat - y_k) * x_k, compared with alpha * 1/10 * sum[(y_k_hat - y_k) * x_k] for the weights of another neuron that is kept in all 10 samples?

(the screenshot from here)
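
To make the bookkeeping I have in mind concrete, here is a toy numpy sketch (a made-up single linear layer with a squared-error loss, not the CNN from the example):

import numpy as np

# toy linear model y_hat = x_dropped @ w, squared-error loss averaged over the batch
rng = np.random.RandomState(0)
batch, n_features = 10, 4
x, y, w = rng.randn(batch, n_features), rng.randn(batch), rng.randn(n_features)

keep_prob = 0.5
# a different 0 / (1/keep_prob) mask for every sample, as tf.nn.dropout does
mask = (rng.rand(batch, n_features) < keep_prob) / keep_prob
x_dropped = x * mask

err = x_dropped @ w - y                  # (y_k_hat - y_k) for each sample k
loss = 0.5 * np.mean(err ** 2)

# d(loss)/dw: each weight only gets contributions from the samples where its
# neuron was kept; a neuron kept in 1 of 10 samples is updated by 1/10 of that
# single sample's term (times the 1/keep_prob factor of inverted dropout)
grad = (x_dropped * err[:, None]).mean(axis=0)
print(loss, grad)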


Solution

  • Dropout is commonly used to prevent overfitting, which in this case would show up as a huge weight applied to one of the neurons. By randomly setting a neuron to 0 from time to time, you force the network to use more neurons when determining the outcome. For this to work well you should drop different neurons for each example, so that the gradient you compute is more similar to the one you would get without dropout.

    If you were to drop the same neurons for each example in the batch, my guess is that you would get a less stable gradient (which might not matter for your application).

    In addition, dropout up-scales the remaining values to keep the average activation at about the same level. Without this, the network would learn wrong biases or would over-saturate when you turn dropout off.

    If you still want the same neurons to be dropped across the whole batch, apply dropout to an all-ones tensor of shape (1, num_neurons) and then multiply it with the activations.
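
    A minimal TF1-style sketch of this idea (the shapes and the 0.5 rate are only for illustration, not the code from the question's example):

    import tensorflow as tf

    fc1 = tf.random_normal([4, 10])                 # 4 samples x 10 neurons
    is_training = tf.placeholder_with_default(True, shape=())

    ones = tf.ones_like(fc1[:1, :])                 # all-ones tensor of shape (1, 10)
    # the mask entries are 0 or 1/(1 - rate), so the usual up-scaling is preserved
    mask = tf.layers.dropout(ones, rate=0.5, training=is_training)
    # broadcasting along the batch axis drops the same neurons for every sample
    fc1 = tf.multiply(fc1, mask)

    with tf.Session() as sess:
        print(sess.run(fc1))                        # whole columns are zeroed out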