I'm following a neural networks tutorial, and I have a question about the function that updates the weights.
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]   # Initialize bias gradients with zeros
    nabla_w = [np.zeros(w.shape) for w in self.weights]  # Initialize weight gradients with zeros
    for x, y in mini_batch:  # For each tuple (x, y) in one mini_batch
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)  # Compute partial derivatives of biases/weights with backpropagation
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]  # Generate a list with partial derivatives of the bias of every neuron
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]  # Generate a list with partial derivatives of the weights of every neuron
    self.weights = [w-(eta/len(mini_batch))*nw  # Update weights according to the update rule
                    for w, nw in zip(self.weights, nabla_w)]  # The author zips the two lists he needs (current weights and partial derivatives), then computes with them
    self.biases = [b-(eta/len(mini_batch))*nb  # Update biases according to the update rule
                   for b, nb in zip(self.biases, nabla_b)]
What I don't understand here is that a for loop is used to compute nabla_b and nabla_w (the partial derivatives with respect to the biases/weights) with backpropagation for every training example in the mini-batch, but the weights/biases are only updated once.
To me it seems like, say we have a mini-batch of size 10, we compute nabla_b and nabla_w 10 times, and after the for loop finishes the weights and biases are updated. But doesn't the for loop reset the nabla_b and nabla_w lists every time? Why don't we update self.weights and self.biases inside the for loop?
The neural network works perfectly, so I think I am making a small thinking mistake somewhere.
FYI: the relevant part of the tutorial I am following can be found here
The key to understanding how this loop adds to the biases and weights with every training example is to note the evaluation order in Python. Specifically, everything to the right of an = sign is evaluated before it is assigned to the variable to the left of the = sign.
This is a simpler example that might be easier to understand:
nabla_b = [0, 0, 0, 0, 0]
for x in range(10):
    delta_nabla_b = [-1, 2, -3, 4, -5]
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
In this example, we only have five scalar biases and a constant gradient for each. At the end of this loop, what is nabla_b? Consider the comprehension expanded using the definition of zip, and remember that everything to the right of the = sign is evaluated before it is written to the variable name on the left:
nabla_b = [0, 0, 0, 0, 0]
for x in range(10):
    # nabla_b is defined outside of this loop
    delta_nabla_b = [-1, 2, -3, 4, -5]

    # expand the comprehension and the zip() function
    temp = []
    for i in range(len(nabla_b)):
        temp.append(nabla_b[i] + delta_nabla_b[i])

    # now that the RHS is calculated, assign it to the LHS
    nabla_b = temp
At this point it should be clear that each element of nabla_b is being summed with the corresponding element of delta_nabla_b in the comprehension, and that result overwrites nabla_b for the next iteration of the loop.
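For concreteness, you can run the toy loop yourself and print the result; the final values below are not from the original example, just the straightforward consequence of adding the same constant gradient ten times:

nabla_b = [0, 0, 0, 0, 0]
for x in range(10):
    delta_nabla_b = [-1, 2, -3, 4, -5]
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

# each element is 10x its per-step value
print(nabla_b)  # [-10, 20, -30, 40, -50]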
So in the tutorial example, nabla_b and nabla_w are running sums of partial derivatives that have a gradient added to them once per training example in the mini-batch. They technically are reset for every training example, but they are reset to their previous value plus the new gradient, which is exactly what you want. The final update then subtracts (eta/len(mini_batch)) times these sums, i.e. the learning rate times the average gradient over the mini-batch. A clearer (but less concise) way to write this might have been:
def update_mini_batch(self, mini_batch, eta):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        # expanding the comprehensions
        for i in range(len(nabla_b)):
            nabla_b[i] += delta_nabla_b[i]  # set the value of each element directly
        for i in range(len(nabla_w)):
            nabla_w[i] += delta_nabla_w[i]
    self.weights = [w-(eta/len(mini_batch))*nw  # note that this comprehension uses the same trick
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
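If you want to convince yourself that the two accumulation styles really are equivalent, here is a standalone sketch (not from the tutorial; the layer shapes and random gradients are made up purely for illustration) that accumulates the same hypothetical per-layer gradients both ways and checks that the sums match:

import numpy as np

# Hypothetical per-layer gradients for a toy 2-layer network, 10 training examples
rng = np.random.default_rng(0)
mini_batch_grads = [[rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
                    for _ in range(10)]

# Style 1: rebuild the list each iteration with a comprehension (as in the tutorial)
sums_comprehension = [np.zeros((3, 2)), np.zeros((1, 3))]
for grads in mini_batch_grads:
    sums_comprehension = [s + g for s, g in zip(sums_comprehension, grads)]

# Style 2: add into the existing arrays in place
sums_in_place = [np.zeros((3, 2)), np.zeros((1, 3))]
for grads in mini_batch_grads:
    for i in range(len(sums_in_place)):
        sums_in_place[i] += grads[i]

# Both styles accumulate the same per-layer sums
assert all(np.allclose(a, b) for a, b in zip(sums_comprehension, sums_in_place))
print("comprehension and in-place accumulation agree")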