Tags: python, neural-network, gradient-descent, stochastic-gradient

Are weights/biases only updated once per mini-batch?


I'm following a neural network tutorial, and I have a question about the function that updates the weights.

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]                #Initialize bias matrix with 0's
    nabla_w = [np.zeros(w.shape) for w in self.weights]               #Initialize weights matrix with 0's
    for x, y in mini_batch:                                           #For tuples in one mini_batch
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)            #Calculate partial derivatives of bias/weights with backpropagation, set them to delta_nabla_b
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)] #Generate a list with partial derivatives of bias of every neuron
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)] #Generate a list with partial derivatives of weights for every neuron
    self.weights = [w-(eta/len(mini_batch))*nw                        #Update weights according to the update rule
                    for w, nw in zip(self.weights, nabla_w)]          #Zip the current weights with the accumulated gradients, then compute the update from each pair
    self.biases = [b-(eta/len(mini_batch))*nb                         #Update biases according to update rule
                   for b, nb in zip(self.biases, nabla_b)]

What I don't understand here is that a for loop is used to compute nabla_b and nabla_w (the partial derivatives with respect to the biases/weights) with backpropagation for every training example in the mini-batch, but the weights/biases are only updated once.

To me it seems like, say we have a mini-batch of size 10, we compute nabla_b and nabla_w 10 times, and after the for loop finishes the weights and biases are updated. But doesn't the for loop reset the nabla_b and nabla_w lists every time? Why don't we update self.weights and self.biases inside the for loop?

The neural network works perfectly, so I think I am making a small thinking mistake somewhere.

FYI: the relevant part of the tutorial I am following can be found here


Solution

  • The key to understanding how this loop adds to the biases and weights with every training example is to note the evaluation order in Python. Specifically, everything to the right of an = sign is evaluated before it is assigned to the variable to the left of the = sign.

    This is a simpler example that might be easier to understand:

    nabla_b = [0, 0, 0, 0, 0]
    for x in range(10):
        delta_nabla_b = [-1, 2, -3, 4, -5]
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
    

    In this example, we only have five scalar biases and a constant gradient for each. At the end of this loop, what is nabla_b? Consider the comprehension expanded using the definition of zip, and remembering that everything to the right of the = sign is evaluated before it is written to the variable name on the left:

    nabla_b = [0, 0, 0, 0, 0]
    for x in range(10):
        # nabla_b is defined outside of this loop
        delta_nabla_b = [-1, 2, -3, 4, -5]
    
        # expand the comprehension and the zip() function
        temp = []
        for i in range(len(nabla_b)):
            temp.append(nabla_b[i] + delta_nabla_b[i])
    
        # now that the RHS is calculated, set it to the LHS
        nabla_b = temp
    

    At this point it should be clear that each element of nabla_b is being summed with each corresponding element of delta_nabla_b in the comprehension, and that result is overwriting nabla_b for the next iteration of the loop.
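
    You can confirm this by running the simple loop yourself: after ten iterations, each element of nabla_b has had its constant gradient added ten times.

    ```python
    nabla_b = [0, 0, 0, 0, 0]
    for x in range(10):
        delta_nabla_b = [-1, 2, -3, 4, -5]
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]

    print(nabla_b)  # [-10, 20, -30, 40, -50]
    ```

    The name nabla_b is rebound to a *new* list on each iteration, but that new list is built from the old one, so nothing is lost.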

    So in the tutorial example, nabla_b and nabla_w are sums of partial derivatives that have a gradient added to them once per training example in the minibatch. They technically are reset for every training example, but they are reset to their previous value plus the gradient, which is exactly what you want. A more clear (but less concise) way to write this might have been:

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            # expanding the comprehensions
            for i in range(len(nabla_b)):
                nabla_b[i] += delta_nabla_b[i]      # set the value of each element directly
            for i in range(len(nabla_w)):
                nabla_w[i] += delta_nabla_w[i]
        self.weights = [w-(eta/len(mini_batch))*nw  # note that this comprehension uses the same trick
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
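
    As a small numeric sanity check (the gradient values below are made up, not from the tutorial), accumulating the per-example gradients and then scaling by eta/len(mini_batch) in a single update is the same as stepping once by the average gradient:

    ```python
    import numpy as np

    # Hypothetical per-example gradients for one weight matrix, mini-batch of 3
    grads = [np.array([[1.0, 2.0]]), np.array([[3.0, -1.0]]), np.array([[0.0, 4.0]])]
    eta = 0.5
    w = np.array([[10.0, 10.0]])

    # Accumulate as in update_mini_batch, then apply one update at the end
    nabla_w = np.zeros_like(w)
    for g in grads:
        nabla_w += g
    w_once = w - (eta / len(grads)) * nabla_w

    # Equivalent: a single step along the mean gradient of the mini-batch
    w_avg = w - eta * np.mean(grads, axis=0)

    print(np.allclose(w_once, w_avg))  # True
    ```

    This is why updating only once per mini-batch is correct: the single update uses the gradient averaged over all the examples, rather than ten separate (noisier) steps.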