Tags: python, neural-network, gradient-descent, regularized, mini-batch

Performing L1 regularization on a mini-batch update


I am currently reading Neural Networks and Deep Learning and I am stuck on a problem. The exercise is to modify the code the author gives so that it uses L1 regularization instead of L2 regularization.

The original piece of code that uses L2 regularization is:

def update_mini_batch(self, mini_batch, eta, lmbda, n):
    """Update the network's weights and biases by applying gradient
    descent using backpropagation to a single mini batch.  The
    ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
    learning rate, ``lmbda`` is the regularization parameter, and
    ``n`` is the total size of the training data set.

    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

where it can be seen that self.weights is updated using the L2 regularization term. For L1 regularization, I believe that I just have to update that same line to reflect the book's L1 weight-update rule:

    w → w - (eta*lmbda/n)*sgn(w) - eta*∂C/∂w

It is stated in the book that we can estimate the

    ∂C/∂w

term using the mini-batch average. This was a confusing statement to me, but I took it to mean that, for each layer, I should use the average of nabla_w. This led me to make the following edits to the code:

def update_mini_batch(self, mini_batch, eta, lmbda, n):
    """Update the network's weights and biases by applying gradient
    descent using backpropagation to a single mini batch.  The
    ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
    learning rate, ``lmbda`` is the regularization parameter, and
    ``n`` is the total size of the training data set.

    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    avg_nw = [np.array([[np.average(layer)] * len(layer[0])] * len(layer))
              for layer in nabla_w]
    self.weights = [(1-eta*(lmbda/n))*w-(eta)*nw
                    for w, nw in zip(self.weights, avg_nw)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

but the results I get are pretty much just noise with about 10% accuracy. Am I interpreting the statement wrong or is my code wrong? Any hints would be appreciated.


Solution

  • That's not correct.

    Conceptually, L2 regularization says that we geometrically scale W down by some decay factor after each training iteration, so the larger a weight becomes, the more it shrinks. This keeps the individual values in W from becoming too large.

    Conceptually, L1 regularization says that we linearly decrease W toward zero by some constant after each training iteration, without crossing zero: positive weights are reduced toward zero but not below it, and negative weights are increased toward zero but not above it. This zeros out very small values of W, leaving only the weights that make a significant contribution.
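    The difference between the two decay rules can be sketched on a few scalar weights (a toy illustration, not the book's code; the decay constant 0.1 stands in for eta*(lmbda/n) and is arbitrary):

    ```python
    import numpy as np

    w = np.array([2.0, 0.05, -0.05, -2.0])
    decay = 0.1  # stands in for eta*(lmbda/n)

    # L2: geometric scaling -- large weights shrink by more
    l2_step = (1 - decay) * w
    # → [ 1.8, 0.045, -0.045, -1.8]

    # L1: linear shrink toward zero, clipped so no weight crosses zero
    l1_step = np.sign(w) * np.maximum(np.abs(w) - decay, 0.0)
    # → [ 1.9, 0.0, -0.0, -1.9]  -- the tiny weights are zeroed out
    ```

    Note how L1 zeroed out the two small weights entirely while barely touching the large ones, whereas L2 shrank everything proportionally.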

    Your second equation

    self.weights = [(1-eta*(lmbda/n))*w-(eta)*nw
                    for w, nw in zip(self.weights, avg_nw)]
    

    does not implement plain subtraction; it still multiplies w (geometric scaling) in the (1-eta*(lmbda/n))*w term.

    Implement some function reduceLinearlyToZero that takes w along with eta, lmbda, and n and shrinks w toward zero by eta*(lmbda/n), clamping at zero: sign(w) * max( abs(w) - eta*(lmbda/n), 0 ). Since the weights here are NumPy arrays, use the vectorized equivalents:

    def reduceLinearlyToZero(w, eta, lmbda, n):
        # move each weight toward zero by eta*(lmbda/n), never crossing zero
        return np.sign(w) * np.maximum(np.abs(w) - eta * (lmbda / n), 0.0)
    
    
    self.weights = [ reduceLinearlyToZero(w, eta, lmbda, n) - (eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    

    or possibly

    self.weights = [ reduceLinearlyToZero(w - (eta/len(mini_batch))*nw, eta, lmbda, n)
                    for w, nw in zip(self.weights, nabla_w)]