I am currently reading Neural Networks and Deep Learning and I am stuck on a problem: updating the code the author gives so that it uses L1 regularization instead of L2 regularization.
The original piece of code that uses L2 regularization is:
def update_mini_batch(self, mini_batch, eta, lmbda, n):
    """Update the network's weights and biases by applying gradient
    descent using backpropagation to a single mini batch. The
    ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
    learning rate, ``lmbda`` is the regularization parameter, and
    ``n`` is the total size of the training data set.
    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
where it can be seen that self.weights is updated using the L2 regularization term (1-eta*(lmbda/n))*w. For L1 regularization, I believe I just have to update that same line to reflect the L1 update rule from the book,

    w -> w - (eta*lmbda/n)*sgn(w) - eta*(dC/dw)

The book states that we can estimate the dC/dw term using the mini-batch average. This was a confusing statement to me, but I took it to mean that for each mini-batch I should use the average of nabla_w for each layer. This led me to make the following edits to the code:
def update_mini_batch(self, mini_batch, eta, lmbda, n):
    """Update the network's weights and biases by applying gradient
    descent using backpropagation to a single mini batch. The
    ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
    learning rate, ``lmbda`` is the regularization parameter, and
    ``n`` is the total size of the training data set.
    """
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    avg_nw = [np.array([[np.average(layer)] * len(layer[0])] * len(layer))
              for layer in nabla_w]
    self.weights = [(1-eta*(lmbda/n))*w-(eta)*nw
                    for w, nw in zip(self.weights, avg_nw)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
but the results I get are pretty much just noise with about 10% accuracy. Am I interpreting the statement wrong or is my code wrong? Any hints would be appreciated.
That's not correct.
Conceptually, L2 regularization says that we are going to scale W down geometrically by some decay factor after each training iteration. That way, the larger a value in W becomes, the more it is scaled back down, which keeps the individual values in W from growing too large.
Conceptually, L1 regularization says that we are going to decrease W linearly toward zero by some constant after each training iteration, without crossing zero: positive values are reduced to zero but not below it, and negative values are increased to zero but not above it. This zeros out the very small values in W, leaving only the values that make a significant contribution.
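Here is a minimal NumPy sketch (my own illustration, not code from the book) showing that difference on a toy weight vector; step stands in for eta*(lmbda/n):

import numpy as np

w = np.array([-2.0, -0.01, 0.005, 0.5, 3.0])  # toy weights
step = 0.1                                    # stands in for eta*(lmbda/n)

# L2: geometric shrinkage -- large weights lose the most in absolute terms
l2 = (1 - step) * w

# L1: constant shrinkage toward zero, clamped so no weight crosses zero
l1 = np.sign(w) * np.maximum(np.abs(w) - step, 0.0)

print(l2)  # [-1.8 -0.009 0.0045 0.45 2.7]  everything scaled by 0.9
print(l1)  # [-1.9 -0.    0.     0.4  2.9]  tiny weights snapped to 0

Note how L1 drives the two tiny weights exactly to zero while barely changing the large ones, whereas L2 takes the most off the large ones.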
Your updated weight line

self.weights = [(1-eta*(lmbda/n))*w-(eta)*nw
                for w, nw in zip(self.weights, avg_nw)]

does not implement that raw subtraction; it still multiplies w by a factor (geometric scaling) in (1-eta*(lmbda/n))*w.
Instead, implement some function reduceLinearlyToZero that takes w and eta*(lmbda/n) and returns np.sign(w) * np.maximum(np.abs(w) - eta*(lmbda/n), 0). Note that it is abs(w) minus the step, not abs of the difference, and the NumPy functions are needed because each w is an array, so a scalar max and a Python conditional on w would not work elementwise:

def reduceLinearlyToZero(w, eta, lmbda, n):
    # move each weight toward zero by eta*lmbda/n, clamping at zero
    return np.sign(w) * np.maximum(np.abs(w) - eta*(lmbda/n), 0)

Then use it like this:
self.weights = [ reduceLinearlyToZero(w, eta, lmbda, n) - (eta/len(mini_batch))*nw
                 for w, nw in zip(self.weights, nabla_w) ]

or possibly

self.weights = [ reduceLinearlyToZero(w - (eta/len(mini_batch))*nw, eta, lmbda, n)
                 for w, nw in zip(self.weights, nabla_w) ]

These iterate over the summed nabla_w, so eta/len(mini_batch) already gives the mini-batch-average gradient step; the per-layer avg_nw construction is not needed.
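As a quick sanity check, here is a self-contained sketch (my own; the weight values and hyperparameters are made up, and the helper is repeated so the snippet runs on its own) showing that the helper zeros out small weights and leaves large ones nearly untouched:

import numpy as np

def reduceLinearlyToZero(w, eta, lmbda, n):
    # move each weight toward zero by eta*lmbda/n, clamping at zero
    return np.sign(w) * np.maximum(np.abs(w) - eta*(lmbda/n), 0)

w = np.array([[0.8, -0.001],
              [0.0005, -1.5]])
print(reduceLinearlyToZero(w, eta=3.0, lmbda=0.5, n=1000))  # step = 0.0015
# [[ 0.7985 -0.    ]
#  [ 0.     -1.4985]]

Of the two update variants above, the first applies the L1 shrinkage to the old weights and then takes the gradient step, which matches the update rule quoted from the book; the second shrinks after the gradient step. For small eta the two behave almost identically.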