I am working on a convolutional neural network in Python and am currently implementing backpropagation. For now I am only implementing it for the output layer, and I have just written the code that calculates the derivatives of the bias values.
Here is the code:
def back_prop(self, learning_rate, d_values = None, desired_outputs = None):
    self.d_biases = np.zeros((desired_outputs.size, self.biases.size))
    self.d_weights = np.zeros((desired_outputs.size, *self.weights.shape))
    #checks if the layer is an output layer to determine the method for finding the derivatives
    if self.is_output_layer:
        #finds the derivatives of all of the biases in the output layer with respect to the loss
        self.d_biases = np.array([[float(softmax(self.output)[i][x] - 1) if desired_outputs[i] == x else float(softmax(self.output)[i][x]) for x in range(self.biases.size)] for i in range(desired_outputs.size)])
        self.d_biases *= learning_rate
        self.d_biases = np.sum(self.d_biases, axis = 0, keepdims = True)
        #updates parameters
        self.biases += self.d_biases
desired_outputs is an array of the class indices for a training batch, and self.output is a 2D array organized as batches x classes. To calculate the loss I use cross-entropy; here is the code:
def check_loss(self, target_indices):
    target_percents = self.output[range(len(self.output)), target_indices]
    target_percents = np.clip(target_percents, 1e-7, 1 - 1e-7)
    self.losses = -np.log(target_percents)
    return np.mean(self.losses)
target_indices will be the same variable as desired_outputs.
Also, check_loss() is in a different class than back_prop(), but self.output holds the same values in both.
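To show what check_loss() computes, here is a small standalone sketch of the same calculation on a made-up batch of 3 samples and 3 classes (the probabilities and class indices are invented for the example, not taken from my model):

import numpy as np

# made-up softmax outputs for a batch of 3 samples and 3 classes
output = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])
target_indices = np.array([0, 1, 2])   # correct class for each sample

# same steps as check_loss: take the predicted probability of the correct
# class, clip it away from 0 and 1, and average the negative logs
target_percents = output[range(len(output)), target_indices]
target_percents = np.clip(target_percents, 1e-7, 1 - 1e-7)
losses = -np.log(target_percents)
print(np.mean(losses))   # roughly 0.499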
To test the code I am using a simple spiral data set from the internet, with an x and a y input, 3 classes, and no hidden layers, but every time I propagate a batch through the model the loss increases.
I believe it may have something to do with how I am calculating the derivatives, but I checked and, to my understanding, they are correct.
To my knowledge, the derivative of the loss with respect to a bias in the output layer is the softmax activation of that neuron minus the target (one-hot) value for that neuron, but I could be wrong.
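For reference, here is a minimal way to check that formula numerically: since a bias adds directly to its logit, the bias gradient equals the logit gradient, so softmax(z) minus the one-hot target should match a finite-difference estimate of the cross-entropy loss. The logits and target index below are made up; this snippet is standalone and not part of my model code:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

z = np.array([1.0, -0.5, 0.3])   # made-up logits for one sample
target = 2                        # made-up correct class index

# analytic gradient: softmax(z) minus the one-hot target vector
analytic = softmax(z)
analytic[target] -= 1

# central finite-difference estimate of d(loss)/d(z_j)
eps = 1e-5
numeric = np.zeros_like(z)
for j in range(z.size):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    numeric[j] = (cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True: the two gradients agree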
Any clue what is wrong?
Edit:
I changed the code to subtract the gradient from the biases and changed how I calculate the gradients, and now the loss decreases as it is supposed to.
Here is the edited code:
def back_prop(self, learning_rate, d_values = None, desired_outputs = None):
    #checks if the layer is an output layer to determine the method for finding the derivatives
    if self.is_output_layer:
        #finds the derivatives of all of the biases in the output layer with respect to the loss
        self.d_biases = softmax(self.output)
        self.d_biases[range(desired_outputs.size), desired_outputs] -= 1
        self.d_biases *= learning_rate
        self.d_biases = np.sum(self.d_biases, axis = 0, keepdims = True)
        #finds the derivatives of all of the weights in the output layer with respect to the loss
        #updates parameters
        self.biases -= self.d_biases
        return self.biases
I am open to any other suggestions for improving it, but otherwise it is working as it is supposed to.
The issue lies in how you compute the gradient of the output layer during backpropagation. When softmax activation is followed by cross-entropy loss, the gradient with respect to the logits simplifies to the predicted probabilities (the softmax of self.output) minus the one-hot encoded ground-truth labels. Your original implementation manually iterates over every class of every sample, reapplying softmax each time, which is both inefficient and prone to numerical error. Instead, compute the softmax once, subtract 1 at the target class indices (as your edited code does with self.d_biases[range(desired_outputs.size), desired_outputs] -= 1), and normalize by the batch size. That gives the correct gradient for backpropagation. Additionally, make sure the weights and biases are updated by subtracting this gradient scaled by the learning rate, not adding it as the original code did. With those corrections the model learns properly and the loss decreases during training.
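For reference, a minimal sketch of the update step described above, assuming output holds the raw pre-softmax logits for a batch and softmax is applied row-wise; the names and values here are illustrative, not the exact ones in the original class:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

# made-up batch: 4 samples, 3 classes
output = np.random.randn(4, 3)            # pre-softmax logits
desired_outputs = np.array([0, 2, 1, 2])  # correct class per sample
biases = np.zeros((1, 3))
learning_rate = 0.1

# gradient of cross-entropy w.r.t. the logits: probabilities minus one-hot targets
d_logits = softmax(output)
d_logits[range(desired_outputs.size), desired_outputs] -= 1
d_logits /= desired_outputs.size          # normalize over the batch size

# bias gradient: sum the per-sample gradients (already averaged by the division above)
d_biases = np.sum(d_logits, axis=0, keepdims=True)
biases -= learning_rate * d_biases        # gradient descent step: subtract, don't add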