I'm building a neural network from scratch using only Python and NumPy. It's meant for classifying the MNIST dataset. I got everything to run, but the network isn't really learning: at epoch 0 its accuracy is about 12%, it increases to 14% after 20 epochs, but then gradually drops back to around 12% after 40 epochs. So it's clear that there's something wrong with my backpropagation (and yes, I tried increasing the epochs to 150, but I still get the same results).
I followed this video, but I handled the dimensions differently, which led to the code being different. He made it so that the rows are the features and the columns are the samples, but I did the opposite, so while backpropagating I had to transpose some arrays to make his algorithm compatible (I think this might be the reason why it's not working).
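Here is a minimal sketch (placeholder arrays, not my actual variables) of the shape difference between the two layouts:
import numpy as np

m = 4                             # hypothetical batch size
x_rows = np.zeros((m, 784))       # my layout: samples as rows
W_rows = np.zeros((784, 10))
print((x_rows @ W_rows).shape)    # (4, 10)

x_cols = np.zeros((784, m))       # the video's layout: features as rows
W_cols = np.zeros((10, 784))
print((W_cols @ x_cols).shape)    # (10, 4) -- the transpose of the layout above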
Loading the data:
# Imports assumed for the snippets below (mnist here is the Keras MNIST loader):
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255, x_test / 255
x_train, x_test = x_train.reshape(len(x_train), 28 * 28), x_test.reshape(len(x_test), 28 * 28)
print(x_train.shape)  # (60000, 784)
print(x_test.shape)   # (10000, 784)
Here's the meat of the model:
W1 = np.random.randn(784, 10)
b1 = np.random.randn(10)
W2 = np.random.randn(10, 10)
b2 = np.random.randn(10)

def relu(x, dir=False):
    if dir: return x > 0
    return np.maximum(x, 0)

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=1, keepdims=True)

def one_hot_encode(y):
    y_hot = np.zeros(shape=(len(y), 10))
    for i in range(len(y)):
        y_hot[i][y[i]] = 1
    return y_hot
def loss_function(predictions, true):
    return predictions - true

def predict(x):
    Z1 = x.dot(W1) + b1
    A1 = relu(Z1)
    Z2 = A1.dot(W2) + b2
    A2 = softmax(Z2)
    # The final prediction is A2 at index 3 or -1:
    return Z1, A1, Z2, A2

def get_accuracy(predictions, Y):
    guesses = predictions.argmax(axis=1)
    average = 0
    i = 0
    while i < len(guesses):
        if guesses[i] == Y[i]:
            average += 1
        i += 1
    percent = (average / len(guesses)) * 100
    return percent
def train(data, labels, epochs=40, learning_rate=0.1):
    for i in range(epochs):
        labels_one_hot = one_hot_encode(labels)
        # Forward:
        m = len(labels_one_hot)
        Z1, A1, Z2, A2 = predict(data)
        # I think the error is in this chunk:
        # backwards pass:
        dZ2 = A2 - labels_one_hot
        dW2 = 1 / m * dZ2.T.dot(A1)
        db2 = 1 / m * np.sum(dZ2, axis=1)
        dZ1 = W2.dot(dZ2.T).T * relu(Z1, dir=True)
        dW1 = 1 / m * dZ1.T.dot(data)
        db1 = 1 / m * np.sum(dZ1)
        # Update parameters:
        update(learning_rate, dW1, db1, dW2, db2)
        print("Iteration: ", i + 1)
        predictions = predict(data)[-1]  # item at -1 is the final prediction.
        print(get_accuracy(predictions, labels))

def update(learning_rate, dW1, db1, dW2, db2):
    global W1, b1, W2, b2
    W1 = W1 - learning_rate * dW1.T  # I have to transpose it here.
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
train(x_train, y_train)
predictions = predict(x_test)[-1]
print(get_accuracy(predictions, y_test)) # The result is about 11.5% accuracy.
Your dW* / db* gradients just have the wrong axes. Because of that the two bias gradients end up with the wrong shape, and your updates trash the weights every step, so the net hovers at chance (≈ 10 %). Here is the forward and backward pass with the axes matched to your samples-as-rows layout (shapes annotated):
m = x.shape[0] # samples in a batch
# ---------- forward ----------
Z1 = x @ W1 + b1 # (m,784)·(784,10) = (m,10)
A1 = np.maximum(Z1, 0)
Z2 = A1 @ W2 + b2 # (m,10)
A2 = softmax(Z2) # (m,10)
# ---------- backward ----------
dZ2 = A2 - y_hot # (m,10)
dW2 = A1.T @ dZ2 / m # (10,10)
db2 = dZ2.sum(0) / m # (10,)
dZ1 = (dZ2 @ W2.T) * (Z1 > 0) # (m,10)
dW1 = x.T @ dZ1 / m # (784,10)
db1 = dZ1.sum(0) / m # (10,)
# ---------- SGD step ----------
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
(Notice that the .T is always on the left matrix in each product, so no extra transposes are needed in the update.)
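A quick sanity check (not in the original code, just a suggestion) is to assert that every gradient has the same shape as the parameter it updates; that catches this class of bug immediately:
# Optional sanity check: each gradient must match its parameter's shape.
assert dW1.shape == W1.shape and db1.shape == b1.shape   # (784, 10) and (10,)
assert dW2.shape == W2.shape and db2.shape == b2.shape   # (10, 10) and (10,)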
A numerically-safe softmax (subtracting the row-wise maximum rather than the global one) helps too:
def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)
With these fixes (plus e.g. He initialisation and a smaller learning
rate like 0.01) the same two-layer net reaches ~93 % on MNIST in 15–20
epochs.
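For completeness, a minimal He-initialisation sketch (the sqrt(2 / fan_in) scaling is the standard rule for ReLU layers, not something from the original post):
# He initialisation: scale weights by sqrt(2 / fan_in) so ReLU activations
# keep a reasonable variance; start biases at zero.
rng = np.random.default_rng(0)          # seeded generator, purely for reproducibility
W1 = rng.standard_normal((784, 10)) * np.sqrt(2 / 784)
b1 = np.zeros(10)
W2 = rng.standard_normal((10, 10)) * np.sqrt(2 / 10)
b2 = np.zeros(10)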