pythonnumpymachine-learningmathstochastic-gradient

Stochastic Gradient Descent algorithm does not work properly (Python)


I need to write SGD-Perceptron for digit recognition according to this guidelines:

  1. The selection is MNIST database with 60000 training samples.
  2. Get random vector and calculate net (weighted sum) and find y_j = f(net_j), where f(x) is activation function. Also w_0j are offset weights (bias weights), x0 = 1
  3. Find ε = 0.5 * sum((d_j-y_j) ** 2), where d is desired vector (zero vector with 1 across from desired digit, I think) created from selection (y_train). I also think ε is called loss function.
  4. If ε < ε_threshold, break the cycle.
  5. Adjust the weights: w_ij += -η * δ_j * x_i, δ_j = -(d_j-y_j) ∙ f'(net_j), where η is learning rate, operator is inner product, as I think.
  6. Repeat steps 2-5 until the cycle breaks.

The problem is perceptron do not learn whatever values of η, ε_threshold and offset (random initial weight offset) I set. I have similar results after every vector, for example:

Desired digit from y_train: 3 (index 3 in following arrays)
Weighted fum for each j: [-2.39 -2.21 -2.254 -2.49 -2.41 -2.16 -2.17 -2.37 -2.35 -2.32]
Error vector (desired-y_pred): [-0.083 -0.098 -0.095 0.924 -0.082 -0.102 -0.101 -0.085 -0.086 -0.088]

Which mean the weights point at all outputs equally probable, although error vector tries to adjust them correctly.

I use sigmoid (σ(x)) activation function with derivative σ(x)*(1-σ(x)).

Here is my code:

import random
import numpy as np
from keras.datasets import mnist
from scipy.misc import derivative

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize data
X_train = X_train / 255.0
X_test = X_test / 255.0

# Constants from data
X_vector_len = 28*28
y_vector_len = 10

# Learning parameters
learning_rate = 0.05
offset = 0.03  # Random initial weight offset
e_threshold = 0.01


def activation_function(x):
    return 1 / (1 + np.exp(-x))


def acivation_derivative(x):
    return activation_function(x) * (1 - activation_function(x))


def train(X_train, y_train):
    # Weight initialization
    weights = np.random.rand(X_vector_len, y_vector_len) * offset * 2 - offset
    bias = np.random.rand(y_vector_len) * offset * 2 - offset

    epochs = 0
    while True:
        # Getting X and D vectors
        s = random.randrange(len(X_train))
        sample = np.append(X_train[s], [])
        desired = np.zeros(y_vector_len, dtype=np.float64)
        desired[y_train[s]] = 1

        net = np.dot(sample, weights) + bias  # Weighted sum
        y_pred = activation_function(net) 
        e_vec = desired - y_pred
        e = sum(e_vec ** 2) / 2  # Error ε for current vector

        if e < e_threshold:
            break

        gradient = -e_vec.dot(acivation_derivative(net))
        for j in range(y_vector_len):
            for i in range(X_vector_len):
                weights[i, j] -= learning_rate * gradient * float(sample[i])
            bias[j] -= learning_rate * gradient
        epochs += 1
    return weights, bias, epochs


def test(X_test, y_test, weights, bias):
    correct_predictions = 0
    for j, text in enumerate(y_test):
        sample = np.append(X_train[j], [])
        net = np.dot(sample, weights) + bias
        print(f"Test {j+1}: Desired {text}, prediction: {net.argmax()} ({net})")
        if text == net.argmax():
            correct_predictions += 1
    return correct_predictions / len(X_test)


if __name__ == "__main__":
    weights, bias, epochs = train(X_train, y_train)
    print(f"Model trained for {epochs} epochs")

    accuracy = test(X_test, y_test, weights, bias)
    print(f"Accuracy for test selection: {accuracy * 100:.2f}%")

Searching for mistakes in formulas and reading some implementations and math theory about SGD didn't help me. I only got more questions because implementations are different:

  1. Why neural network does not learn despite the error vector contains right coefficients (positive for desired, negative for other digits)

  2. The gradient δ is a number as it is an inner product, what is δ_j? If I should use δ as δ_j, doesn't it mix the weights? The value of error_j = d_j - y_j definitely must affect specific outputs. I tried to use both δ and δ_j = -(d_j-y_j) * f'(net_j) (regular product) with same results.

  3. Sigmoid function has range of values = (0, 1), leading to the values of error vector either 1-σ(x) or 0-σ(x), in other words, (-1, 1). Am I right the error vector is one of the coefficients for weights and it must be positive if the result is correct and vice versa, while other coefficients always non-negative? But how then other activation functions, e.g. f(x) = x, f(x) = arctg(x) with different range of values work?

  4. I have used unit step function as activation function in the other algorithm, is it applicable for SGD?

  5. Can I use numpy functions to adjust weights somehow? I need to multiply all w_ij values by x_i in rows.


Solution

  • I've found the answers by myself, and I think this detailed answer will be useful to others (see the questions above):

    1. The ε_threshold value was too high. I'd never written SGD before and I tried smaller values of it, but went through this inattentively. Although even 0.03 value leads to >80% accuracy and the problem was the combination of some mistakes and misunderstandings (as always be).
    2. Yes, the error for individual output must affect appropriate row of weights w_j. The gradient vector is δ = -(d-y)*f'(net) (in the code two negatives make a positive when adjusting weights).
    3. If you want to use another activation function, you should correct the initial desired vector d. For example, tanh(x) has range of values = (-1, 1), and d vector should have maximum (1) for desired value and minimum (-1) for other values. Such way the range of values of error vector always stays (1-, 1). Don't confuse it.
    4. Unit step function is a threshold function, not activation function. At first, it is rarely used in neural networks, and, at second, it can't be used in SGD as the derivative of simple unit step function is 0.
    5. Sure, we can use np.outer(x, y).

    I chose the value of ε_threshold = 3e-6 as the most accuracy/learning speed optimized with 89-91% accuracy and 68,000 epochs. This is average accuracy for the MNIST dataset.

    Here is the working code and the list of changes:

    1. Data flattens with one reshape() operation instead of np.append() for each vector.
    2. Learning rate set to 0.01, ε_threshold set to 3e-6.
    3. Weights adjust using np.outer().
    4. Typos fix and some changes that does not affect learning process.
    import random
    import numpy as np
    from keras.datasets import mnist
    
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    
    # Flatten and normalize data
    X_train = X_train.reshape(X_train.shape[0], -1)
    X_test = X_test.reshape(X_test.shape[0], -1)
    X_train = X_train / 255.0
    X_test = X_test / 255.0
    
    # Constants from data
    X_vector_len = 28*28
    y_vector_len = 10
    
    # Learning parameters
    learning_rate = 0.01
    offset = 0.03  # Random initial weight offset
    e_threshold = 3e-6
    
    
    def activation_function(x):
        return 1 / (1 + np.exp(-x))
    
    
    def activation_derivative(x):
        return activation_function(x) * (1 - activation_function(x))
    
    
    def train(X_train, y_train):
        # Weight initialization
        weights = np.random.rand(X_vector_len, y_vector_len) * offset * 2 - offset
        bias = np.random.rand(y_vector_len) * offset * 2 - offset
    
        epochs = 0
        while True:
            # Getting X and D vectors
            s = random.randrange(len(X_train))
            sample = X_train[s]
            desired = np.zeros(y_vector_len, dtype=np.float64)
            desired[y_train[s]] = 1
    
            net = np.dot(sample, weights) + bias  # Weighted sum
            y_pred = activation_function(net)
            e_vec = desired - y_pred
            e = np.sum(e_vec ** 2) / 2  # Error ε for current vector
    
            if e < e_threshold:
                break
    
            delta = e_vec * activation_derivative(net)
            weights += learning_rate * np.outer(sample, delta)
            bias += learning_rate * delta
    
            epochs += 1
        return weights, bias, epochs
    
    
    def test(X_test, y_test, weights, bias):
        correct_predictions = 0
        for j, text in enumerate(y_test):
            sample = X_test[j]
            net = np.dot(sample, weights) + bias
            print(f"Test {j+1}: Desired {text}, prediction: {net.argmax()} ({net})")
            if text == net.argmax():
                correct_predictions += 1
        return correct_predictions / len(X_test)
    
    
    if __name__ == "__main__":
        weights, bias, epochs = train(X_train, y_train)
        accuracy = test(X_test, y_test, weights, bias)
    
        print(f"Model trained for {epochs} epochs")
        print(f"Accuracy for test selection: {accuracy * 100:.2f}%")