I need to write SGD-Perceptron for digit recognition according to this guidelines:
net
(weighted sum) and find y_j = f(net_j)
, where f(x)
is activation function. Also w_0j
are offset weights (bias weights), x0 = 1
ε = 0.5 * sum((d_j-y_j) ** 2)
, where d
is desired vector (zero vector with 1 across from desired digit, I think) created from selection (y_train
). I also think ε
is called loss function.ε < ε_threshold
, break the cycle.w_ij += -η * δ_j * x_i
, δ_j = -(d_j-y_j) ∙ f'(net_j)
, where η
is learning rate, ∙
operator is inner product, as I think.The problem is perceptron do not learn whatever values of η
, ε_threshold
and offset
(random initial weight offset) I set. I have similar results after every vector, for example:
Desired digit from y_train: 3
(index 3 in following arrays)
Weighted fum for each j: [-2.39 -2.21 -2.254 -2.49 -2.41 -2.16 -2.17 -2.37 -2.35 -2.32]
Error vector (desired-y_pred): [-0.083 -0.098 -0.095 0.924 -0.082 -0.102 -0.101 -0.085 -0.086 -0.088]
Which mean the weights point at all outputs equally probable, although error vector tries to adjust them correctly.
I use sigmoid (σ(x)
) activation function with derivative σ(x)*(1-σ(x))
.
Here is my code:
import random
import numpy as np
from keras.datasets import mnist
from scipy.misc import derivative
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Normalize data
X_train = X_train / 255.0
X_test = X_test / 255.0
# Constants from data
X_vector_len = 28*28
y_vector_len = 10
# Learning parameters
learning_rate = 0.05
offset = 0.03 # Random initial weight offset
e_threshold = 0.01
def activation_function(x):
return 1 / (1 + np.exp(-x))
def acivation_derivative(x):
return activation_function(x) * (1 - activation_function(x))
def train(X_train, y_train):
# Weight initialization
weights = np.random.rand(X_vector_len, y_vector_len) * offset * 2 - offset
bias = np.random.rand(y_vector_len) * offset * 2 - offset
epochs = 0
while True:
# Getting X and D vectors
s = random.randrange(len(X_train))
sample = np.append(X_train[s], [])
desired = np.zeros(y_vector_len, dtype=np.float64)
desired[y_train[s]] = 1
net = np.dot(sample, weights) + bias # Weighted sum
y_pred = activation_function(net)
e_vec = desired - y_pred
e = sum(e_vec ** 2) / 2 # Error ε for current vector
if e < e_threshold:
break
gradient = -e_vec.dot(acivation_derivative(net))
for j in range(y_vector_len):
for i in range(X_vector_len):
weights[i, j] -= learning_rate * gradient * float(sample[i])
bias[j] -= learning_rate * gradient
epochs += 1
return weights, bias, epochs
def test(X_test, y_test, weights, bias):
correct_predictions = 0
for j, text in enumerate(y_test):
sample = np.append(X_train[j], [])
net = np.dot(sample, weights) + bias
print(f"Test {j+1}: Desired {text}, prediction: {net.argmax()} ({net})")
if text == net.argmax():
correct_predictions += 1
return correct_predictions / len(X_test)
if __name__ == "__main__":
weights, bias, epochs = train(X_train, y_train)
print(f"Model trained for {epochs} epochs")
accuracy = test(X_test, y_test, weights, bias)
print(f"Accuracy for test selection: {accuracy * 100:.2f}%")
Searching for mistakes in formulas and reading some implementations and math theory about SGD didn't help me. I only got more questions because implementations are different:
Why neural network does not learn despite the error vector contains right coefficients (positive for desired, negative for other digits)
The gradient δ
is a number as it is an inner product, what is δ_j
? If I should use δ
as δ_j
, doesn't it mix the weights? The value of error_j = d_j - y_j
definitely must affect specific outputs. I tried to use both δ
and δ_j = -(d_j-y_j) * f'(net_j)
(regular product) with same results.
Sigmoid function has range of values = (0, 1)
, leading to the values of error vector either 1-σ(x)
or 0-σ(x)
, in other words, (-1, 1)
. Am I right the error vector is one of the coefficients for weights and it must be positive if the result is correct and vice versa, while other coefficients always non-negative? But how then other activation functions, e.g. f(x) = x
, f(x) = arctg(x)
with different range of values work?
I have used unit step function as activation function in the other algorithm, is it applicable for SGD?
Can I use numpy functions to adjust weights somehow? I need to multiply all w_ij
values by x_i
in rows.
I've found the answers by myself, and I think this detailed answer will be useful to others (see the questions above):
ε_threshold
value was too high. I'd never written SGD before and I tried smaller values of it, but went through this inattentively. Although even 0.03
value leads to >80%
accuracy and the problem was the combination of some mistakes and misunderstandings (as always be).w_j
. The gradient vector is δ = -(d-y)*f'(net)
(in the code two negatives make a positive when adjusting weights).d
. For example, tanh(x)
has range of values = (-1, 1)
, and d
vector should have maximum (1
) for desired value and minimum (-1
) for other values. Such way the range of values of error vector always stays (1-, 1)
. Don't confuse it.0
.np.outer(x, y)
.I chose the value of ε_threshold = 3e-6
as the most accuracy/learning speed optimized with 89-91%
accuracy and 68,000
epochs. This is average accuracy for the MNIST dataset.
Here is the working code and the list of changes:
reshape()
operation instead of np.append()
for each vector.0.01
, ε_threshold
set to 3e-6
.np.outer()
.import random
import numpy as np
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten and normalize data
X_train = X_train.reshape(X_train.shape[0], -1)
X_test = X_test.reshape(X_test.shape[0], -1)
X_train = X_train / 255.0
X_test = X_test / 255.0
# Constants from data
X_vector_len = 28*28
y_vector_len = 10
# Learning parameters
learning_rate = 0.01
offset = 0.03 # Random initial weight offset
e_threshold = 3e-6
def activation_function(x):
return 1 / (1 + np.exp(-x))
def activation_derivative(x):
return activation_function(x) * (1 - activation_function(x))
def train(X_train, y_train):
# Weight initialization
weights = np.random.rand(X_vector_len, y_vector_len) * offset * 2 - offset
bias = np.random.rand(y_vector_len) * offset * 2 - offset
epochs = 0
while True:
# Getting X and D vectors
s = random.randrange(len(X_train))
sample = X_train[s]
desired = np.zeros(y_vector_len, dtype=np.float64)
desired[y_train[s]] = 1
net = np.dot(sample, weights) + bias # Weighted sum
y_pred = activation_function(net)
e_vec = desired - y_pred
e = np.sum(e_vec ** 2) / 2 # Error ε for current vector
if e < e_threshold:
break
delta = e_vec * activation_derivative(net)
weights += learning_rate * np.outer(sample, delta)
bias += learning_rate * delta
epochs += 1
return weights, bias, epochs
def test(X_test, y_test, weights, bias):
correct_predictions = 0
for j, text in enumerate(y_test):
sample = X_test[j]
net = np.dot(sample, weights) + bias
print(f"Test {j+1}: Desired {text}, prediction: {net.argmax()} ({net})")
if text == net.argmax():
correct_predictions += 1
return correct_predictions / len(X_test)
if __name__ == "__main__":
weights, bias, epochs = train(X_train, y_train)
accuracy = test(X_test, y_test, weights, bias)
print(f"Model trained for {epochs} epochs")
print(f"Accuracy for test selection: {accuracy * 100:.2f}%")