I am trying to simulate an XOR gate using a neural network similar to this:
Now I understand that each neuron has a certain number of weights and a bias. I am using a sigmoid function to determine whether a neuron should fire or not in each state (I use "firing" in a loose sense, since with a sigmoid rather than a step function the neuron actually outputs real values).
I successfully ran the simulation for the feed-forward part, and now I want to use the backpropagation algorithm to update the weights and train the model. The question is, for each combination of x1 and x2 there is a separate result (4 different combinations in total), so under different input pairs a different error distance (the difference between the desired output and the actual result) is computed, and subsequently a different set of weight updates would eventually be obtained. This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
How should we decide on the right weight updates?
Say we repeat the backpropagation for a single input pair until it converges; but what if we would have converged to a different set of weights had we chosen another pair of inputs?
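For reference, the feed-forward part I have working looks roughly like this (a minimal sketch assuming a 2-2-1 architecture with sigmoid activations; the weight and bias values are just placeholders, not my actual ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder parameters for a 2-2-1 network (2 hidden neurons, 1 output neuron)
W1 = np.array([[0.5, -0.4],
               [0.3,  0.8]])      # hidden-layer weights
b1 = np.array([0.1, -0.2])        # hidden-layer biases
W2 = np.array([0.7, -0.6])        # output-layer weights
b2 = 0.05                         # output-layer bias

def forward(x):
    h = sigmoid(W1 @ x + b1)       # hidden activations
    return sigmoid(W2 @ h + b2)    # output is a real value in (0, 1), not a hard 0/1

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the 4 input combinations
targets = np.array([0, 1, 1, 0])                 # XOR truth table

for x_i, t_i in zip(X, targets):
    print(x_i, "->", forward(x_i), "desired:", t_i)
```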
Now I understand that each neuron has certain weights. I am using a sigmoid function to determine whether a neuron should fire or not in each state.
You do not really "decide" this; typical MLPs do not "fire", they output real values. There are neural networks which actually fire (like RBMs), but that is a completely different model.
This means we would get 4 different sets of weight updates, one for each input pair, by using backpropagation.
This is actually a feature. Let's start from the beginning. You try to minimize some loss function over your whole training set (in your case, 4 samples), which is of the form:
L(theta) = SUM_i l(f(x_i), y_i)
where l is some loss function, f(x_i) is your current prediction and y_i the true value. You do this by gradient descent, so you try to compute the gradient of L and go against it:

grad L(theta) = grad SUM_i l(f(x_i), y_i) = SUM_i grad l(f(x_i), y_i)

What you now call "a single update" is grad l(f(x_i), y_i) for a single training pair (x_i, y_i).
Usually you would not use this; instead you would sum (or take the average of) the updates across the whole dataset, as this is your true gradient. However, in practice this might not be computationally feasible (the training set is usually quite large), and furthermore it has been shown empirically that more "noise" in training is usually better.

Thus another learning technique emerged, called stochastic gradient descent, which, in short, shows that under some light assumptions (like an additive loss function etc.) you can actually do your "small updates" independently, and you will still converge to a local minimum! In other words, you can do your updates "point-wise" in random order and you will still learn. Will it always be the same solution? No. But this is also true for computing the whole gradient: optimization of non-convex functions is nearly always non-deterministic (you find some local solution, not the global one).
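To make this concrete, here is a minimal sketch of both schemes on the XOR data, assuming a 2-2-1 sigmoid network, a squared error loss l(f(x), y) = 0.5*(f(x) - y)^2, and hand-picked learning rate and iteration count (all of these choices are assumptions, not something fixed by the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params():
    # Small random weights; the 2-2-1 shape and the init scale are assumptions
    return {"W1": rng.normal(0.0, 1.0, (2, 2)), "b1": np.zeros(2),
            "W2": rng.normal(0.0, 1.0, 2),      "b2": 0.0}

def grad_single(p, x, t):
    """Backprop gradient of l(f(x), t) = 0.5 * (f(x) - t)**2 for ONE pair (x, t)."""
    h = sigmoid(p["W1"] @ x + p["b1"])        # hidden activations
    o = sigmoid(p["W2"] @ h + p["b2"])        # network output f(x)
    d2 = (o - t) * o * (1 - o)                # output-layer delta
    d1 = d2 * p["W2"] * h * (1 - h)           # hidden-layer deltas
    return {"W1": np.outer(d1, x), "b1": d1, "W2": d2 * h, "b2": d2}

def step(p, g, lr):
    return {k: p[k] - lr * g[k] for k in p}

# (1) Full-batch gradient descent: sum the 4 per-pair gradients, then update once.
p = init_params()
for _ in range(5000):                         # iteration count and lr are arbitrary
    grads = [grad_single(p, x_i, t_i) for x_i, t_i in zip(X, y)]
    total = {k: sum(g[k] for g in grads) for k in p}
    p = step(p, total, lr=0.5)

# (2) Stochastic gradient descent: apply each per-pair update immediately,
#     visiting the pairs in random order each epoch.
q = init_params()
for _ in range(5000):
    for i in rng.permutation(len(X)):
        q = step(q, grad_single(q, X[i], y[i]), lr=0.5)

for x_i, t_i in zip(X, y):
    print(x_i,
          sigmoid(p["W2"] @ sigmoid(p["W1"] @ x_i + p["b1"]) + p["b2"]),
          sigmoid(q["W2"] @ sigmoid(q["W1"] @ x_i + q["b1"]) + q["b2"]),
          t_i)
```

Whether the two runs end up with the same weights depends on the initialization and on the order of the point-wise updates, which is exactly the non-determinism described above.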