I have created a neural network Q-learner using the same idea as DeepMind's Atari DQN, except that I feed it the raw board state instead of images (for now).
Neural network build (a minimal sketch follows the list):
9 inputs (0 for an empty spot, 1 for "X", -1 for "O")
1 hidden layer with 9-50 neurons (I tried different sizes; sigmoid activation)
9 outputs (one per action, each giving a Q-value; sigmoid activation)
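For reference, here is a minimal sketch of that architecture, assuming PyTorch; the actual project may use a different framework, and the class and parameter names here are illustrative only:

    import torch.nn as nn

    class TicTacToeQNet(nn.Module):
        """Sketch of the described network: 9 board inputs, one sigmoid
        hidden layer, and 9 sigmoid outputs (one Q-value per cell)."""
        def __init__(self, hidden_size=36):  # any size in the 9-50 range tried
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(9, hidden_size),   # inputs: 0 empty, 1 for X, -1 for O
                nn.Sigmoid(),                # sigmoid hidden activation
                nn.Linear(hidden_size, 9),
                nn.Sigmoid(),                # sigmoid output, range [0, 1]
            )

        def forward(self, board):
            return self.net(board)           # one Q-value per board cell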
I'm 100% confident the network is built correctly, based on gradient checks and lots of tests.
Q-parameters:
Problem
All my Q-values go to zero if I give a -1 reward when a move is made to an already occupied spot. If I don't give that penalty, the network never learns that it shouldn't move to occupied spots and seems to learn arbitrary Q-values. My error doesn't seem to shrink either.
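To make the setup concrete, here is a hedged sketch of how such a -1 penalty enters a standard Q-learning target; the names (q_net, board, gamma) are placeholders, not the project's actual API:

    import torch

    def step_reward(board, action):
        # -1 for playing on an already occupied cell, as described above
        if board[action] != 0:
            return -1.0
        return 0.0  # terminal win/draw rewards would be assigned elsewhere

    def td_target(q_net, reward, next_board, done, gamma=0.9):
        # Standard Q-learning target; note that a sigmoid output layer
        # can never actually reach a target of -1.
        if done:
            return torch.tensor(reward)
        with torch.no_grad():
            return reward + gamma * q_net(next_board).max()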
Solutions that didn't work
I tried changing the rewards to (0, 0.5, 1) and (0, 1), but it still didn't learn.
I tried representing the state as 0 for empty, 0.5 for O, and 1 for X, but that didn't work either.
I tried giving the next state immediately after the move is made, but it didn't help.
I tried both Adam and vanilla backprop, but got the same results.
Project on GitHub: https://github.com/Dopet/tic-tac-toe (sorry for the ugly code, mostly due to all the refactoring; this was also only supposed to be a quick test of whether the algorithm works)
Main points:
It was a matter of the rewards and of removing the activation function from the output layer. Most of the time I had rewards in [-1, 1], while my output layer activation was a sigmoid, whose range is [0, 1]. The network therefore always had a residual error whenever it was rewarded with -1, because its output can never go below zero. Trying to reduce that unreachable target pushed the values toward zero, the closest the sigmoid output can get to -1.
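As a sketch of the fix (again assuming PyTorch, with illustrative names), the output layer simply loses its sigmoid so the Q-values can take negative targets such as -1:

    import torch.nn as nn

    class TicTacToeQNetFixed(nn.Module):
        def __init__(self, hidden_size=36):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(9, hidden_size),
                nn.Sigmoid(),                # the hidden activation can stay sigmoid
                nn.Linear(hidden_size, 9),   # linear output: Q-values are now unbounded
            )

        def forward(self, board):
            return self.net(board)

This also matches the DQN setup, where the output layer is linear because Q-values are unbounded regression targets.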