I am trying to implement the Episodic Semi-gradient Sarsa for estimating q described in Sutton's book to solve the Mountain Car task. To approximate q I want to use a neural network. Therefore, I came up with this code. But sadly my agent is not really learning to solve the task: in some episodes the solution is found very quickly (100-200 steps), but sometimes the agent needs more than 30k steps. I think I made some elementary mistake in my implementation, but I am not able to find it myself. Can someone help me and point out the error/mistake in my implementation?
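For reference, this is roughly the update I am trying to implement. It is a minimal sketch using a linear approximator q̂(s, a, w) = w·x(s, a); with a neural network the dot product would be replaced by a forward pass and the gradient by backprop, but the target is formed the same way (the function name and feature encoding are my own, not from the book):

```python
import numpy as np

def semi_gradient_sarsa_update(w, x, reward, x_next, alpha=0.01, gamma=1.0, done=False):
    """One semi-gradient Sarsa update for linear q̂(s, a, w) = w · x(s, a).

    x      -- feature vector of the current (state, action) pair
    x_next -- feature vector of the next (state, action) pair (ignored if done)
    """
    q = w @ x
    # Bootstrap target; treated as a constant (hence "semi"-gradient).
    target = reward if done else reward + gamma * (w @ x_next)
    # Gradient of w·x with respect to w is just x.
    return w + alpha * (target - q) * x
```

One step with zero weights and unit features moves every weight by alpha * reward, which is a quick sanity check that the update direction is right.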
I solved this problem by changing the structure of the network: instead of using the (state, action) pair as input to predict its Q-value, I changed it to the way DQN does it: the network predicts the values of all three possible actions for a given state, and the action is then chosen according to these predictions. I was not able to find the problem with my previous approach, but at least this is now working.