I am implementing a SARSA reinforcement learning function that chooses each action with the same policy it uses to update its Q-values.
Running it raises the following error:
TypeError: only size-1 arrays can be converted to Python scalars
q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
ValueError: setting an array element with a sequence.
I am assuming there is a problem with these lines:
q = np.zeros((env.n_states, env.n_actions))
and
q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
s, a = s_, a_
This is the entire method:
def sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    #environments, max number of episodes, initial learning rate, discount factor, exploration factor, seed
    random_state = np.random.RandomState(seed)
    eta =np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)
    q = np.zeros((env.n_states, env.n_actions))
    rewards = np.zeros(max_episodes)
    for i in range(max_episodes):
        print('starting game', i)
        observation = env.reset();
        s = observation
        rand = np.random.random();
        a = maxAction(q, s)
        done = False
        epRewards = 0
        while not done:
            observation_, reward, done = env.step(a)
            s_ = observation_
            rand = np.random.random()
            a_ = maxAction(q, s)
            epRewards += reward
            q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
            s, a = s_, a_
        epsilon -= 2/(max_episodes)
        rewards[i] = epRewards
    policy = q.argmax(axis=1)
    value = q.max(axis=1)
    return policy, value
After this line:
eta = np.linspace(eta, 0, max_episodes)
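the variable eta holds a NumPy array of max_episodes values instead of a single number, so the right-hand side of this assignment is an array (a sequence) that cannot be stored in the single element q[s, a]: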
q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
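One way to get a scalar back is to index the schedule with the current episode number. The snippet below is a minimal sketch of that idea, not the full method: the name eta_schedule, the table size, and the values of s, a, s_, a_, reward and gamma are made up for illustration only.

```python
import numpy as np

max_episodes = 5
gamma = 0.9
# Hypothetical name: one learning rate per episode, decaying from 0.5 to 0.
eta_schedule = np.linspace(0.5, 0, max_episodes)

q = np.zeros((3, 2))                      # toy table: 3 states, 2 actions
s, a, s_, a_, reward = 0, 1, 2, 0, 1.0    # made-up transition

# Multiplying by the whole array gives an array of 5 values,
# which cannot be assigned to the single element q[s, a].
raised = False
try:
    q[s, a] = q[s, a] + eta_schedule * (reward + gamma * q[s_, a_] - q[s, a])
except (TypeError, ValueError) as exc:
    raised = True
    print("fails:", exc)

# Indexing with the episode number i yields one scalar learning rate.
i = 0
q[s, a] = q[s, a] + eta_schedule[i] * (reward + gamma * q[s_, a_] - q[s, a])
print(q[s, a])  # → 0.5
```

In the method from the question, this would mean writing eta[i] in the update inside the for i in range(max_episodes) loop. epsilon is built with np.linspace in the same way, so it would need the same indexing wherever a single exploration rate is expected.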