
Converting to Python scalars


I am implementing a SARSA reinforcement learning function, which chooses each action by following the same policy that it uses to update its Q-values.

It raises the following errors:

 TypeError: only size-1 arrays can be converted to Python scalars

 q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
 ValueError: setting an array element with a sequence.

I am assuming there is a problem with these lines:

q = np.zeros((env.n_states, env.n_actions))

and

q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
s, a = s_, a_

This is the entire method:

def sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    # environment, max number of episodes, initial learning rate, discount factor, exploration factor, seed

    random_state = np.random.RandomState(seed)

    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)
    q = np.zeros((env.n_states, env.n_actions))

    rewards = np.zeros(max_episodes)

    for i in range(max_episodes):
        print('starting game', i)

        observation = env.reset()
        s = observation
        rand = np.random.random()

        a = maxAction(q, s)
        done = False
        epRewards = 0
        while not done:
            observation_, reward, done = env.step(a)
            s_ = observation_
            rand = np.random.random()
            a_ = maxAction(q, s)
            epRewards += reward
            q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
            s, a = s_, a_
            epsilon -= 2/(max_episodes)
            rewards[i] = epRewards

    policy = q.argmax(axis=1)
    value = q.max(axis=1)

    return policy, value

Solution

  • After this line:

    eta = np.linspace(eta, 0, max_episodes)
    

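    the variable eta holds a NumPy array of max_episodes values instead of a single number. That is why the right-hand side here is a sequence, which cannot be stored in the single element q[s, a]: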

    q[s, a] = q[s, a] + eta * (reward + gamma * q[s_, a_] - q[s, a])
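
One possible fix, assuming the intent was a learning rate that decays linearly from one episode to the next, is to keep the array as a schedule and pick out the scalar for the current episode with eta[i]. The sketch below reproduces the failure and then applies that fix. The table size, the indices s, a, s_, a_, the reward, and the name eta_schedule are made up for illustration and are not taken from the question's environment.

```python
import numpy as np

max_episodes = 5
gamma = 0.9

# One learning rate per episode, decaying linearly from 0.5 to 0.
eta_schedule = np.linspace(0.5, 0, max_episodes)

# Hypothetical Q-table (4 states, 2 actions) and a single made-up transition.
q = np.zeros((4, 2))
s, a, s_, a_ = 0, 1, 2, 0
reward = 1.0

# Reproduce the problem: the whole schedule makes the right-hand side an
# array of length max_episodes, which does not fit into one element of q.
try:
    q[s, a] = q[s, a] + eta_schedule * (reward + gamma * q[s_, a_] - q[s, a])
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print("update with the full array failed:", exc)

# Fix: use the scalar learning rate that belongs to episode i.
for i in range(max_episodes):
    eta_i = eta_schedule[i]
    q[s, a] = q[s, a] + eta_i * (reward + gamma * q[s_, a_] - q[s, a])

print("q[s, a] after the scalar updates:", q[s, a])
```

In the original function this would mean using eta[i] in the update inside the episode loop. The same reasoning applies to epsilon, which is also turned into an array by np.linspace, so epsilon[i] would be the exploration rate for episode i.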