I've implemented the gridworld example from the book "Reinforcement Learning: An Introduction, second edition" by Richard S. Sutton and Andrew G. Barto, Chapter 4, sections 4.1 and 4.2, page 80.
Here is my implementation: https://github.com/ozrentk/dynamic-programming-gridworld-playground
The original algorithm seems to have a bug, since the value function (mapping) is updated one value at a time directly in the source mapping structure. Why is that incorrect? It means that inside the loop over each s in S, within the same evaluation sweep, the value of a later state (e.g. s_2) is computed from values that were already overwritten earlier in that sweep (e.g. s_1), instead of from the values as they were at the start of the sweep. In my implementation this is avoided with a double-buffering technique: an additional buffer holds the new values for all states in S. It also means that the program uses more memory because of that buffer.
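To make the difference concrete, here is a toy sketch (not my gridworld code; just two made-up states that depend on each other, with reward 1 and gamma 0.9 chosen only for illustration):

    # Toy illustration: v[0] and v[1] are each updated from the other,
    # to show the in-place vs. two-buffer difference within one sweep.
    gamma, reward = 0.9, 1.0
    v = [0.0, 0.0]

    # In-place sweep: the update of v[1] already sees the new v[0]
    v_inplace = v[:]
    v_inplace[0] = reward + gamma * v_inplace[1]   # 1.0
    v_inplace[1] = reward + gamma * v_inplace[0]   # 1.9  (uses the value updated this sweep)

    # Two-buffer sweep: both updates read the values from the start of the sweep
    v_old = v[:]
    v_new = [reward + gamma * v_old[1],            # 1.0
             reward + gamma * v_old[0]]            # 1.0

    print(v_inplace, v_new)                        # [1.0, 1.9] vs [1.0, 1.0]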
I must admit that I'm not 100% sure if this is a bug, or if I misinterpreted the algorithm. Generally, this is the code I'm using:
...
while True:
    delta = 0
    # NOTE: algorithm modified a bit, additional buffer new_values introduced
    # Barto & Sutton seem to have a bug in their algorithm (iterative estimation does not fit figure 4.1)
    # Instead of tracking one state value inside the loop, we track the entire state value function
    # mapping outside that loop. Also note that after this change the algorithm consumes more memory.
    new_values = [0] * states_n
    for s in non_terminal_states:
        # Evaluate state value under current policy
        next_states = get_next_states(s, policy[s], world_size)
        sum_items = [p * (reward + gamma * values[s_next]) for s_next, p in next_states.items()]
        new_values[s] = sum(sum_items)
        # Track delta
        delta = max(delta, abs(values[s] - new_values[s]))
    # (now we switch the state value function buffer, instead of switching single state values in the loop)
    values = new_values

    if use_policy_improvement:
        # Policy improvement is done inside improve_policy(); if the new policy is no better than
        # the old one, is_policy_stable is returned as True
        is_policy_stable, improved_policy = improve_policy()
        if is_policy_stable:
            print("Policy is stable.")
            break
        else:
            print("- Improving policy... ----------------")
            policy = improved_policy
            visualize_policy(policy, states, world_size)

    # In case we don't track policy improvement, we need to track delta for convergence
    if delta < theta:
        break

    # Track iteration count
    k += 1
...
Am I wrong, or is there a problem with the policy evaluation part of the algorithm in the book?
The original algorithm is the "asynchronous" (in-place) version of policy evaluation, and your implementation using two buffers is the "synchronous" version. Both are correct.
The "asynchronous version" also converge to the optimal solution(You can find the proof in book Parallel and Distributed Computation: Numerical Methods). And as you may find in the book, it "usually converges faster".
I find that this link provides a good explanation.