value-iteration

Why is Policy Iteration faster than Value Iteration?


We know that policy iteration gives us the policy directly and hence is faster, but can anyone explain this with some examples?


Solution

  • The reason policy iteration is faster is that a single policy can correspond to infinitely many value functions. So when policy iteration jumps from one policy to another, it has effectively jumped over that whole family of value functions in one step.

    For example:

    p1 = [0, 1, 1]

    is a policy for 3 states and 2 actions: it chooses action 0 in state 0 and action 1 in states 1 and 2.

    Now, let's consider two value functions, each indexed by state and action:

    v1 = [[0.9, 0.6], [0.6, 0.8], [0.8, 0.9]]

    v2 = [[0.9, 0.6], [0.7, 0.8], [0.6, 0.9]]

    Here, both v1 and v2 induce the same greedy policy p1, so policy iteration treats them as equivalent: what matters is the policy they map to, not the exact values. When you update the policy, you effectively discard the whole family of value functions that induces it, each of which value iteration might (in the worst case) have stepped through one update at a time. The sketch below makes this concrete.
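    A minimal sketch of this idea, assuming NumPy is available and treating v1 and v2 as action-value tables with one row per state and one column per action (both assumptions for illustration, not part of the original answer):

    import numpy as np

    # Action-value tables from the example above: rows are states, columns are actions.
    v1 = np.array([[0.9, 0.6], [0.6, 0.8], [0.8, 0.9]])
    v2 = np.array([[0.9, 0.6], [0.7, 0.8], [0.6, 0.9]])

    def greedy_policy(q):
        # For each state, pick the action with the largest value.
        return np.argmax(q, axis=1)

    print(greedy_policy(v1))  # [0 1 1] -- the policy p1
    print(greedy_policy(v2))  # [0 1 1] -- also p1, even though the values differ

    Value iteration keeps refining these numbers even after the greedy policy has stopped changing; policy iteration looks at the policy itself and can stop as soon as it is stable.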