I have a general question about the methods that use importance sampling in RL. What happens when the probability of either one of the policies is 0?
Assuming b(a|s) is the probability of taking action a in state s under the behaviour policy, and π(a|s) is the corresponding probability under the target policy, there are two cases.

If π(a|s) is 0 and b(a|s) > 0, then the ratio π(a|s) / b(a|s) becomes 0, which just means the return arising from this action in that state is weighted to zero when making updates to the Q table for the preceding states. In simple words, this is not a problem and the Monte Carlo algorithm should still converge.
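To make this concrete, here is a minimal sketch (my own illustration, not the book's pseudocode) of off-policy every-visit Monte Carlo Q estimation with ordinary importance sampling on a single invented two-step episode. The policies and episode are made up; the point is that the ratio of 0 at the second step zeroes the return that gets backed up to the first (state, action) pair:

```python
from collections import defaultdict

# Hypothetical tabular policies pi[s][a] and b[s][a] and a tiny two-step
# episode, invented purely to illustrate the ratio.
pi = {"s0": {"a0": 1.0, "a1": 0.0}, "s1": {"a0": 0.0, "a1": 1.0}}  # target
b = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.5, "a1": 0.5}}   # behaviour

episode = [("s0", "a0", 1.0), ("s1", "a0", 2.0)]  # (state, action, reward)
gamma = 1.0

# Ordinary importance sampling: average W * G over visits, where W is the
# product of pi/b ratios for the steps *after* the visited (state, action).
returns = defaultdict(list)
G, W = 0.0, 1.0
for state, action, reward in reversed(episode):
    G = gamma * G + reward
    returns[(state, action)].append(W * G)       # W = 0 zeroes this return
    W *= pi[state][action] / b[state][action]    # ratio applied to earlier pairs

Q = {sa: sum(g) / len(g) for sa, g in returns.items()}
print(Q)  # {('s1', 'a0'): 2.0, ('s0', 'a0'): 0.0}
```

With weighted importance sampling the effect is similar: the cumulative weight W becomes 0, so the episode simply contributes nothing to the earlier pairs, which is why the book's algorithm can exit the inner loop once W = 0.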
The case where b(a|s) is 0 and π(a|s) > 0 should not arise in the first place if we choose a behaviour policy that has "coverage" of the target policy. If we choose a behaviour policy that doesn't have coverage, then we simply don't learn accurate action-value estimates in the Q table for those (state, action) pairs the behaviour policy never explores, and we can't expect convergence. In the words of Sutton and Barto in their Reinforcement Learning book,
In order to use episodes from b to estimate values for π, we require that
every action taken under π is also taken, at least occasionally, under b.
That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the
assumption of coverage.
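As a tiny illustrative helper (my own, not from the book), a coverage check for tabular policies stored as nested dictionaries of action probabilities:

```python
def has_coverage(pi, b):
    """True if b(a|s) > 0 whenever pi(a|s) > 0, i.e. the coverage assumption holds."""
    return all(b.get(s, {}).get(a, 0.0) > 0.0
               for s, actions in pi.items()
               for a, p in actions.items() if p > 0.0)
```

Any epsilon-soft behaviour policy (b(a|s) > 0 for every state and action) trivially satisfies coverage for every possible target policy, which is why such policies are the usual choice for off-policy Monte Carlo methods.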