probability, reinforcement-learning, sampling

Probability 0 in Importance Sampling


I have a general question about methods that use importance sampling in RL. What happens when either the target policy or the behavior policy assigns probability 0 to an action?


Solution

  • Assuming π is the target policy and b is the behavior policy, the
    ordinary importance-sampling ratio for a trajectory segment from time
    t to T is

    ρ_{t:T-1} = ∏_{k=t}^{T-1} π(A_k|S_k) / b(A_k|S_k)

    Then, there are two cases. If π(A_k|S_k) = 0 for some step k, the
    ratio is simply 0: that trajectory could never have been generated by
    π, so its return correctly receives zero weight in the estimate. If
    b(A_k|S_k) = 0, the ratio is undefined (division by zero); but this
    cannot happen for observed data, since the action A_k was in fact
    sampled from b, and the coverage assumption below rules it out for
    every action π might take.

    In the words of Sutton and Barto in their book Reinforcement Learning: An Introduction,

    In order to use episodes from b to estimate values for π, we require that 
    every action taken under π is also taken, at least occasionally, under b. 
    That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the 
    assumption of coverage.
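
A minimal sketch of how the two cases play out in code, assuming hypothetical tabular policies stored as nested dicts (the function importance_ratio and the example policies below are illustrative, not from the original answer):

    def importance_ratio(traj, target_policy, behavior_policy):
        """Ordinary importance-sampling ratio for one trajectory.

        traj: list of (state, action) pairs actually generated by the
              behavior policy, so behavior_policy[s][a] > 0 for each pair.
        target_policy, behavior_policy: hypothetical tabular policies,
              dicts mapping state -> {action: probability}.
        """
        rho = 1.0
        for s, a in traj:
            pi = target_policy[s][a]
            b = behavior_policy[s][a]
            # Case 1: the target policy never takes this action, so the
            # whole trajectory gets weight 0 -- it could not occur under pi.
            if pi == 0.0:
                return 0.0
            # Case 2: b == 0 would mean dividing by zero, but it cannot
            # arise for data generated by b; coverage (pi(a|s) > 0 implies
            # b(a|s) > 0) guards the remaining off-data case.
            rho *= pi / b
        return rho

    # Hypothetical one-state example: b is uniform, pi is deterministic.
    target = {"s0": {"left": 1.0, "right": 0.0}}
    behavior = {"s0": {"left": 0.5, "right": 0.5}}

    print(importance_ratio([("s0", "left")], target, behavior))   # 2.0
    print(importance_ratio([("s0", "right")], target, behavior))  # 0.0

Returning 0 as soon as π(a|s) = 0 mirrors the math: such trajectories contribute nothing to the estimate, while the b = 0 branch is left unguarded because coverage makes it unreachable for data actually drawn from b.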