reinforcement-learning, function-approximation

Is this example of off-policy learning correct?


I am reading Sutton and Barto and want to make sure I am clear.

For off-policy learning, can we think of a robot's policy for walking on a particular terrain - say sand - as the target policy, but use the robot's policy for walking on snow as the behaviour policy? That is, we use our experience of walking on snow to approximate the optimal policy for walking on sand?


Solution

  • Your example works, but I think it's a bit restrictive. In an off-policy method the behaviour policy is just a function used to explore the state-action space while another function (the target, as you say) is being optimized. This means that as long as the behaviour policy is defined on the same state-action space as the target policy, it doesn't really matter whether it's a random process or the result of previous learning (e.g. your robot's policy for walking on snow). It explores the state-action space, so it meets the definition. Whether it explores it well or not is a different story.
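
    To make the idea concrete, here is a minimal sketch of off-policy tabular Q-learning. The toy chain environment, the state/action sizes, and the fixed stochastic "snow" behaviour policy are all hypothetical illustrations, not taken from Sutton and Barto; the point is only that actions are selected by one policy while the update bootstraps from the greedy target policy.

        import numpy as np

        n_states, n_actions = 5, 2
        rng = np.random.default_rng(0)

        def step(state, action):
            """Toy chain environment: action 1 moves right, action 0 moves left.
            Reaching the last state gives reward 1 and ends the episode."""
            next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
            reward = 1.0 if next_state == n_states - 1 else 0.0
            done = next_state == n_states - 1
            return next_state, reward, done

        # Behaviour policy: a fixed stochastic policy (think: the controller the
        # robot already learned for snow). It only needs to keep exploring the
        # same state-action space as the target policy.
        def behaviour_policy(state):
            return rng.choice(n_actions, p=[0.4, 0.6])

        Q = np.zeros((n_states, n_actions))   # value estimates for the target policy
        alpha, gamma = 0.1, 0.95

        for episode in range(500):
            state, done = 0, False
            while not done:
                action = behaviour_policy(state)               # act with the behaviour policy
                next_state, reward, done = step(state, action)
                # Q-learning update: the max over next actions bootstraps from the
                # greedy target policy, regardless of which action the behaviour
                # policy actually took.
                td_target = reward + gamma * (0.0 if done else Q[next_state].max())
                Q[state, action] += alpha * (td_target - Q[state, action])
                state = next_state

        greedy_policy = Q.argmax(axis=1)       # the learned target ("sand") policy
        print(greedy_policy)

    The behaviour policy here is never improved; it just generates experience, while the greedy policy with respect to Q is what the updates are aimed at. That is the essential separation your snow/sand example is pointing at.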