
What is a policy in reinforcement learning?


I've seen definitions such as:

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.

But I still don't fully understand it. What exactly is a policy in reinforcement learning?


Solution

  • The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.

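    For example, imagine a world where a robot moves around a room and the task is to get to the target point (x, y), where it gets a reward. Here, a state is the robot's position, an action is a move, and a policy is the rule the robot uses to choose its move in each position.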

    Some policies are clearly better than others, and there are several ways to assess them, notably the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context "time" is better understood as "state"):

    A policy defines the learning agent's way of behaving at a given time.
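
    To make the robot example concrete, here is a minimal sketch, assuming a hypothetical 5x5 room in which states are (x, y) cells and actions are the four moves. The names `SIZE`, `TARGET`, `step`, `random_policy`, `direct_policy` and `run_episode` are made up for illustration and do not come from any library. Each policy is just a function from a state to an action; what differs is how quickly it reaches the target.

```python
import random

# Hypothetical 5x5 room; states are (x, y) cells, the target is one corner.
SIZE = 5
TARGET = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Move one cell, staying inside the room (walls block the move)."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y)


def random_policy(state, rng):
    """Ignore the state and wander at random."""
    return rng.choice(list(ACTIONS))


def direct_policy(state, rng):
    """Head straight for the target: fix x first, then y."""
    if state[0] < TARGET[0]:
        return "right"
    if state[0] > TARGET[0]:
        return "left"
    return "up" if state[1] < TARGET[1] else "down"


def run_episode(policy, start=(0, 0), max_steps=1000, seed=0):
    """Follow the policy from `start`; return steps taken to reach TARGET."""
    rng = random.Random(seed)
    state, steps = start, 0
    while state != TARGET and steps < max_steps:
        state = step(state, policy(state, rng))
        steps += 1
    return steps


if __name__ == "__main__":
    print("direct policy:", run_episode(direct_policy), "steps")
    print("random policy:", run_episode(random_policy), "steps")
```

    Starting from (0, 0), the direct policy reaches the target in 8 steps, which is the Manhattan distance to (4, 4). The random policy can never do better than that and usually does much worse, which is one simple sense in which one policy is "better" than another.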

    Formally

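    More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P is the state-transition probability function, R is the reward function, and γ is a discount factor in [0, 1].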

    Then a policy π is a probability distribution over actions given states: π(a | s) = P[A_t = a | S_t = s]. That is, it gives the probability of each action when the agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
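
    As a rough illustration of "a probability distribution over actions given states", the sketch below stores the probability of each action in each state as a small table. The two states (`open_floor`, `near_wall`), the three actions, and all the numbers are assumptions made up for this example, and `sample_action` and `is_valid_policy` are hypothetical helper names.

```python
import random

# pi[state][action] = probability of choosing `action` in `state`.
# States, actions and numbers are made up for illustration.
pi = {
    "open_floor": {"forward": 0.8, "left": 0.1, "right": 0.1},
    "near_wall": {"forward": 0.0, "left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw one action from the distribution pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def is_valid_policy(policy, tol=1e-9):
    """Each row must be a probability distribution: non-negative, sums to 1."""
    return all(
        min(row.values()) >= 0 and abs(sum(row.values()) - 1.0) <= tol
        for row in policy.values()
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print(is_valid_policy(pi))
    print([sample_action(pi, "near_wall", rng) for _ in range(5)])
```

    A deterministic policy is the special case in which, for every state, one action has probability 1 and all others have probability 0.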

    I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.