reinforcement-learning, policy, deterministic, stable-baselines

Reinforcement learning: deterministic policies perform worse than non-deterministic policies


We have a custom reinforcement learning environment in which we run a PPO agent from Stable Baselines3 on a multi-action selection problem. The agent learns as expected, but when we evaluate the learned policy of the trained agents, they achieve considerably worse results (around 50% lower rewards) with deterministic=True than with deterministic=False. The goal of the study is to find new policies for a real-world problem, so it would be desirable to end up with a deterministic policy, since that is much easier for most people to understand... And it seems counterintuitive that more random actions lead to better performance.

The documentation only says "deterministic (bool) – Whether or not to return deterministic actions." I understand this as follows: deterministic=False means the actions are sampled from the learned distribution, so they retain a certain stochasticity (i.e. one specific state can result in several different actions), while deterministic=True means the action is taken directly from the learned policy (i.e. one specific state always results in one specific action).
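
For reference, this is roughly how we compare the two settings. This is a minimal sketch: CartPole-v1 stands in for our custom environment, and in practice we load an already-trained agent instead of training inline.

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy

    # Stand-in for the custom environment described above.
    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, verbose=0).learn(total_timesteps=20_000)

    # deterministic=True: the same action is always returned for a given observation.
    mean_det, std_det = evaluate_policy(model, env, n_eval_episodes=100,
                                        deterministic=True)

    # deterministic=False: actions are sampled from the learned distribution.
    mean_sto, std_sto = evaluate_policy(model, env, n_eval_episodes=100,
                                        deterministic=False)

    print(f"deterministic=True : {mean_det:.2f} +/- {std_det:.2f}")
    print(f"deterministic=False: {mean_sto:.2f} +/- {std_sto:.2f}")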

The question is: what does it say about the agent and/or the environment when performance is better with deterministic=False than with deterministic=True?


Solution

  • You need to be very careful before making stochastic agents deterministic. This is because they can become unable to achieve certain goals. Consider the following over-simplified example with 8 states:

    |   | # |   | # |   |
    | X |---| G |---| X |
    

    'G' is the goal, 'X' is a pit, '---' is a wall. The two '#' states look identical to the agent, so its policy has to pick the same action in both, and no single deterministic choice works for them. For instance, if the policy at '#' is "move left", then from the two states in the top left the agent will never reach the goal. The strength of stochastic policies is that they avoid this kind of trap and still let the agent find a way to the goal (see the sketch at the end of this answer).

    Additionally, the stochasticity of the actions should decrease over training as the agent becomes more certain which action is correct, but of course there can be states (such as '#' above) where significant uncertainty remains.
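
    To make this concrete, here is a minimal sketch (plain Python, independent of Stable Baselines3 and of your custom environment) that simulates the toy gridworld above. The corner cells head inward, the middle cell steps down into the goal, and the two aliased '#' cells share one policy: either always go left (deterministic) or pick left/right at random (stochastic).

        import random

        def step(pos, action):
            """Move along the 5-cell top row: 'down' at cell 2 reaches the goal,
            'down' at the corner cells 0/4 falls into a pit, and 'down' at the
            '#' cells 1/3 hits the wall below them."""
            if action == "left":
                return max(pos - 1, 0), None
            if action == "right":
                return min(pos + 1, 4), None
            if pos == 2:
                return pos, "goal"
            if pos in (0, 4):
                return pos, "pit"
            return pos, None  # wall below the '#' cells

        def run_episode(hash_policy, start, max_steps=50):
            """Fixed policy everywhere except the aliased '#' cells: corners head
            inward, the middle cell goes down to the goal."""
            pos = start
            for _ in range(max_steps):
                if pos in (1, 3):          # the aliased '#' cells share one policy
                    action = hash_policy()
                elif pos == 0:
                    action = "right"
                elif pos == 4:
                    action = "left"
                else:                      # pos == 2
                    action = "down"
                pos, outcome = step(pos, action)
                if outcome == "goal":
                    return True
                if outcome == "pit":
                    return False
            return False                   # step limit reached without finding the goal

        def always_left():
            return "left"

        def random_left_right():
            return random.choice(["left", "right"])

        for name, hash_policy in [("deterministic", always_left),
                                  ("stochastic", random_left_right)]:
            wins = sum(run_episode(hash_policy, start) for start in range(5))
            print(f"{name:13s}: reaches the goal from {wins} of 5 start states")

    With the deterministic choice the agent reaches the goal from only 3 of the 5 start states (it oscillates forever between the two leftmost cells), while the stochastic policy reaches it from all 5 starts (almost surely, within the step limit).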