machine-learning · reinforcement-learning · q-learning

Deep reinforcement learning - how to deal with boundaries in action space


I've built a custom reinforcement learning environment and agent, similar to a labyrinth game.

In the labyrinth there are 5 possible actions: up, down, left, right, and stay. If a move is blocked, e.g. the agent can't go up, how do people design the environment and agent to simulate that?

To be specific, suppose the agent is at state s0, where taking down, left, or right moves it to some other state with an immediate reward (> 0 if it reaches the exit). One possible approach is that taking up keeps the state at s0 and returns a large negative reward. Ideally the agent would learn this and never go up again from this state.
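
For concreteness, here is a minimal sketch of that penalty-based design (the grid representation, reward values, and the `step` helper are made up for illustration, not my actual code):

    import numpy as np

    # Hypothetical 2D grid: True marks a wall, the agent occupies one cell per state.
    WALL_PENALTY = -10.0   # large negative reward for a blocked move
    EXIT_REWARD = 1.0      # positive reward at the exit
    STEP_REWARD = 0.0      # reward for an ordinary move

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

    def step(grid, exit_pos, pos, action):
        """Return (next_pos, reward) for one transition."""
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        blocked = not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]) or grid[r, c]
        if blocked:
            # Blocked move: the state does not change and the agent is penalised.
            return pos, WALL_PENALTY
        if (r, c) == exit_pos:
            return (r, c), EXIT_REWARD
        return (r, c), STEP_REWARD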

However, my agent does not seem to learn this; it still goes up. Another approach is to hard-code the agent and the environment so that the agent cannot perform the action up when at s0. What I can think of is:

  1. at a state where up is not allowed, look at the Q values of the different actions
  2. pick the action with the largest Q value, excluding up
  3. this way, the agent never performs an invalid action (see the sketch below)
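
A rough sketch of steps 1-3, assuming the Q values for a state are stored in a NumPy array in a fixed action order (the names `masked_greedy_action` and `ACTIONS` are just for illustration):

    import numpy as np

    ACTIONS = ["up", "down", "left", "right", "stay"]

    def masked_greedy_action(q_values, valid_actions):
        """Pick the valid action with the largest Q value.

        q_values: array of shape (len(ACTIONS),) for the current state.
        valid_actions: iterable of action names allowed in this state.
        """
        q = np.full(len(ACTIONS), -np.inf)     # invalid actions can never win the argmax
        for name in valid_actions:
            idx = ACTIONS.index(name)
            q[idx] = q_values[idx]
        return ACTIONS[int(np.argmax(q))]

    # Example: at s0 the move "up" is blocked, so it is simply excluded.
    q_s0 = np.array([0.7, 0.1, -0.2, 0.4, 0.0])
    print(masked_greedy_action(q_s0, ["down", "left", "right", "stay"]))  # -> "right"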

What I'm asking is: is the above approach feasible? Will there be any issues with it? Or is there a better design to deal with boundaries and invalid actions?


Solution

  • I would say this should work (but even better than guessing is trying it out). Other questions would be: what state is your agent able to observe? Are you doing reward clipping?

    On the other hand, if your agent did not learn to avoid running into walls, there might be another problem within your learning routine (maybe there is a bug in the reward function?).

    Hard-coded clipping of actions might lead to the behavior you want to see, but it certainly cuts down the overall performance of your agent.

    What else did you implement? If not done yet, it might be good to take experience replay into account.
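
    In case it is useful, here is a minimal sketch of an experience replay buffer (class and parameter names are illustrative only): transitions are stored as they occur, and the agent trains on random minibatches, which breaks the correlation between consecutive transitions.

        import random
        from collections import deque

        class ReplayBuffer:
            """Minimal experience replay: store transitions, sample random minibatches."""

            def __init__(self, capacity=10_000):
                self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

            def push(self, state, action, reward, next_state, done):
                self.buffer.append((state, action, reward, next_state, done))

            def sample(self, batch_size=32):
                batch = random.sample(self.buffer, batch_size)
                states, actions, rewards, next_states, dones = zip(*batch)
                return states, actions, rewards, next_states, dones

            def __len__(self):
                return len(self.buffer)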