I've built a custom reinforcement learning environment and agent for a game similar to a labyrinth. In the labyrinth there are 5 possible actions: up, down, left, right, and stay. If the agent is blocked, e.g. it can't go up, how do people design the environment and agent to simulate that?
To be specific, the agent is at current state s0, and by definition taking the actions down, left, or right will change the state to some other value with an immediate reward (> 0 if at the exit). One possible approach is that when the agent takes the action up, the state stays at s0 and the reward is a large negative number. Ideally the agent learns this and never goes up again at this state.
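In environment terms, what I mean is roughly the following (a simplified sketch, not my actual code; the grid layout, reward values, and names are just placeholders):

```python
import numpy as np

# 0 = free cell, 1 = wall; layout and reward values are placeholders
GRID = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
])
EXIT = (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def step(state, action):
    """Return (next_state, reward). Blocked moves leave the state unchanged
    and are punished with a large negative reward."""
    row, col = state
    dr, dc = MOVES[action]
    nr, nc = row + dr, col + dc
    blocked = (
        nr < 0 or nr >= GRID.shape[0]
        or nc < 0 or nc >= GRID.shape[1]
        or GRID[nr, nc] == 1
    )
    if blocked:
        return state, -10.0          # stay at s0, large penalty
    next_state = (nr, nc)
    if next_state == EXIT:
        return next_state, 1.0       # positive reward at the exit
    return next_state, -0.01         # small step cost otherwise (my choice, not essential)
```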
However, my agent does not seem to be learning this; it still goes up. Another approach is to hard-code the agent and the environment so that the agent is not able to perform the action up when at s0.
What I can think of is: when up is not allowed, we look at the Q values of the other actions (excluding up) and choose the best action among them.
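In code, I imagine the action selection looking something like this (again just a sketch; `q_values` and `valid_actions` are placeholder names):

```python
import numpy as np

def select_action(q_values, valid_actions, epsilon=0.1):
    """Epsilon-greedy over valid actions only.

    q_values: 1-D array of Q values for all actions at the current state
    valid_actions: list of action indices that are allowed in this state
    """
    if np.random.rand() < epsilon:
        return int(np.random.choice(valid_actions))
    # Mask invalid actions with -inf so argmax can never pick them
    masked = np.full(q_values.shape, -np.inf)
    masked[valid_actions] = q_values[valid_actions]
    return int(np.argmax(masked))

# e.g. at s0, where up (index 0) is blocked:
# action = select_action(q_table[s0], valid_actions=[1, 2, 3, 4])
```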
What I'm asking is: is the above approach feasible? Will there be any issues with it? Or is there a better design for handling boundaries and invalid actions?
I would say this should work (but trying it is even better than guessing). Other questions would be: what state is your agent able to observe? Are you doing reward clipping?
On the other hand, if your agent did not learn to avoid running into walls, there might be another problem in your learning routine (maybe there is a bug in the reward function?).
Hard-coded action clipping might give you the behavior you want to see, but it certainly cuts down the overall performance of your agent.
What else did you implement? If you haven't done so yet, it might be good to take experience replay into account.
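A minimal uniform replay buffer is only a few lines (just a sketch; capacity and batch size are up to you):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample random mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Only call this once len(buffer) >= batch_size
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```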