reinforcement-learning, expert-system, markov-decision-process

How to solve a deterministic MDP in a non-stationary environment


I am searching for a method to solve a Markov Decision Process (MDP). I know that the transition from one state to another is deterministic, but the environment is non-stationary. This means the reward the agent earns can differ when it visits the same state again. Is there an algorithm, like Q-learning or SARSA, that I can use for my problem?


Solution

  • In theory, this is a hard problem: it will be very difficult to find an algorithm with theoretical guarantees of convergence to an (optimal) solution.

    In practice, any standard RL algorithm (like those you named) may work just fine, as long as the environment is not "too non-stationary". By that I mean it will likely be fine in practice if your environment doesn't change too rapidly, suddenly, or often. You may wish to use a slightly higher exploration rate and/or a higher learning rate than you would in a stationary setting, because the agent needs to keep learning, and more recent experiences are more informative than older ones. A minimal sketch of this idea is given after this answer.
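
    To make the suggestion concrete, here is a minimal sketch of tabular Q-learning with a constant (relatively high) learning rate and a constant exploration rate, which together let the value estimates track a drifting reward signal. The toy environment, its constants, and the helper names (`step`, `greedy`, the 5-state chain, the oscillating goal reward) are all hypothetical, invented just for illustration; they are not from the question.

    ```python
    import random

    N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
    ACTIONS = [0, 1]      # 0 = left, 1 = right (deterministic transitions)
    ALPHA = 0.3           # constant learning rate: recent experience dominates
    EPSILON = 0.2         # constant exploration rate: never stop probing for change
    GAMMA = 0.95          # discount factor

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def step(state, action, t):
        """Deterministic transition; the goal reward drifts with time t
        (this drift is the non-stationarity, made up for this example)."""
        next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
        reward = (1.0 + 0.5 * ((t // 500) % 2)) if next_state == N_STATES - 1 else 0.0
        done = next_state == N_STATES - 1
        return next_state, reward, done

    def greedy(state):
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    t = 0
    for episode in range(2000):
        state, done = 0, False
        while not done:
            # epsilon-greedy with a constant epsilon, so exploration never decays
            action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(state)
            next_state, reward, done = step(state, action, t)
            target = reward if done else reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
            # A constant alpha puts exponentially decaying weight on old experience,
            # which is what allows the estimates to follow a drifting reward.
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            state = next_state
            t += 1

    print({s: greedy(s) for s in range(N_STATES - 1)})
    ```

    The key design choice is keeping ALPHA and EPSILON fixed rather than decaying them to zero, as one would in a stationary setting: decaying schedules are what convergence proofs rely on, but they also make the agent stop adapting, which is exactly what you cannot afford when rewards keep changing.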