machine-learning, reinforcement-learning, markov-decision-process, mdp, bandit

Why is the bandit problem also called a one-step/one-state MDP in reinforcement learning?


What do we mean by a one-step/one-state MDP (Markov decision process)?


Solution

  • Consider an MDP with n actions and a single state. Whichever action you take, you stay in the same state, so your choice has no effect on future states. You do, however, receive a reward that depends only on the action you took. To maximise the long-term reward in this setting, all you need to do is work out which of the n available actions is the best.

    This is exactly what the bandit problem is.
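
To make this concrete, here is a minimal Python sketch of such a one-state MDP. The class name `OneStateMDP`, the helper `estimate_best_action`, the mean rewards and the Gaussian noise are all illustrative assumptions, not part of any library. It samples every action equally often and picks the one with the highest average reward, which is the simplest possible approach rather than a proper bandit algorithm.

```python
import random


class OneStateMDP:
    """Hypothetical MDP with a single state (0) and n actions."""

    def __init__(self, mean_rewards, noise_std=0.1, seed=0):
        self.mean_rewards = list(mean_rewards)  # assumed true mean reward per action
        self.noise_std = noise_std              # assumed Gaussian reward noise
        self.rng = random.Random(seed)
        self.state = 0                          # the only state

    def step(self, action):
        # The reward depends only on the action; the state never changes.
        reward = self.rng.gauss(self.mean_rewards[action], self.noise_std)
        return self.state, reward


def estimate_best_action(env, n_actions, pulls_per_action=200):
    """Try every action equally often and return the best sample mean."""
    estimates = []
    for action in range(n_actions):
        total = 0.0
        for _ in range(pulls_per_action):
            next_state, reward = env.step(action)
            total += reward
        estimates.append(total / pulls_per_action)
    best = max(range(n_actions), key=lambda a: estimates[a])
    return best, estimates


env = OneStateMDP([0.1, 0.5, 0.9])
best_action, estimates = estimate_best_action(env, n_actions=3)
print(best_action, [round(e, 2) for e in estimates])
```

Because `step` always returns the same state, there is nothing to plan over: the whole problem reduces to estimating one number per action, as in a multi-armed bandit.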