reinforcement-learning, markov-decision-process

Shaping theorem for MDPs


I need help understanding the shaping theorem for MDPs. Here's the relevant paper: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf It basically says that a Markov decision process with some reward function R(s, a, s') on transitions between states and actions has the same optimal policy as a different Markov decision process whose reward is defined as R'(s, a, s') = R(s, a, s') + gamma*f(s') - f(s), where gamma is the discount factor and f is some real-valued function of states.
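
One consequence of that definition (just the telescoping sum, with f playing the role of the paper's potential Φ, and s_0, ..., s_n denoting a generic trajectory) is that the total shaping contribution along any trajectory collapses to a difference of the endpoint potentials:

$$\sum_{t=0}^{n-1} \gamma^{t}\bigl(\gamma f(s_{t+1}) - f(s_t)\bigr) = \gamma^{n} f(s_n) - f(s_0).$$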

I understand the proof, but it seems like a trivial case where it breaks down is when R(s, a, s') = 0 for all states and actions, and the agent is faced with the path A -> s -> B versus A -> r -> t -> B. With the original MDP we get an expected return of 0 for both paths, so both paths are optimal. But with the potential added to each transition we get gamma^2*f(B) - f(A) for the first path and gamma^3*f(B) - f(A) for the second. So if gamma < 1 and f(B) > 0 (the f(A) terms cancel), the second path is no longer optimal.
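
For concreteness, here is a minimal Python sketch of that comparison; the numeric values of gamma and f, and the helper name shaped_return, are illustrative assumptions, not anything from the paper.

```python
# Compare the shaped returns of the two zero-reward paths A -> s -> B and
# A -> r -> t -> B under the shaping term gamma*f(s') - f(s).
# All numbers below are illustrative assumptions.
gamma = 0.9
f = {"A": 1.0, "s": 2.0, "r": 3.0, "t": 4.0, "B": 5.0}

def shaped_return(path):
    """Discounted sum of gamma*f(s') - f(s) along a path (R is 0 everywhere)."""
    return sum(
        gamma**t * (gamma * f[s_next] - f[s])
        for t, (s, s_next) in enumerate(zip(path, path[1:]))
    )

print(shaped_return(["A", "s", "B"]))       # gamma**2 * f["B"] - f["A"] = 3.05
print(shaped_return(["A", "r", "t", "B"]))  # gamma**3 * f["B"] - f["A"] = 2.645
```

With gamma < 1 and f(B) > 0 the shorter path scores strictly higher, which is exactly the discrepancy described above.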

Am I misunderstanding the theorem, or am I making some other mistake?


Solution

  • You are missing the assumption that for every terminal state s_T and every starting state s_0 we have f(s_T) = f(s_0) = 0. (Note that in the paper there is an assumption that after a terminal state the process always moves to a new starting state, so the potential "wraps around".) The short calculation below applies this to your example.
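
Concretely, with f(B) = 0 the telescoped shaping term for an n-step path from A to B is

$$\gamma^{n} f(B) - f(A) = -f(A)$$

for both n = 2 and n = 3, so the two paths are tied again, exactly as in the unshaped MDP.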