python dynamic-programming markov-chains stochastic mdptoolbox

OverflowError when using the value iteration algorithm with mdptoolbox


I set up a simple MDP for a board that has 4 possible states and 4 possible actions. The board and reward setup looks as follows:

[Image: board and reward setup]

Here S4 is the goal state and S2 is the absorbing state. I defined the transition probability matrices and the reward matrix in the code below to compute the optimal value function for this MDP. When I run the code, I get the error OverflowError: cannot convert float infinity to integer, and I do not understand why.

import mdptoolbox
import numpy as np

transitions = np.array([
    # action 1 (Right)
    [
        [0.1, 0.7, 0.1, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.1, 0.2, 0.2, 0.5],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 2 (Down)
    [
        [0.1, 0.4, 0.4, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.4, 0.1, 0.4, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 3 (Left)
    [
        [0.4, 0.3, 0.2, 0.1],
        [0.2, 0.2, 0.4, 0.2],
        [0.5, 0.1, 0.3, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ],
    # action 4 (Top)
    [
        [0.1, 0.4, 0.4, 0.1],
        [0.3, 0.3, 0.3, 0.1],
        [0.4, 0.1, 0.4, 0.1],
        [0.1,  0.1,  0.1,  0.7]
    ]
])

rewards = np.array([
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [1, 1, 1, 1]
])


vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, discount=0.5)
vi.setVerbose()
vi.run()

print("Value function:")
print(vi.V)


print("Policy function")
print(vi.policy)

If I change discount from 0.5 to 1, it works fine. Why does value iteration fail with a discount of 0.5, or with any other fractional value?

Update: The problem seems to be in my reward matrix; I was not able to write it as I intended. If I change some values in the reward matrix, the error disappears.


Solution

  • It turned out that the reward matrix I had defined was incorrect. For the rewards shown in the picture above, it should have shape (S, A), as described in the documentation: each row corresponds to a state, from S1 to S4, and each column corresponds to an action, from A1 to A4. The new reward matrix looks as follows:

    #(S,A)
    rewards = np.array([
        [-1, -1, -1, -1],
        [-100, -100, -100, -100],
        [-1, -1, -1, -1],
        [1, 1, 1, 1]
    ])
    

    It works fine with this, but I am still not sure what was happening internally that led to the overflow error.
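
One plausible explanation, offered as a sketch rather than a confirmed diagnosis: for discount values below 1, ValueIteration appears to derive an upper bound on the number of iterations from the span (max minus min) of the first Bellman backup. Every row of the original reward matrix contains a 1, so starting from V = 0 the first backup gives the value 1 for every state, and the span is zero. Dividing by that zero span produces infinity, which cannot be rounded to an integer. The helper names below, the formula in iteration_bound, and the constants epsilon = 0.01 and k = 0.6 (one minus the sum of the column minima of the transition matrices above) are assumptions reconstructed for illustration, not the library's API. Only NumPy is used.

```python
import math
import numpy as np

# Reward matrices from the question and the solution, both read as (S, A).
rewards_old = np.array([[-1, -100, -1, 1]] * 3 + [[1, 1, 1, 1]])
rewards_new = np.array([[-1] * 4, [-100] * 4, [-1] * 4, [1] * 4])


def first_backup_span(rewards):
    """Span of the first Bellman backup when V starts at zero."""
    # With V = 0 the discounted term vanishes, so the backup is max_a R[s, a].
    value = rewards.max(axis=1)
    return float(value.max() - value.min())


def iteration_bound(span, discount=0.5, epsilon=0.01, k=0.6):
    """Hypothetical re-creation of the iteration bound (assumed formula)."""
    with np.errstate(divide="ignore"):
        # NumPy floats return inf on division by zero instead of raising.
        ratio = np.float64(epsilon * (1 - discount) / discount) / np.float64(span)
    # For span == 0 this becomes ceil(-inf), which raises OverflowError.
    return math.ceil(math.log(ratio) / math.log(discount * k))


span_old = first_backup_span(rewards_old)
span_new = first_backup_span(rewards_new)
print("span (original rewards):", span_old)
print("span (corrected rewards):", span_new)
print("bound (corrected rewards):", iteration_bound(span_new))

try:
    iteration_bound(span_old)
except OverflowError as exc:
    print("OverflowError:", exc)
```

If this reading is right, it would also account for discount=1 working: the library appears to skip the bound computation in that case and fall back to its default maximum number of iterations.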