I set up a simple MDP for a board that has 4 possible states and 4 possible actions. The board and reward setup looks as follows:
Here S4 is the goal state and S2 is the absorbing state. I have defined the transition probability matrices and the reward matrix in the code below, which I wrote to get the optimal value function for this MDP. But when I run the code, I get this error: OverflowError: cannot convert float infinity to integer. I do not understand the reason for it.
import mdptoolbox
import numpy as np
transitions = np.array([
# action 1 (Right)
[
[0.1, 0.7, 0.1, 0.1],
[0.3, 0.3, 0.3, 0.1],
[0.1, 0.2, 0.2, 0.5],
[0.1, 0.1, 0.1, 0.7]
],
# action 2 (Down)
[
[0.1, 0.4, 0.4, 0.1],
[0.3, 0.3, 0.3, 0.1],
[0.4, 0.1, 0.4, 0.1],
[0.1, 0.1, 0.1, 0.7]
],
# action 3 (Left)
[
[0.4, 0.3, 0.2, 0.1],
[0.2, 0.2, 0.4, 0.2],
[0.5, 0.1, 0.3, 0.1],
[0.1, 0.1, 0.1, 0.7]
],
# action 4 (Top)
[
[0.1, 0.4, 0.4, 0.1],
[0.3, 0.3, 0.3, 0.1],
[0.4, 0.1, 0.4, 0.1],
[0.1, 0.1, 0.1, 0.7]
]
])
# Reward matrix as originally written (4x4)
rewards = np.array([
[-1, -100, -1, 1],
[-1, -100, -1, 1],
[-1, -100, -1, 1],
[1, 1, 1, 1]
])
vi = mdptoolbox.mdp.ValueIteration(transitions, rewards, discount=0.5)
vi.setVerbose()
vi.run()
print("Value function:")
print(vi.V)
print("Policy function")
print(vi.policy)
If I change the value of discount from 0.5 to 1, it works fine. Why does value iteration fail with a discount of 0.5, or any other fractional value?
Update: It looks like there is some issue with my reward matrix. I have not been able to write it as I intended, because if I change some values in the reward matrix, the error disappears.
It turned out that the reward matrix I had defined was incorrect. To match the rewards shown in the picture above, it should have shape (S, A), as given in the documentation, where each row corresponds to a state, from S1 to S4, and each column corresponds to an action, from A1 to A4. The new reward matrix looks as follows:
# Shape (S, A): one row per state, one column per action
rewards = np.array([
[-1, -1, -1, -1],
[-100, -100, -100, -100],
[-1, -1, -1, -1],
[1, 1, 1, 1]
])
It works fine with this. But I am still not sure what was happening internally that led to the overflow error.
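One possible explanation, sketched here in plain NumPy rather than taken from the toolbox itself. It assumes that a 4x4 reward matrix is read as (S, A), that the value function starts at zero, and that for discount < 1 the maximum number of iterations is bounded by a formula that divides by the span (max minus min) of the first Bellman backup. The helper names `first_backup_span` and `iteration_bound`, the default `epsilon=0.01`, and the constant `k = 0.6` (1 minus the sum of the smallest entry in each column of the transition matrices, 0.1 each here) are assumptions made for illustration, not the library's API. Under those assumptions, the original matrix offers a best immediate reward of 1 in every state, so the span is 0 and the bound becomes infinite, which cannot be converted to an integer.

```python
import math
import numpy as np

# Original 4x4 matrix from the question, read as (S, A):
# row = state, column = action.
rewards_original = np.array([
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [-1, -100, -1, 1],
    [1, 1, 1, 1],
], dtype=float)

# Corrected matrix from the question, also (S, A).
rewards_fixed = np.array([
    [-1, -1, -1, -1],
    [-100, -100, -100, -100],
    [-1, -1, -1, -1],
    [1, 1, 1, 1],
], dtype=float)


def first_backup_span(rewards):
    """Span (max - min) of the first Bellman backup starting from V = 0.

    With V = 0 the discounted future term vanishes, so the backup is just
    the best immediate reward available in each state.
    """
    value = rewards.max(axis=1)
    return float(value.max() - value.min())


def iteration_bound(span, discount, k, epsilon=0.01):
    """Assumed bound: log((eps * (1 - d) / d) / span) / log(d * k)."""
    if span == 0.0:
        ratio = math.inf  # dividing by a zero span
    else:
        ratio = (epsilon * (1 - discount) / discount) / span
    return math.log(ratio) / math.log(discount * k)


for name, r in [("original", rewards_original), ("fixed", rewards_fixed)]:
    span = first_backup_span(r)
    bound = iteration_bound(span, discount=0.5, k=0.6)
    try:
        print(name, "span:", span, "max_iter:", int(math.ceil(bound)))
    except OverflowError as exc:
        print(name, "span:", span, "error:", exc)
```

If this reading is right, it would also explain why discount = 1 works: the bound is presumably only computed when the discount is below 1, so the division by the span never happens in that case.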