I have set up a Q-learning problem in R, and would like some help checking that my framing of the problem is theoretically correct.
Problem structure

For this problem, the environment consists of 10 possible states. In each state, the agent can choose from 11 possible actions (the same actions are available regardless of the state the agent is in). Depending on the state the agent is in and the action it then takes, there is a unique distribution over the next state, i.e. the transition probabilities to any next state depend only on the previous state and the action then taken.
Each episode has 9 iterations i.e. the agent can take 9 actions and make 9 transitions before a new episode begins. In each episode, the agent will begin in state 1.
In each episode, after each of the agent's 9 actions, the agent receives a reward that depends on its immediately previous state, its immediately previous action, and the state it lands on, i.e. the reward structure depends on a state-action-state triplet (of which there will be 9 per episode).
The transition probability matrix of the agent is static, and so is the reward matrix.
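To make the structure concrete, here is a minimal sketch of how such an environment could be represented in R; the array names and the random fill below are placeholders rather than my actual matrices.

```r
n_states  <- 10   # positions the agent can be in
n_actions <- 11   # actions available in every state
n_steps   <- 9    # actions/transitions per episode

# P[s, a, s2]: probability of landing in state s2 after taking action a in state s.
# Filled randomly here and normalised so each (s, a) slice sums to 1.
P <- array(runif(n_states * n_actions * n_states),
           dim = c(n_states, n_actions, n_states))
P <- sweep(P, c(1, 2), apply(P, c(1, 2), sum), "/")

# R_mat[s, a, s2]: reward for the state-action-state triplet; static, like P.
R_mat <- array(rnorm(n_states * n_actions * n_states),
               dim = c(n_states, n_actions, n_states))
```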
I have set up two learning algorithms. In the first, the Q-matrix is updated after each action within an episode. In the second, the Q-matrix is updated only at the end of each episode. Both use epsilon-greedy action selection.
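For reference, the first (per-step) variant looks roughly like the sketch below. It reuses the placeholder P and R_mat arrays from above, and the parameter values are illustrative, not my exact settings.

```r
alpha   <- 0.1    # learning rate (illustrative value)
gamma   <- 0.9    # discount factor (illustrative value)
epsilon <- 0.1    # exploration rate (illustrative value)

Q <- matrix(0, nrow = n_states, ncol = n_actions)

for (episode in 1:1000) {
  s <- 1                                   # every episode starts in state 1
  for (step in 1:n_steps) {
    # epsilon-greedy action selection
    if (runif(1) < epsilon) {
      a <- sample(n_actions, 1)
    } else {
      a <- which.max(Q[s, ])
    }
    s2 <- sample(n_states, 1, prob = P[s, a, ])   # sample the next state
    r  <- R_mat[s, a, s2]                         # triplet reward
    # Q-matrix update after every action (first algorithm)
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s2, ]) - Q[s, a])
    s <- s2
  }
}
```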
The big problem is that my agent is not learning: it earns less and less reward over time. I have looked into other potential problems, such as simple calculation errors or bugs in the code, but I think the issue lies with the conceptual structure of my Q-learning problem.
Questions
There is a problem in your definition of the problem. $Q(s,a)$ is the expected utility of taking action $a$ in state $s$ and following the optimal policy afterwards.
Expected rewards are different after taking 1, 2 or 9 steps. That means that the reward of being in state $s_0$ and taking action $a_0$ at step 0 is different from what you get at step 9.
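Written out, with a finite horizon the action value carries a step index $t$; this is just the standard finite-horizon recursion, nothing specific to your code:

$$Q_t(s,a) = \mathbb{E}\big[\, r(s, a, s') + \gamma \max_{a'} Q_{t+1}(s', a') \,\big],$$

with $Q_t \equiv 0$ once the episode has ended, so the same pair $(s,a)$ generally has a different value early in the episode than at the final step.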
The "state" as you have defined does not ensure you any reward, it is the combination of "state+step" what does it.
To model the problem adequately, you should reframe it and take the state to be the combination of 'position' and 'step'. You will then have 90 states (10 positions × 9 steps).
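As a sketch in R (the helper name below is just illustrative), you can map each (position, step) pair to a single row index and keep the rest of your update code unchanged:

```r
n_pos     <- 10
n_steps   <- 9
n_actions <- 11

# Hypothetical helper: map a (position, step) pair to one of the 90 state indices.
state_index <- function(pos, step) {
  (step - 1) * n_pos + pos
}

Q <- matrix(0, nrow = n_pos * n_steps, ncol = n_actions)

# Inside the episode loop you would then update
# Q[state_index(pos, step), a] rather than Q[pos, a],
# so the value of an action is allowed to differ by step.
```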