machine-learning reinforcement-learning q-learning

Iterations and reward in Q-learning


In Q-learning, the agents take actions until they reach their goal, and the algorithm is run many times until it converges. For example, suppose the goal is to obtain maximum throughput until the end of the simulation time. The simulation time is divided into n equal periods T, and the reward varies over time, so the agents update their states n times, at the beginning of each period. In this case, is n considered the number of steps or iterations? In addition, is the Q-value updated after executing the selected action, or before the execution (using the reward function, which is an approximation of the real reward)?


Solution

  • First, you should know that in reinforcement learning there exist two kinds of tasks: those in which the agent-environment interaction naturally breaks down into a sequence of separate episodes (episodic tasks), and those in which it does not (continuing tasks) [Sutton book ref.].

    The agent's goal is to maximize the total amount of reward it receives (in a simulation or in a real environment). This means maximizing not immediate reward, but cumulative reward in the long run (the return formula at the end of this answer makes this precise).

    In the case of an episodic task, each episode often has a different duration (e.g., if each episode is a chess game, each game usually finishes after a different number of moves).

    The reward function doesn't change, but the reward received by the agent changes depending on the actions it takes. In the Q-learning algorithm, the agent updates the Q-function after each step (not at the beginning of each period/episode).

    According to your definition, n is considered the number of steps per episode (which, as stated above, can vary from one episode to another). The total number of steps is the sum of n over all episodes. The term 'iterations' sometimes refers to the number of episodes in papers/books, so it's necessary to know the context.

    The update of the Q-function is performed after executing the selected action. Notice that the agent needs to execute the current action in order to observe the reward and the next state (see the update rule and the sketch at the end of this answer).

    The reward function is not an approximation of the real reward; there is no separate 'real' reward. The reward function is designed by the user to 'tell' the agent what the goal is. More on this topic in the Sutton and Barto book, Section 3.2, Goals and Rewards.
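
    To make the 'cumulative reward' point concrete, the quantity the agent tries to maximize is usually written as the (possibly discounted) return, in the standard Sutton and Barto notation:

    $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

    where $0 \le \gamma \le 1$ is the discount factor ($\gamma < 1$ is required in continuing tasks); in an episodic task the sum simply stops at the final time step $T$ of the episode.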
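
    Regarding the last two points, the one-step Q-learning update, applied right after the agent has executed action $a_t$ and observed the reward $r_{t+1}$ and the next state $s_{t+1}$, is

    $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right].$$

    Below is a minimal Python sketch of the resulting loop over episodes and steps. The tiny chain environment, the hyperparameters, and the function names (`reset`, `step`) are hypothetical and only serve to show where in the loop the update happens:

    ```python
    import random
    from collections import defaultdict

    # Hypothetical toy environment: a chain of 5 states; action 0 moves left,
    # action 1 moves right; reaching the rightmost state ends the episode with reward 1.
    N_STATES, N_ACTIONS = 5, 2

    def reset():
        return 0  # start every episode in the leftmost state

    def step(state, action):
        next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
        done = next_state == N_STATES - 1
        reward = 1.0 if done else 0.0
        return next_state, reward, done

    alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative hyperparameters
    Q = defaultdict(float)                  # Q[(state, action)], initialized to 0

    for episode in range(200):              # loop over episodes
        state = reset()
        done = False
        while not done:                     # each pass of this loop is one step
            # epsilon-greedy action selection from the current Q-values
            if random.random() < epsilon:
                action = random.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: Q[(state, a)])

            # the action is executed first, then the reward and next state are observed
            next_state, reward, done = step(state, action)

            # Q-learning update, performed after every step (bootstrap is 0 at terminal states)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(N_ACTIONS))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    ```

    Note that the update needs the observed reward and next state, so it can only be computed after the action has been executed; the only thing that happens at the beginning of an episode is resetting the environment.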