Tags: machine-learning, reinforcement-learning, unsupervised-learning

Supervised learning vs. offline (batch) reinforcement learning


Most materials I can find (e.g., David Silver's online course) discuss the relationship between supervised learning and reinforcement learning. However, these are really comparisons between supervised learning and online reinforcement learning, where the agent interacts with the environment (or simulates interactions) to obtain feedback, given only limited knowledge of the underlying dynamics.

I am more curious about offline (batch) reinforcement learning, where the dataset (collected learning experiences) is given a priori. What are the differences compared to supervised learning then, and what similarities might they share?


Solution

  • I am more curious about the offline (batch) setting for reinforcement learning, where the dataset (collected learning experiences) is given a priori. What are the differences compared to supervised learning then, and what similarities might they share?

    In the online setting, the fundamental difference between supervised learning and reinforcement learning is the need for exploration and the exploration/exploitation trade-off in RL. However, even in the offline setting there are several differences that make RL a more difficult (and richer) problem than supervised learning. A few differences off the top of my head:

    1. In reinforcement learning the agent receives what is termed "evaluative feedback" in the form of a scalar reward, which tells the agent something about the quality of the action it took, but not whether that action was optimal. Contrast this with supervised learning, where the agent receives what is termed "instructive feedback": for each prediction the learner makes, it receives feedback (a label) that says what the optimal action/prediction would have been. The difference between instructive and evaluative feedback is detailed in the first chapters of Sutton and Barto's book. Essentially, reinforcement learning is optimization with sparse labels: for some actions you may not get any feedback at all, and in other cases the feedback may be delayed, which creates the credit-assignment problem (see the first sketch after this list).

    2. In reinforcement learning there is a temporal aspect: the goal is to find an optimal policy that maps states to actions over some horizon (a number of time steps). If the horizon is T=1, it is just a one-off prediction problem, as in supervised learning; if T>1, it is a sequential optimization problem in which you have to find the optimal action not just in a single state but in many states, and this is further complicated by the fact that the actions taken in one state influence which actions should be taken in future states (i.e., the problem is dynamic; see the second sketch after this list).

    3. In supervised learning there is a fixed i.i.d. distribution from which the data points are drawn (at least, this is the common assumption). In RL there is no fixed distribution: the data distribution depends on the policy that is followed, and the resulting samples are typically not i.i.d. but temporally correlated (see the third sketch after this list).
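
    To make the feedback distinction in point 1 concrete, here is a minimal sketch (all numbers and names are hypothetical, not from the original question). The supervised learner is told the correct answer; the RL agent only learns a scalar score for the one action it actually tried:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Instructive feedback (supervised): the label reveals the optimal prediction.
    prediction = 0
    label = 2                      # "the right answer was class 2"
    mistake = prediction != label  # the learner knows exactly what to correct

    # Evaluative feedback (RL): only a scalar reward for the action actually taken.
    action = 0
    reward = rng.normal(loc=-1.0)  # "that choice was worth about -1" ...
    # ... the agent never sees what the other actions would have returned,
    # nor whether a better action even existed.
    print(mistake, reward)
    ```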
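
    For point 2, a toy two-step problem (hypothetical rewards) shows why T>1 changes the nature of the problem: the action that looks best for a single step is not the best over the horizon, because the first action determines what is reachable afterwards:

    ```python
    # From state s0: action 'a' pays 1 now but leads to a dead end (0 afterwards);
    # action 'b' pays 0 now but leads to a state that pays 10 afterwards.
    immediate = {"a": 1.0, "b": 0.0}
    future = {"a": 0.0, "b": 10.0}

    best_for_T1 = max(immediate, key=immediate.get)                       # 'a'
    best_for_T2 = max(immediate, key=lambda x: immediate[x] + future[x])  # 'b'
    print(best_for_T1, best_for_T2)
    ```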
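
    For point 3, the sketch below (a hypothetical 3-state chain, not from the original question) estimates the state-visitation distribution under two different policies; the distribution the learner sees is not fixed but induced by the policy it follows:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def step(s, a):
        """Deterministic chain: action 1 moves right, action 0 moves left."""
        return min(s + 1, 2) if a == 1 else max(s - 1, 0)

    def visitation(policy, episodes=500, horizon=10):
        counts = np.zeros(3)
        for _ in range(episodes):
            s = 0
            for _ in range(horizon):
                s = step(s, policy(s))
                counts[s] += 1
        return counts / counts.sum()

    print(visitation(lambda s: 0))                # all mass stays on state 0
    print(visitation(lambda s: rng.integers(2)))  # mass spread across the chain
    ```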

    Hence, RL is a much richer problem than supervised learning. In fact, it is possible to convert any supervised learning task into a reinforcement learning task: the loss function of the supervised task can be used to define a reward function, with smaller losses mapping to larger rewards (a sketch of this wrapping follows below). That said, it is not clear why one would want to do this, since it turns the supervised problem into a more difficult reinforcement learning problem. Reinforcement learning makes fewer assumptions than supervised learning and is therefore, in general, a harder problem to solve. The converse does not hold: it is in general not possible to convert a reinforcement learning problem into a supervised learning problem.
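
    As a sketch of that conversion (the dataset, shapes, and the 0-1 loss here are illustrative assumptions, not a fixed recipe): wrap a labelled dataset as a one-step-per-episode RL problem whose reward is derived from the supervised loss, so that smaller loss means larger reward:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical supervised dataset: 100 contexts with labels from 3 classes.
    X = rng.normal(size=(100, 4))
    y = rng.integers(0, 3, size=100)

    def env_step(i, action):
        """One-step episode: the 'state' is X[i]; the reward maps the 0-1 loss
        of the chosen action to a score (smaller loss -> larger reward)."""
        loss = float(action != y[i])
        return 1.0 - loss

    # Unlike the supervised learner, the agent only observes the reward of the
    # action it tried -- the label itself stays hidden inside the environment.
    print(env_step(0, action=1))
    ```

    This also makes it clear why the conversion makes the problem harder: the full label information available to the supervised learner is reduced to evaluative feedback about a single chosen action.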