As far as I know, for a specific policy \pi, temporal difference learning let us compute the expected value following that policy \pi, but what's the meaning of knowing a specific policy?
Shouldn't we try finding the optimal policy for a given environment? What's the point of doing a specific \pi using temporal difference learning at all?
As you said, only finding the value function for a given policy is not very useful in the general case, where the goal is finding an optimal policy. However, several classical algorithms such as SARSA
or Q-learning
, can ve viewed as a special case of generalized policy iteration
, where the most difficult part is finding the value function of a policy. Once you know the value function, it's easy to find a better policy, then find again the value function of the recently computed policy, and so on. This process, given some conditions, converges to the optimal policy.
In summary, temporal difference learning
is a key step in other algorthims that allow to find an optimal policy.