I am having trouble figuring out why you need to revisit all the time steps of an episode on each horizon advance in the on-line version of the λ-return algorithm from the book:
Reinforcement Learning: An Introduction, 2nd Edition, Chapter 12, Sutton & Barto
Here, all the sequences of weight vectors w_1^h, w_2^h, ..., w_h^h for each horizon h start from w_0 (the weights from the end of the previous episode). However, they do not seem to depend on the returns/weights from the previous horizons and can be calculated independently. It appears to me that the per-horizon sequences are shown just for clarification, and that you can calculate them only for the final horizon h = T, at episode termination. That would be the same as what is done in the off-line version of the algorithm, where the update rule is

w_{t+1} = w_t + α [G_t^λ − v̂(S_t, w_t)] ∇v̂(S_t, w_t),  for t = 0, ..., T−1.
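In code, my reading of this is roughly the following (linear function approximation; every value estimate inside the targets uses w_0, the weights from the end of the previous episode; the names are only illustrative, not the book's pseudocode):

```python
import numpy as np

def final_horizon_update(w0, states, rewards, alpha, gamma, lam, x):
    """Apply the λ-return updates once, at episode termination (h = T).

    states  : S_0, ..., S_T  (x(S_T) is all zeros for the terminal state)
    rewards : R_1, ..., R_T
    x(s)    : feature vector of state s
    All bootstrapping inside the targets uses w0, the weights from the end
    of the previous episode -- this is how I read the algorithm.
    """
    T = len(rewards)
    v = [w0 @ x(s) for s in states]            # bootstrap values, all from w0

    def lambda_return(t):
        # G_t^λ: λ-weighted mixture of the n-step returns up to termination
        G = 0.0
        for n in range(1, T - t + 1):
            G_n = sum(gamma**k * rewards[t + k] for k in range(n))
            G_n += gamma**n * v[t + n]
            coef = (1 - lam) * lam**(n - 1) if t + n < T else lam**(n - 1)
            G += coef * G_n
        return G

    w = w0.copy()
    for t in range(T):
        # w_{t+1} = w_t + α [G_t^λ - v(S_t, w_t)] ∇v(S_t, w_t)
        w = w + alpha * (lambda_return(t) - w @ x(states[t])) * x(states[t])
    return w
```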
Not surprisingly, I get exactly the same results for the two algorithms on the 19-state Random Walk example.
The book mentions that the on-line version should perform a little better, and that in that case it should produce the same results as True Online TD(λ). When I implement the latter, it really does outperform the off-line version, but I cannot see how to get that improvement from the simple (and slow) on-line version.
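For reference, the True Online TD(λ) update I implemented is essentially the following (linear case, dutch traces; the episode interface and names are mine):

```python
import numpy as np

def true_online_td_lambda_episode(w, transitions, alpha, gamma, lam):
    """One episode of True Online TD(λ) with linear function approximation.

    transitions yields (x_t, R_{t+1}, x_{t+1}) tuples, where x is the feature
    vector of the state (all zeros for the terminal state).
    """
    z = np.zeros_like(w)            # dutch-style eligibility trace
    v_old = 0.0
    for x_t, reward, x_next in transitions:
        v = w @ x_t
        v_next = w @ x_next
        delta = reward + gamma * v_next - v
        # dutch trace update
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x_t)) * x_t
        # true online weight update
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x_t
        v_old = v_next
    return w
```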
Any suggestions will be appreciated.
Thank you
"It appears to me that the per-horizon sequences are shown just for clarification, and that you can calculate them only for the final horizon h = T, at episode termination."
This is not true. The whole point of the online λ-return algorithm is that it is online: it makes updates during the episode. This is crucial in the control setting, where the actions selected depend on the current value estimates. Even in the prediction setting, the weight updates made for earlier horizons have an effect.
This is because the final weight vector from the previous horizon is used when calculating the update targets, the truncated λ-returns. So w_1^1 is used to calculate the targets for h = 2, w_2^2 is used to calculate the targets for h = 3, and so on. Because the targets are calculated with the latest weight vectors, they are generally more accurate.
Even in the prediction setting, the online λ-return algorithm outperforms the off-line version because the targets it uses are better.
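To make this concrete, here is a minimal sketch of one episode of the online λ-return algorithm under linear function approximation, following one reading of the book's bookkeeping (the names and data layout are mine, not the book's pseudocode). Note how the targets for horizon h bootstrap with the weight vectors that were current at each earlier time step, including w_{h-1}^{h-1} at the horizon itself:

```python
import numpy as np

def online_lambda_return_episode(w0, states, rewards, alpha, gamma, lam, x):
    """One episode of the online λ-return algorithm (linear case).

    states  : S_0, ..., S_T  (x(S_T) is all zeros for the terminal state)
    rewards : R_1, ..., R_T
    x(s)    : feature vector of state s
    """
    T = len(rewards)
    w_hist = [w0]          # w_hist[k] = w_k^k, the weights current at time k

    def n_step_return(t, n):
        # G_{t:t+n}: rewards plus a bootstrap using the weights current at time t+n-1
        G = sum(gamma**k * rewards[t + k] for k in range(n))
        return G + gamma**n * (w_hist[t + n - 1] @ x(states[t + n]))

    def truncated_lambda_return(t, h):
        # G_{t:h}^λ: λ-weighted mixture of the n-step returns up to horizon h
        G = 0.0
        for n in range(1, h - t + 1):
            coef = (1 - lam) * lam**(n - 1) if t + n < h else lam**(n - 1)
            G += coef * n_step_return(t, n)
        return G

    for h in range(1, T + 1):          # the horizon advances as data arrives
        w = w0.copy()                  # every horizon restarts from w_0...
        for t in range(h):             # ...and redoes all updates up to the horizon
            target = truncated_lambda_return(t, h)
            w = w + alpha * (target - w @ x(states[t])) * x(states[t])
        w_hist.append(w)               # w_h^h becomes the current weight vector,
                                       # used to bootstrap the targets of horizon h+1
    return w_hist[-1]                  # final weights w_T^T
```

If instead every target bootstraps with w_0, as in the sketch in the question, the final horizon h = T collapses to the off-line update, which would explain why the two implementations gave identical results on the Random Walk.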