I'm trying to implement linear gradient-descent Sarsa based on Sutton & Barto's book; see the algorithm in the picture below.
However, I'm struggling to understand something in the algorithm:
I hope someone can help clarify this for me :)
w is the weight vector for the function approximator. The function you are approximating is Q(s, a), the action-value function, which tells you the value of taking an action in a state. It's up to you to define the weights, but yes, you're right, you need to think about how you want to represent the actions in the weights. One way might be to define a set of state features, then instantiate them once per action (multiple separate w vectors). For convenience's sake, you could then concatenate these vectors into one big w, because you know that only the chunks of the weight vector activated by a state-action pair's features would be updated. Having multiple disjoint sets of state features per action means a lot of weights if the action space is large, though, so you might instead compress multiple actions into different scalar values of a single weight. If the true Q values are close between actions, you'll perform just as well, and you'll actually learn faster because there are fewer weights to optimize. The representation is flexible. It's up to you!
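To make the "one chunk of weights per action" idea concrete, here's a minimal sketch (my own illustration, not from the book) of a linear approximator with a separate weight chunk per action, concatenated into one big w. The feature size, the action count, and the existence of a state feature vector phi(s) are all assumptions for the example:

```python
import numpy as np

n_features = 8      # size of the state feature vector phi(s) (assumed)
n_actions = 4       # size of the discrete action space (assumed)

# One big weight vector: n_actions disjoint chunks of n_features weights each.
w = np.zeros(n_features * n_actions)

def x(phi_s, a):
    """Feature vector for a state-action pair: phi(s) placed in the
    chunk belonging to action a, zeros everywhere else."""
    feats = np.zeros_like(w)
    feats[a * n_features:(a + 1) * n_features] = phi_s
    return feats

def q_hat(phi_s, a):
    """Linear approximation: Q(s, a) ~ w . x(s, a)."""
    return w @ x(phi_s, a)
```

Because x(s, a) is zero outside action a's chunk, a gradient step like `w += alpha * delta * x(phi_s, a)` only touches that chunk, which is exactly the "only the activated chunks get updated" point above.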
I encourage you to look at the algorithm as written in the second edition of the book (drafts are available from the authors' sites); the notation is clearer. The algorithm you posted is actually a λ-return method, which you can read about in Chapter 12 (z is an eligibility trace; it has the same dimension as w and isn't critical to the question you are asking). Episodic semi-gradient Sarsa, the same algorithm minus some bells and whistles, appears in Section 10.1.
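For reference, here's a rough sketch of how an episode of semi-gradient Sarsa (in the spirit of Section 10.1) could look with the representation above. It reuses the `w`, `x`, and `q_hat` definitions from the previous snippet; the `env` object with its `reset()`/`step()` interface and the epsilon-greedy helper are assumptions I'm making for illustration. The version in your picture would additionally maintain the trace z.

```python
alpha = 0.1     # step size
gamma = 1.0     # discount factor (episodic task)
epsilon = 0.1   # exploration rate

def epsilon_greedy(phi_s):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax([q_hat(phi_s, a) for a in range(n_actions)]))

def run_episode(env):
    """One episode of semi-gradient Sarsa (sketch; env interface is assumed)."""
    global w
    phi_s = env.reset()                      # assumed to return phi(S)
    a = epsilon_greedy(phi_s)
    while True:
        phi_s_next, r, done = env.step(a)    # assumed to return phi(S'), R, done
        if done:
            # Terminal update: the target is just R.
            w += alpha * (r - q_hat(phi_s, a)) * x(phi_s, a)
            return
        a_next = epsilon_greedy(phi_s_next)
        # Semi-gradient Sarsa target: R + gamma * q_hat(S', A')
        delta = r + gamma * q_hat(phi_s_next, a_next) - q_hat(phi_s, a)
        w += alpha * delta * x(phi_s, a)
        phi_s, a = phi_s_next, a_next
```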