You have a policy, which is effectively a probability distribution over actions for all my states. A value function determines the best course of action to achieve the highest reward.
So I have a random policy. I compute its value function. I update my policy with a new distribution according to that value function. Then I get the value function of this new, updated policy and re-evaluate once again.
From this definition I have trouble understanding how value iteration would then work, and I think this comes from a misunderstanding of what a value function is.
Is a value function not the best course of action, but rather just a course of action that will determine a reward? Does policy iteration simply look for a value function that provides a higher reward than its current one and then update immediately, which gives a new distribution of actions for my states (a new policy), and then do this iteratively for every one of its states until convergence?
In that case, is value iteration looking for the single best possible action at every state in the sequence (as opposed to one that is just better)? I am struggling here to understand why one wouldn't update the policy.
Is my understanding of the policy, the value function, etc. correct?
I think my understanding of policy is certainly incorrect: if a policy is simply a distribution over all the possible actions for my states, then I'm not entirely sure what "updating" it means. If it is simply updating the distribution, then how exactly does value iteration even work when it starts from a "worse" distribution, given that the policy is random when initialized? I can't understand how these two approaches would both converge and end up equally good.
"You have a policy, which is effectively a probability distribution over actions for all my states."
Yes
"A value function determines the best course of action to achieve the highest reward."
No. A value function tells you, for a given policy, the expected cumulative reward of taking action $a$ in state $s$.
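In symbols, for the action-value case (the standard textbook definition, where $\gamma$ is the discount factor and the expectation is over trajectories generated by following the policy $\pi$ after the first action):

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\, A_t = a\right]$$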
Forget about value iteration and policy iteration for a moment. The two things you should try to understand are policy evaluation and policy improvement.
In policy evaluation, you figure out the state-value function for a given policy (which tells you your expected cumulative reward for being in a state and then acting according to the policy thereafter). For every state, you look at all the neighboring states and calculate the expected value of the policy in that state (a sum of the neighbors' values, weighted by the policy's action probabilities and the transition probabilities, plus the rewards along the way). You have to loop through all the states doing this over and over. This converges in the limit to the true state-value function for that policy (in practice, you stop when the changes become small).
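Here's a rough sketch of that evaluation sweep. It assumes a small tabular MDP stored as a hypothetical transition model `P[s][a] = [(prob, next_state, reward), ...]`, a stochastic policy `policy[s][a]` giving the probability of action `a` in state `s`, and a discount factor `gamma`; none of these names come from the original question, they're just one convenient representation.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iteratively approximate the state-value function of a fixed policy."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Expected one-step lookahead under the policy: weight each
            # action by its policy probability and each successor by its
            # transition probability, adding reward plus discounted value.
            v_new = sum(
                policy[s][a] * prob * (reward + gamma * V[s_next])
                for a in range(len(P[s]))
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:  # stop once a full sweep barely changes anything
            return V
```

Updating `V[s]` in place (rather than from a frozen copy of the previous sweep) still converges and usually does so a little faster; either variant matches the description above.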
In policy improvement, you examine a state-value function and ask, in every state, what's the best action I could take according to the value function? The action the current policy takes might not lead to the highest value neighbor. If it doesn't, we can trivially make a better policy by acting in a way to reach a better neighbor. The new policy that results is better (or at worst, the same).
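Policy improvement is then just a one-step greedy lookahead. A sketch, using the same hypothetical `P` and `gamma` as above:

```python
import numpy as np

def policy_improvement(P, V, gamma=0.9):
    """Return a policy that is greedy with respect to the value function V."""
    new_policy = []
    for s in range(len(P)):
        n_actions = len(P[s])
        # Value of taking each action in s and then following the policy
        # whose values are (approximately) V.
        q = np.array([
            sum(prob * (reward + gamma * V[s_next])
                for prob, s_next, reward in P[s][a])
            for a in range(n_actions)
        ])
        greedy = np.zeros(n_actions)
        greedy[np.argmax(q)] = 1.0  # put all probability on the best action
        new_policy.append(greedy)
    return new_policy
```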
Policy iteration is just repeated policy evaluation and policy improvement.
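With the two hypothetical helpers sketched above, that loop might look like this:

```python
import numpy as np

def policy_iteration(P, gamma=0.9):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary policy; here, uniform random over actions.
    policy = [np.ones(len(P[s])) / len(P[s]) for s in range(len(P))]
    while True:
        V = policy_evaluation(P, policy, gamma)       # evaluate the current policy
        new_policy = policy_improvement(P, V, gamma)  # act greedily w.r.t. V
        if all(np.array_equal(p, q) for p, q in zip(policy, new_policy)):
            return new_policy, V                      # no change: policy is stable
        policy = new_policy
```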
In value iteration, you truncate the evaluation step. So instead of following the full evaluation process to convergence, you do one step of looking at neighboring states, and instead of taking an expectation under the policy, you do policy improvement immediately by storing the maximum neighboring value. Evaluation and improvement are smudged together. You repeat this smudged step over and over, until the change in the values is very small. The principal idea of why this converges is the same; you're evaluating the policy, then improving it until it can no longer be improved.
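In code, that collapse amounts to replacing the policy-weighted average with a max over actions, and only reading off the greedy policy at the end. A sketch with the same hypothetical `P` and `gamma`:

```python
import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Sweep with a max backup until values stop changing, then extract the greedy policy."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Truncated evaluation and improvement in one step: keep the best
            # one-step lookahead value rather than a policy-weighted average.
            v_new = max(
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # The greedy action in each state, recovered from the converged values.
    policy = [
        int(np.argmax([
            sum(prob * (reward + gamma * V[s_next])
                for prob, s_next, reward in P[s][a])
            for a in range(len(P[s]))
        ]))
        for s in range(n_states)
    ]
    return V, policy
```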
There are a bunch of ways you might go about understanding policy and value iteration. You can read more about this evaluation-and-improvement framing in Reinforcement Learning: An Introduction, 2nd ed. I've left out some important details about discounting, but hopefully the overall picture is clearer now.