reinforcement-learning policy-gradient-descent

How does score function help in policy gradient?


I'm trying to learn policy gradient methods for reinforcement learning, but I'm stuck at the score function part.

When searching for the maximum or minimum points of a function, we take the derivative, set it to zero, and then look for the points that satisfy this equation.

In policy gradient methods, we do it by taking the gradient of the expected return over trajectories, and we get:

grad_theta J(theta) = E_{tau ~ p(tau;theta)}[ grad_theta( log p(tau;theta) ) * R(tau) ]

What I can't see is how this gradient of the log policy shifts the distribution (through its parameters θ) so as to increase the scores of its samples, mathematically. Don't we look for points that make this objective function's gradient zero, as I explained above?


Solution

  • What you want to maximize is

    J(theta) = int( p(tau;theta)*R(tau) dtau )
    

    The integral is over tau (the trajectory) and p(tau;theta) is its probability (i.e., of seeing the sequence state, action, next state, next action, ...), which depends on both the dynamics of the environment and the policy (parameterized by theta). Formally

    p(tau;theta) = p(s_0)*pi(a_0|s_0;theta)*P(s_1|s_0,a_0)*pi(a_1|s_1;theta)*P(s_2|s_1,a_1)*...
    

    where P(s'|s,a) is the transition probability given by the dynamics.

    Since we cannot control the dynamics, only the policy, we optimize with respect to its parameters, and we do so by gradient ascent, i.e. we move in the direction given by the gradient. The equation you quote comes from the log-trick df(x)/dx = f(x) * d(log f(x))/dx. Note also that the log turns the product above into a sum, and the terms p(s_0) and P(s'|s,a) do not depend on theta, so grad_theta log p(tau;theta) reduces to the sum over the trajectory of grad_theta log pi(a_t|s_t;theta): this is exactly the score function you are asking about.

    In our case f(x) is p(tau;theta), and we get your equation. Then, since we only have access to a finite amount of data (our sampled trajectories), we approximate that expectation with a sample average.

    Step after step, you will (ideally) reach a point where the gradient is 0, meaning that you have reached a (local) optimum (the sketch at the end of this answer illustrates this on a toy problem).

    You can find a more detailed explanation here.

    EDIT

    Informally, you can think of it as learning a policy that increases the probability of seeing a high return R(tau). Usually, R(tau) is the cumulative sum of the rewards. For each state-action pair (s,a), you therefore maximize the sum of the rewards you get from executing a in state s and following pi afterwards. Check this great summary for more details (Fig. 1).
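
    To make those two notions of return concrete, here is a minimal sketch (my own illustration, not part of the original answer): the full return of a trajectory, and the "reward-to-go" from each step t, i.e. the sum of the rewards collected after taking a_t in s_t and following pi afterwards, optionally discounted by a factor gamma.

    def full_return(rewards, gamma=1.0):
        # R(tau): (discounted) cumulative sum of all rewards in the trajectory
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def reward_to_go(rewards, gamma=1.0):
        # out[t]: sum of the (discounted) rewards obtained from step t onwards
        out, running = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    print(full_return([1.0, 0.0, 2.0]))   # 3.0
    print(reward_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]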
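
    As for the sampling + gradient-ascent part, here is a minimal sketch on a toy one-step problem (again my own illustration, with made-up numbers): a softmax policy over three actions, where each "trajectory" is a single action and R(tau) is that action's reward. At every iteration the gradient is estimated with the score-function (log-trick) estimator from a batch of samples, and theta is moved along that estimate; by the end, the policy concentrates on the best action and the exact gradient is close to zero.

    import numpy as np

    rng = np.random.default_rng(0)
    R = np.array([1.0, 2.0, 5.0])          # reward of each action, i.e. R(tau)
    theta = np.zeros(3)                    # policy parameters

    def pi(theta):
        # softmax policy pi(a;theta)
        z = np.exp(theta - theta.max())
        return z / z.sum()

    for step in range(300):
        p = pi(theta)
        # sample a batch of "trajectories" (single actions) from the current policy
        actions = rng.choice(3, size=1000, p=p)
        # score function of the softmax: grad_theta log pi(a;theta) = one_hot(a) - p
        grad_log_pi = np.eye(3)[actions] - p
        # Monte Carlo estimate of grad J(theta) = E[ grad log pi(a;theta) * R(a) ]
        grad_estimate = (grad_log_pi * R[actions, None]).mean(axis=0)
        theta += 0.5 * grad_estimate       # gradient *ascent* step

    p = pi(theta)
    exact_grad = p * (R - p @ R)           # exact gradient of sum_a pi(a;theta)*R(a)
    print(p)                               # most of the mass on the best action
    print(np.linalg.norm(exact_grad))      # close to zero: near a (local) optimum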