deep-learning reinforcement-learning function-approximation

How to find the true Q-value and the overestimation bias in actor-critic methods


I am trying to plot the overestimation bias of the critics in DDPG and TD3 models. Essentially there is a critic network and a critic_target network. I want to understand how one goes about measuring the overestimation bias of the critic with respect to the true Q value, and also how to find the true Q value in the first place.

I see in the original TD3 paper (https://arxiv.org/pdf/1802.09477.pdf) that the authors measure the overestimation bias of the value networks. Can someone guide me on plotting the same during the training phase of my actor-critic model?


Solution

  • Answering my own question: essentially, during the training phase, at each evaluation period (for example, every 5000 steps), we can call a function that does the following. Keep in mind that the policy is kept fixed throughout this evaluation run.

    The pseudocode is as follows:

    import gym

    def get_estimation_values(policy, env_name, gamma=0.99):
        eval_env = gym.make(env_name)
        state, done = eval_env.reset(), False
        max_steps = eval_env._max_episode_steps  # episode horizon (from gym's TimeLimit wrapper)

        # For example, if there is only one critic (as in DDPG):
        action = policy.actor(state)
        estimated_Q = policy.critic(state, action)  # estimated Q value for the starting state s0

        # The true Q value is given by:
        #   Q(s0, a0) = r_0 + gamma * Q(s1, a1)
        #   Q(s1, a1) = r_1 + gamma * Q(s2, a2)
        #   Q(s2, a2) = r_2 + gamma * Q(s3, a3)   and so on
        #
        # Therefore the true Q value can be written as:
        #   True_Q = r_0 + gamma * (r_1 + gamma * (r_2 + gamma * (r_3 + ...)))
        #          = r_0 + gamma * r_1 + gamma^2 * r_2 + gamma^3 * r_3 + ...   until the terminal state

        # Monte-Carlo rollout under the fixed policy to compute the true (discounted) return
        true_Q = 0
        for t in range(max_steps):
            if done:
                break

            # take actions according to the current policy until done
            action = policy.actor(state)  # convert the tensor to numpy here if required
            next_state, reward, done, _ = eval_env.step(action)

            true_Q += (gamma ** t) * reward
            state = next_state

        return estimated_Q, true_Q
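
    To actually plot the bias during training, one option is to average these two values over a few evaluation episodes every so many steps and log the gap between them. Below is a minimal sketch of that idea; names like eval_freq, steps_log and bias_log are placeholders for whatever your own training script uses, and it assumes the get_estimation_values function from above is available.

    import numpy as np
    import matplotlib.pyplot as plt

    def evaluate_bias(policy, env_name, episodes=10, gamma=0.99):
        # average the start-state estimate and the true discounted return over
        # several evaluation episodes to reduce Monte-Carlo noise
        estimates, true_returns = [], []
        for _ in range(episodes):
            est_Q, true_Q = get_estimation_values(policy, env_name, gamma)
            estimates.append(float(est_Q))  # cast in case the critic returns a tensor
            true_returns.append(true_Q)
        return np.mean(estimates) - np.mean(true_returns)  # positive => overestimation

    # inside the training loop, something like:
    #     if step % eval_freq == 0:
    #         steps_log.append(step)
    #         bias_log.append(evaluate_bias(policy, env_name))

    def plot_bias(steps_log, bias_log):
        # plot the logged bias against training steps once training is finished
        plt.plot(steps_log, bias_log)
        plt.xlabel("Training steps")
        plt.ylabel("Estimated Q - true Q (overestimation bias)")
        plt.show()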