I am trying to plot the overestimation bias of the critics in DDPG and TD3 models. Essentially there is a critic network and a critic_target network. I want to understand how one goes about measuring the overestimation bias of the critic relative to the true Q value, and also how to find the true Q value in the first place.
I see in the original TD3 paper (https://arxiv.org/pdf/1802.09477.pdf) that the authors measure the overestimation bias of the value networks. Can someone guide me in plotting the same during the training phase of my actor-critic model?
Answering my own question: during the training phase, at each evaluation period (for example, every 5000 steps), we can call a function that does the following. Keep in mind that the policy is kept fixed throughout this evaluation run.
The code is roughly as follows:
import gym

def get_estimation_values(policy, env_name, gamma=0.99):
    eval_env = gym.make(env_name)
    state, done = eval_env.reset(), False
    max_steps = eval_env._max_episode_steps  # episode length limit; attribute name may differ by gym version

    # If there is only one critic (as in DDPG), the estimated Q value for the
    # starting state s0 is simply the critic's output for (s0, pi(s0)).
    action = policy.actor(state)
    estimated_Q = policy.critic(state, action)

    # The true Q value follows from the Bellman expansion:
    #   Q(s0, a0) = r_0 + gamma * Q(s1, a1)
    #   Q(s1, a1) = r_1 + gamma * Q(s2, a2)
    #   Q(s2, a2) = r_2 + gamma * Q(s3, a3), and so on.
    # Unrolling until the terminal state gives the discounted return:
    #   True_Q = r_0 + gamma*r_1 + gamma^2 * r_2 + gamma^3 * r_3 + ...
    true_Q = 0
    for t in range(max_steps):
        if done:
            break
        # Take actions according to the current (fixed) policy until done.
        action = policy.actor(state)  # convert the tensor to a numpy array if required
        next_state, reward, done, _ = eval_env.step(action)
        true_Q += (gamma ** t) * reward
        state = next_state

    return estimated_Q, true_Q
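
To actually plot the bias during training, you can call this function every `eval_freq` steps and record both values. Below is a minimal sketch of that bookkeeping; it assumes `policy`, `env_name`, and `max_training_steps` already exist in your training script (they are placeholders, not part of the answer above), and it uses matplotlib for the plot. The gap between the two curves (estimated minus true) is the overestimation bias.

import matplotlib.pyplot as plt

eval_freq = 5000                              # evaluate every 5000 training steps, as in the example above
steps, estimated_Qs, true_Qs = [], [], []

for total_steps in range(1, max_training_steps + 1):
    # ... your usual DDPG/TD3 training update goes here ...
    if total_steps % eval_freq == 0:
        est_Q, tr_Q = get_estimation_values(policy, env_name, gamma=0.99)
        steps.append(total_steps)
        estimated_Qs.append(float(est_Q))     # critic output may be a tensor
        true_Qs.append(tr_Q)

# Plot estimated vs. true Q over training; the gap is the overestimation bias.
plt.plot(steps, estimated_Qs, label="estimated Q (critic)")
plt.plot(steps, true_Qs, label="true Q (discounted return)")
plt.xlabel("training steps")
plt.ylabel("Q value at s0")
plt.legend()
plt.show()

Note that the TD3 paper averages these quantities over many states rather than a single starting state, so you may want to average over several evaluation episodes to get smoother curves.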