
Deep SARSA algorithm works with PyTorch (Adam optimizer) but not with Keras/TensorFlow (Adam optimizer)

I have a deep SARSA algorithm that works well with PyTorch on LunarLander-v2, and I would like to use it with Keras/TensorFlow. It uses mini-batches of size 64, and 128 mini-batch training steps are run at each episode.

Here are the results I get. As you can see, it works well with PyTorch but not with Keras/TensorFlow, so I think I have not implemented the training function correctly in Keras/TensorFlow (code is below).

It seems that the loss oscillates in Keras because epsilon drops to a low value too early, but this works very well in PyTorch.

Do you see anything that could explain why it does not work in Keras/TensorFlow? Thanks a lot for your help and for any idea that could help me.

(Image: results obtained with PyTorch and with Keras/TensorFlow)

Network information:

It uses the Adam optimizer and a network with two hidden layers of 256 and 128 units, with ReLU on each:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Q_Network(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Q_Network, self).__init__()
        self.x_layer = nn.Linear(state_dim, 256)
        self.h_layer = nn.Linear(256, 128)
        self.y_layer = nn.Linear(128, action_dim)

    def forward(self, state):
        # Two ReLU hidden layers, then a linear output with one Q-value per action
        xh = F.relu(self.x_layer(state))
        hh = F.relu(self.h_layer(xh))
        state_action_values = self.y_layer(hh)
        return state_action_values

For Keras/TensorFlow I use this one:

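import tensorflow as tf
from tensorflow import keras

def CreationModele(dimension):
  # Input: state vector of size `dimension` (the shape is written as a tuple)
  entree_etat = keras.layers.Input(shape=(dimension,))

  sortie = keras.layers.Dense(units=256, activation='relu')(entree_etat)
  sortie = keras.layers.Dense(units=128, activation='relu')(sortie)
  # Linear output layer: one Q-value per action (4 actions in LunarLander-v2)
  sortie = keras.layers.Dense(units=4)(sortie)

  modele = keras.Model(inputs=entree_etat, outputs=sortie)
  return modele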

Training code

In Pytorch, the training is done by:

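def update_Sarsa_Network(self, state, next_state, action, next_action, reward, ends):
    # Q(s, a) for the actions actually taken
    actions_values = torch.gather(self.qnet(state), dim=1, index=action.long())

    # Q(s', a') for the next actions
    next_actions_values = torch.gather(self.qnet(next_state), dim=1, index=next_action.long())

    # SARSA target: r + gamma * Q(s', a'), with no bootstrap term on terminal transitions
    next_actions_values = reward + (1.0 - ends) * (self.discount_factor * next_actions_values)

    # The target is detached, so gradients flow only through Q(s, a)
    q_network_loss = self.MSELoss_function(actions_values, next_actions_values.detach())
    return q_network_loss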

And in Keras/Tensorflow by:

mse = keras.losses.MeanSquaredError()

def train(model, batch_next_states_tensor, batch_next_actions_tensor, batch_reward_tensor, batch_end_tensor, batch_states_tensor, batch_actions_tensor, optimizer, gamma):
  with tf.GradientTape() as tape:
    # Estimate the values of the current actions
    actions_values = model(batch_states_tensor)                                                          # (mini_batch_size,4)
    actions_values = tf.linalg.diag_part(tf.gather(actions_values,batch_actions_tensor,axis=1))         # (mini_batch_size,)
    actions_values = tf.expand_dims(actions_values,-1)                                                  # (mini_batch_size,1)

    # Estimate the values of the next actions
    next_actions_values = model(batch_next_states_tensor)                                                          # (mini_batch_size,4)
    next_actions_values = tf.linalg.diag_part(tf.gather(next_actions_values,batch_next_actions_tensor,axis=1))   # (mini_batch_size,)
    # Targets: r + gamma * Q(s', a')
    cibles = batch_reward_tensor + (1.0 - batch_end_tensor)*gamma*tf.expand_dims(next_actions_values,-1)         # (mini_batch_size,1)

    error = mse(cibles, actions_values)
  grads = tape.gradient(error, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return error
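
For comparison with `torch.gather`, the per-row selection Q(s_i, a_i) can probably also be written without the intermediate (mini_batch_size, mini_batch_size) matrix, using `tf.gather` with `batch_dims=1`. The following is only a minimal sketch with made-up values; `select_action_values` is a hypothetical helper name that does not appear in the code above, and it assumes the action indices are integers that can be flattened to shape (mini_batch_size,).

```python
import tensorflow as tf


def select_action_values(q_values, actions):
    """Return q_values[i, actions[i]] for every row i, with shape (batch, 1)."""
    # Flatten the indices to shape (batch,) so (batch,) and (batch, 1) inputs both work
    actions = tf.cast(tf.reshape(actions, [-1]), tf.int32)
    # batch_dims=1 pairs row i of q_values with index i of actions
    picked = tf.gather(q_values, actions, axis=1, batch_dims=1)   # (batch,)
    return tf.expand_dims(picked, -1)                             # (batch, 1)


# Made-up Q-values for 3 states and 4 actions
q = tf.constant([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [9.0, 10.0, 11.0, 12.0]])
a = tf.constant([2, 0, 3])

fast = select_action_values(q, a)
# Same selection written with the gather + diag_part form, for 1-D indices
slow = tf.expand_dims(tf.linalg.diag_part(tf.gather(q, a, axis=1)), -1)
```

With 1-D indices both forms pick the same entries; the `batch_dims=1` form just avoids building a square matrix that is discarded except for its diagonal.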

Error function and Optimizer code

The optimizer is Adam in both PyTorch and TensorFlow, with lr=0.001. In PyTorch:

def __init__(self, state_dim, action_dim):
    self.qnet = Q_Network(state_dim, action_dim)
    self.qnet_optim = torch.optim.Adam(self.qnet.parameters(), lr=0.001)
    self.discount_factor = 0.99
    self.MSELoss_function = nn.MSELoss(reduction='sum')
    self.replay_buffer = ReplayBuffer()

In Keras / Tensorflow:

alpha = 1e-3

# Initialize the model
modele_Keras = CreationModele(8)

optimiseur_Keras = keras.optimizers.Adam(learning_rate=alpha)


  • OK, I finally found a solution: decorrelating the target and the action value by using two models, one of which is updated only periodically and is used to compute the target values.

    I use one model to choose the epsilon-greedy actions and to compute the Q(s,a) values, and a fixed model (periodically updated with the weights of the first model) to compute the target r+gamma*Q(s',a').

    Here is my result: (image)
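
The two-model idea described above could look roughly like the following in Keras/TensorFlow. This is a sketch under assumptions, not the code actually used: the names `online_model`, `target_model`, `sync_target`, `train_step` and the period `SYNC_EVERY` are hypothetical, and the batch is random stand-in data. The target is computed by the frozen copy outside the `GradientTape`, so, like the `.detach()` call in the PyTorch version, no gradient flows through it.

```python
import tensorflow as tf
from tensorflow import keras


def build_q_model(state_dim=8, n_actions=4):
    """Same 256/128 ReLU architecture as in the question."""
    return keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(n_actions),
    ])


online_model = build_q_model()   # chooses actions and provides Q(s, a)
target_model = build_q_model()   # frozen copy, used only for r + gamma * Q(s', a')
mse = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.Adam(learning_rate=1e-3)


def sync_target():
    """Copy the online weights into the target model."""
    target_model.set_weights(online_model.get_weights())


def train_step(states, actions, rewards, next_states, next_actions, ends, gamma=0.99):
    # Target computed by the frozen model, outside the tape
    next_q = tf.gather(target_model(next_states), next_actions, axis=1, batch_dims=1)
    targets = rewards + (1.0 - ends) * gamma * tf.expand_dims(next_q, -1)   # (batch, 1)
    with tf.GradientTape() as tape:
        q = tf.gather(online_model(states), actions, axis=1, batch_dims=1)
        loss = mse(targets, tf.expand_dims(q, -1))
    # Only the online model is trained
    grads = tape.gradient(loss, online_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_model.trainable_variables))
    return loss


# Random stand-in batch (made-up data, only to exercise the shapes)
batch = 64
states = tf.random.normal((batch, 8))
next_states = tf.random.normal((batch, 8))
actions = tf.random.uniform((batch,), maxval=4, dtype=tf.int32)
next_actions = tf.random.uniform((batch,), maxval=4, dtype=tf.int32)
rewards = tf.random.normal((batch, 1))
ends = tf.zeros((batch, 1))

SYNC_EVERY = 10   # hypothetical sync period, in training steps
sync_target()     # start with identical weights
for step in range(1, 21):
    loss = train_step(states, actions, rewards, next_states, next_actions, ends)
    if step % SYNC_EVERY == 0:
        sync_target()
```

How often to synchronize is a tuning choice: a longer period keeps the target more stable but makes it lag further behind the online model.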