python, tensorflow, keras, keras-rl

Inverting Gradients in Keras


I'm trying to port the BoundingLayer function from this file to the DDPG.py agent in keras-rl, but I'm having some trouble with the implementation.

I modified the get_gradients(loss, params) method in DDPG.py to add this:

action_bounds = [-30, 50]

inverted_grads = []
for g,p in zip(modified_grads, params):
    is_above_upper_bound = K.greater(p, K.constant(action_bounds[1], dtype='float32'))
    is_under_lower_bound = K.less(p, K.constant(action_bounds[0], dtype='float32'))
    is_gradient_positive = K.greater(g, K.constant(0, dtype='float32'))
    is_gradient_negative = K.less(g, K.constant(0, dtype='float32'))

    invert_gradient = tf.logical_or(
        tf.logical_and(is_above_upper_bound, is_gradient_negative),
        tf.logical_and(is_under_lower_bound, is_gradient_positive)
    )

    inverted_grads.extend(K.switch(invert_gradient, -g, g))
modified_grads = inverted_grads[:]

But I get an error about the shape:

ValueError: Shape must be rank 0 but is rank 2 for 'cond/Switch' (op: 'Switch') with input shapes: [2,400], [2,400].

Solution

  • The keras-rl "get_gradients" function uses gradients calculated with a combined actor-critic model, but you need the gradient of the critic output with respect to the action input to apply the inverting-gradients feature.

    I've recently implemented it in an RDPG prototype I'm working on, using keras-rl. I'm still testing it, the code can be optimized and is certainly not bug-free, but I've got the inverting gradient working by modifying a few lines of keras-rl code. To modify the gradient of the critic output with respect to the action input, I followed the original formula for computing the actor gradient, with the help of this great post by Patrick Emami: http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html.
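    To make the scaling rule explicit before the full code: below is a minimal, standalone NumPy sketch of the inverting-gradients rule that "calculate_inverted_gradient" (further down) applies. The bounds and example values here are placeholders, not taken from the agent. If the gradient of Q with respect to an action component suggests increasing it, the gradient is scaled by (p_max - p) / (p_max - p_min); if it suggests decreasing it, it is scaled by (p - p_min) / (p_max - p_min), so actions are pushed away from their bounds instead of saturating.

    import numpy as np

    def invert_gradients(dq_da, action, p_min=-1.0, p_max=1.0):
        # Inverting-gradients rule: scale each component of dQ/da depending on how
        # close the action already is to the corresponding bound.
        width = p_max - p_min
        increasing = dq_da > 0  # components that would push the action upwards
        scale = np.where(increasing, (p_max - action) / width, (action - p_min) / width)
        return dq_da * scale

    # A gradient pushing an almost-saturated action further towards its bound is
    # damped heavily; one with plenty of room left is barely changed.
    print(invert_gradients(np.array([0.5, 0.5]), np.array([0.9, -0.9])))

    Note that "calculate_inverted_gradient" in the code below works with the negated gradient (Keras optimizers minimize), which is why its condition is "is_gradient_negative" rather than "is positive".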

    I'm putting the entire "compile" function here, redefined in a class that inherits from "DDPGAgent", where the inverting-gradient feature is implemented.

    def compile(self, optimizer, metrics=[]):
        metrics += [mean_q]
    
        if type(optimizer) in (list, tuple):
            if len(optimizer) != 2:
                raise ValueError('More than two optimizers provided. Please only provide a maximum of two optimizers, the first one for the actor and the second one for the critic.')
            actor_optimizer, critic_optimizer = optimizer
        else:
            actor_optimizer = optimizer
            critic_optimizer = clone_optimizer(optimizer)
        if type(actor_optimizer) is str:
            actor_optimizer = optimizers.get(actor_optimizer)
        if type(critic_optimizer) is str:
            critic_optimizer = optimizers.get(critic_optimizer)
        assert actor_optimizer != critic_optimizer
    
        if len(metrics) == 2 and hasattr(metrics[0], '__len__') and hasattr(metrics[1], '__len__'):
            actor_metrics, critic_metrics = metrics
        else:
            actor_metrics = critic_metrics = metrics
    
        def clipped_error(y_true, y_pred):
            return K.mean(huber_loss(y_true, y_pred, self.delta_clip), axis=-1)
    
        # Compile target networks. We only use them in feed-forward mode, hence we can pass any
        # optimizer and loss since we never use it anyway.
        self.target_actor = clone_model(self.actor, self.custom_model_objects)
        self.target_actor.compile(optimizer='sgd', loss='mse')
        self.target_critic = clone_model(self.critic, self.custom_model_objects)
        self.target_critic.compile(optimizer='sgd', loss='mse')
    
        # We also compile the actor. We never optimize the actor using Keras but instead compute
        # the policy gradient ourselves. However, we need the actor in feed-forward mode, hence
        # we also compile it with any optimizer and loss since we never use it anyway.
        self.actor.compile(optimizer='sgd', loss='mse')
    
        # Compile the critic.
        if self.target_model_update < 1.:
            # We use the `AdditionalUpdatesOptimizer` to efficiently soft-update the target model.
            critic_updates = get_soft_target_model_updates(self.target_critic, self.critic, self.target_model_update)
            critic_optimizer = AdditionalUpdatesOptimizer(critic_optimizer, critic_updates)
        self.critic.compile(optimizer=critic_optimizer, loss=clipped_error, metrics=critic_metrics)      
    
        clipnorm = getattr(actor_optimizer, 'clipnorm', 0.)
        clipvalue = getattr(actor_optimizer, 'clipvalue', 0.)
    
        critic_gradients_wrt_action_input = tf.gradients(self.critic.output, self.critic_action_input)
        critic_gradients_wrt_action_input = [g / float(self.batch_size) for g in critic_gradients_wrt_action_input]  # since TF sums over the batch
    
        action_bounds = [(-1.,1.) for i in range(self.nb_actions)]
    
        def calculate_inverted_gradient():
            """
            Applies "inverting gradient" feature to the action-value gradients.
            """
            gradient_wrt_action = -critic_gradients_wrt_action_input[0]
    
            inverted_gradients = []
    
            for n in range(self.batch_size):
                inverted_gradient = []
                for i in range(gradient_wrt_action[n].shape[0].value):
                    action = self.critic_action_input[n][i]           
                    is_gradient_negative = K.less(gradient_wrt_action[n][i], K.constant(0, dtype='float32'))       
                    adjust_for_upper_bound = gradient_wrt_action[n][i] * ((action_bounds[i][1] - action) / (action_bounds[i][1] - action_bounds[i][0]))  
                    adjust_for_lower_bound = gradient_wrt_action[n][i] * ((action - action_bounds[i][0]) / (action_bounds[i][1] - action_bounds[i][0]))
                    modified_gradient = K.switch(is_gradient_negative, adjust_for_upper_bound, adjust_for_lower_bound)
                    inverted_gradient.append( modified_gradient )
                inverted_gradients.append(inverted_gradient)
    
            gradient_wrt_action = tf.stack(inverted_gradients)
    
            return gradient_wrt_action
    
        actor_gradients_wrt_weights = tf.gradients(self.actor.output, self.actor.trainable_weights, grad_ys=calculate_inverted_gradient())        
        actor_gradients_wrt_weights = [g / float(self.batch_size) for g in actor_gradients_wrt_weights]  # since TF sums over the batch
    
        def get_gradients(loss, params):
            """ Used by the actor optimizer.
                Returns the gradients to train the actor.
                These gradients are obtained by multiplying the gradients of the actor output w.r.t. its weights
                with the gradients of the critic output w.r.t. its action input. """                                   
    
        # Apply clipping if defined
            modified_grads = [g for g in actor_gradients_wrt_weights]
    
            if clipnorm > 0.:
                norm = K.sqrt(sum([K.sum(K.square(g)) for g in modified_grads]))
                modified_grads = [optimizers.clip_norm(g, clipnorm, norm) for g in modified_grads]
            if clipvalue > 0.:
                modified_grads = [K.clip(g, -clipvalue, clipvalue) for g in modified_grads]
    
            return modified_grads
    
        actor_optimizer.get_gradients = get_gradients
    
        # get_updates is the optimizer function that changes the weights of the network
        updates = actor_optimizer.get_updates(self.actor.trainable_weights, self.actor.constraints, None)
    
        if self.target_model_update < 1.:
            # Include soft target model updates.
            updates += get_soft_target_model_updates(self.target_actor, self.actor, self.target_model_update)
        updates += self.actor.updates  # include other updates of the actor, e.g. for BN
    
        # Finally, combine it all into a callable function.
        # The inputs will be all the necessary placeholders to compute the gradients (actor and critic inputs)
        inputs = self.actor.inputs[:] + [self.critic_action_input, self.critic_history_input]
        self.actor_train_fn = K.function(inputs, [self.actor.output], updates=updates)
    
        self.actor_optimizer = actor_optimizer
    
        self.compiled = True
    

    When training the actor, you now need to pass 3 inputs instead of 2: the observation inputs plus the action input (a prediction from the actor network), so you must also modify the "backward" function. In my case:

            ...
            if self.episode > self.nb_steps_warmup_actor:
                action = self.actor.predict_on_batch(history_batch)
                inputs = [history_batch, action, history_batch]
                actor_train_result = self.actor_train_fn(inputs)
                action_values = actor_train_result[0]
                assert action_values.shape == (self.batch_size, self.nb_actions)
            ...
    

    After that, you can use a linear activation in the actor's output layer.
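    For reference, a minimal actor along those lines might look like this (the layer sizes, names, and the build_actor helper are just an illustration, not taken from the answer):

    from keras.layers import Dense, Flatten, Input
    from keras.models import Model

    def build_actor(observation_shape, nb_actions):
        # Plain feed-forward actor. The output activation is 'linear' because the
        # inverting-gradients trick takes over the job of keeping actions in range,
        # so no tanh/sigmoid squashing is needed at the output.
        observation_input = Input(shape=(1,) + observation_shape, name='observation_input')
        x = Flatten()(observation_input)
        x = Dense(400, activation='relu')(x)
        x = Dense(300, activation='relu')(x)
        action = Dense(nb_actions, activation='linear')(x)
        return Model(inputs=observation_input, outputs=action)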