python, tensorflow, object-detection, tensorflow-model-analysis

What happens if tf.stop_gradient is not set?


I am reading the faster-rcnn code in tensorflow/models, and I am confused by the use of tf.stop_gradient.

Consider the following code snippet:

if self._is_training:
    proposal_boxes = tf.stop_gradient(proposal_boxes)
    if not self._hard_example_miner:
        (groundtruth_boxlists, groundtruth_classes_with_background_list, _,
         groundtruth_weights_list
        ) = self._format_groundtruth_data(true_image_shapes)
        (proposal_boxes, proposal_scores,
         num_proposals) = self._sample_box_classifier_batch(
             proposal_boxes, proposal_scores, num_proposals,
             groundtruth_boxlists, groundtruth_classes_with_background_list,
             groundtruth_weights_list)

More code is here. My question is: what happens if tf.stop_gradient is not set for proposal_boxes?


Solution

  • This is really a good question, because this single line with tf.stop_gradient is crucial when training faster_rcnn models. Here is why it is needed during training.

    Faster_rcnn models are two-stage detectors, and the loss function has to serve both stages: in faster_rcnn, both the rpn loss and the fast_rcnn loss need to be minimized.

    Here is what the paper says in section 3.2:

    Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks.
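
    As a toy sketch of that shared-layer coupling (hypothetical variables, not the API's real graph), both stages' losses can depend on the same backbone weights, so joint training minimizes their sum in a single step:

    import tensorflow as tf

    # A stand-in for the shared convolutional layers.
    shared_weight = tf.Variable(1.0)

    with tf.GradientTape() as tape:
        rpn_loss = tf.square(shared_weight - 2.0)        # stand-in RPN loss
        fast_rcnn_loss = tf.square(shared_weight - 3.0)  # stand-in Fast R-CNN loss
        total_loss = rpn_loss + fast_rcnn_loss

    # One gradient update serves both objectives through the shared layers.
    print(tape.gradient(total_loss, shared_weight))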

    The paper then describes three training schemes; in the original paper the authors adopted the first one, alternating training: train the RPN first, then train Fast R-CNN.

    The second scheme is approximate joint training. It is easy to implement, and it is the scheme the API adopts. Fast R-CNN takes as input the coordinates of the bounding boxes predicted by the RPN, so the Fast R-CNN loss has gradients w.r.t. those box coordinates. In this training scheme, however, those gradients are ignored, which is exactly why tf.stop_gradient is used. The paper reports that this scheme reduces training time by 25-50%.
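
    Here is a minimal sketch of that effect (TF2 eager mode, with a toy variable standing in for the real RPN box predictions): once tf.stop_gradient is applied, the second-stage loss produces no gradient through the proposal coordinates.

    import tensorflow as tf

    # Stands in for the RPN's predicted box coordinates.
    rpn_boxes = tf.Variable([[0.1, 0.1, 0.6, 0.6]])

    with tf.GradientTape() as tape:
        proposal_boxes = tf.stop_gradient(rpn_boxes)  # as in the snippet above
        # A stand-in for the Fast R-CNN loss, which depends on the coordinates.
        second_stage_loss = tf.reduce_sum(tf.square(proposal_boxes - 0.5))

    # Prints None: the coordinate path back to the RPN is cut. Without
    # tf.stop_gradient this would be a real gradient, and the second-stage
    # loss would also be training the RPN's box regression directly.
    print(tape.gradient(second_stage_loss, rpn_boxes))

    In the real model the Fast R-CNN loss still updates the shared convolutional features; only the path through the predicted coordinates is cut.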

    The third scheme is non-approximate joint training, in which no tf.stop_gradient is needed. The paper notes, however, that building an RoI pooling layer that is differentiable w.r.t. the box coordinates is a nontrivial problem.

    But why are those gradients ignored?

    It turns out the RoI pooling layer is in fact fully differentiable, but the main reason to favor scheme two is that scheme three makes training unstable in its early stages.
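
    To see that differentiability itself is not the obstacle, here is a small sketch (TF2, toy shapes) using tf.image.crop_and_resize, a crop operation of the kind used for RoI extraction. It has a registered gradient for its boxes input, so without stop_gradient such gradients would flow back into the RPN:

    import tensorflow as tf

    # Toy feature map and one normalized box [y1, x1, y2, x2].
    feature_map = tf.random.normal([1, 32, 32, 3])
    boxes = tf.Variable([[0.1, 0.1, 0.8, 0.8]])

    with tf.GradientTape() as tape:
        crops = tf.image.crop_and_resize(
            feature_map, boxes, box_indices=[0], crop_size=[7, 7])
        loss = tf.reduce_sum(crops)

    # Prints a non-None gradient: the bilinear crop is differentiable w.r.t.
    # the box coordinates, so scheme three is technically possible; it is
    # avoided for stability, not feasibility.
    print(tape.gradient(loss, boxes))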

    One of the authors of the API gave a really good answer here.

    Some further reading regarding approximate joint training.