machine-learning deep-learning retinanet

How are the FCN heads convolved over RetinaNet's FPN features?


I recently read the RetinaNet paper and there is one minor detail I have yet to understand:
We have the multi-scale feature maps obtained from the FPN (P3, ..., P7).
The two FCN heads (the classifier head and the regressor head) are then convolved over each of these feature maps.
However, each feature map has a different spatial size, so how do the classifier and regressor heads maintain fixed output volumes, given that all their convolution parameters are fixed (i.e. a 3x3 filter with stride 1, etc.)?

Looking at this line in PyTorch's implementation of RetinaNet, I see that the heads simply convolve each feature map and then all the outputs are stacked somehow (the only dimension they share is the channel dimension, which is 256; spatially, each map is twice the size of the next).
I would love to hear how they are combined; I wasn't able to understand that point.


Solution

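  • The heads do not produce a fixed-size output. Because they are fully convolutional (3x3 filters, stride 1, with padding, and the same weights reused at every pyramid level), each level's output keeps that level's own H and W. After the convolution at each pyramid level, you reshape the output to shape (H*W, out_dim), where out_dim is num_classes * num_anchors for the class head and 4 * num_anchors for the bbox regressor. You can then concatenate the resulting tensors along the H*W dimension, which is now possible because all the other dimensions match, and compute the losses as you would for a network with a single feature layer.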
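
The reshape-and-concatenate step can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the torchvision code: a single `nn.Conv2d` stands in for each head's full conv tower, `flatten_levels` is a hypothetical helper name, and the batch size, class count, and feature-map sizes (chosen as if for a 256x256 input with strides 8 to 128) are assumptions made only for the example.

```python
import torch
from torch import nn

# Illustrative values (assumptions, not taken from the question).
num_classes, num_anchors, channels = 20, 9, 256

# One conv per head, reused on every pyramid level. A real head stacks
# several convs first; a single layer keeps the sketch short.
cls_head = nn.Conv2d(channels, num_anchors * num_classes, kernel_size=3, stride=1, padding=1)
box_head = nn.Conv2d(channels, num_anchors * 4, kernel_size=3, stride=1, padding=1)

# Dummy FPN outputs for a batch of 2: same channel count, halving spatial size.
features = [torch.randn(2, channels, s, s) for s in (32, 16, 8, 4, 2)]


def flatten_levels(head, feats):
    """Apply `head` to each level, reshape to (N, H*W, out_dim), concatenate."""
    per_level = []
    for feat in feats:
        out = head(feat)                  # (N, out_dim, H, W), H and W vary per level
        n, out_dim, h, w = out.shape
        out = out.permute(0, 2, 3, 1)     # (N, H, W, out_dim)
        per_level.append(out.reshape(n, h * w, out_dim))
    return torch.cat(per_level, dim=1)    # (N, sum of H*W over levels, out_dim)


with torch.no_grad():
    cls_logits = flatten_levels(cls_head, features)
    box_regression = flatten_levels(box_head, features)

print(cls_logits.shape)      # torch.Size([2, 1364, 180])
print(box_regression.shape)  # torch.Size([2, 1364, 36])
```

Here 1364 = 32*32 + 16*16 + 8*8 + 4*4 + 2*2, so the concatenated length depends on the input resolution while out_dim stays fixed. Splitting out_dim further into separate anchor and class (or anchor and coordinate) axes is a layout choice that does not change the idea.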