pytorch object-detection torchvision single-shot-detector

What do Conv4_x and Conv8_x mean in SSD?


In the SSD paper, VGG16 is used as the base network, and extra feature layers are appended to it so that feature maps are produced at different scales and aspect ratios.

My question concerns the architecture in Fig. 2 of the SSD paper (shown below): the convolution layers are labelled Conv5_3 and Conv4_3 in the base network, and Conv8_2, Conv9_2, and Conv10_2 in the added feature layers.

What does the _2 or _3 suffix mean in these convolution layer names?

I have seen the same notation on the SSD model description page, where the base network is changed from VGG16 to ResNet-50 and the layers are referred to as Conv5_x and Conv4_x.

What does this _x mean in the convolution layer notation?

[image: Fig. 2 of the SSD paper]

Note: The SSD model and the VGG16 model (up to the layer where it serves as the base network in SSD) have the same layers (see below), but produce different output feature maps (obtained with torchinfo.summary(model, (1, 3, 300, 300))).

[image]

VGG16, output feature map of each layer: [image]

SSD, output feature map of each layer: [image]


Solution

  • The paper alone is arguably not very helpful in describing the notation that you are interested in; however, the corresponding code repository adds a bit more information:

    If we look at ssd_pascal.py, for example, we can see where layers with such names are created (starting from line 23):

        out_layer = "conv6_1"
        ConvBNLayer(net, from_layer, out_layer, use_batchnorm, use_relu, 256, 1, 0, 1,
            lr_mult=lr_mult)
    
        from_layer = out_layer
        out_layer = "conv6_2"
        ConvBNLayer(net, from_layer, out_layer, use_batchnorm, use_relu, 512, 3, 1, 2,
            lr_mult=lr_mult)
    

    Now, we should also take a look at the definition of ConvBNLayer in model_libs.py (starting from line 30):

    def ConvBNLayer(net, from_layer, out_layer, use_bn, use_relu, num_output,
        kernel_size, pad, stride, dilation=1, use_scale=True, lr_mult=1,
        conv_prefix='', conv_postfix='', bn_prefix='', bn_postfix='_bn',
        scale_prefix='', scale_postfix='_scale', bias_prefix='', bias_postfix='_bias',
        **bn_params):
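
To see which parameter each positional argument in the two calls above lands on, one can bind the arguments against the signature. The following is only a sketch: `conv_bn_layer_stub` is a hypothetical stand-in that copies the parameter list quoted above but has no body, and `net`, `"previous_layer"`, `use_batchnorm`, `use_relu` and `lr_mult` are placeholder values chosen for illustration.

```python
import inspect


# Hypothetical stub: same parameter list as the ConvBNLayer definition quoted
# above, but with no body. Only the signature matters for this illustration.
def conv_bn_layer_stub(net, from_layer, out_layer, use_bn, use_relu, num_output,
        kernel_size, pad, stride, dilation=1, use_scale=True, lr_mult=1,
        conv_prefix='', conv_postfix='', bn_prefix='', bn_postfix='_bn',
        scale_prefix='', scale_postfix='_scale', bias_prefix='', bias_postfix='_bias',
        **bn_params):
    return None


def describe_call(*args, **kwargs):
    """Map the arguments of a call onto parameter names and keep the ones
    that describe the convolution itself."""
    bound = inspect.signature(conv_bn_layer_stub).bind(*args, **kwargs)
    keys = ("out_layer", "num_output", "kernel_size", "pad", "stride")
    return {key: bound.arguments[key] for key in keys}


# Placeholder values (assumptions, not taken from ssd_pascal.py).
net, use_batchnorm, use_relu, lr_mult = None, False, True, 1

conv6_1 = describe_call(net, "previous_layer", "conv6_1", use_batchnorm, use_relu,
                        256, 1, 0, 1, lr_mult=lr_mult)
conv6_2 = describe_call(net, "conv6_1", "conv6_2", use_batchnorm, use_relu,
                        512, 3, 1, 2, lr_mult=lr_mult)

print(conv6_1)  # num_output=256, kernel_size=1, pad=0, stride=1
print(conv6_2)  # num_output=512, kernel_size=3, pad=1, stride=2
```

So the string after `from_layer` is the layer's name, and the four numbers are the output channels, kernel size, padding and stride, in that order.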
    

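    Then we can piece this information together and add a bit of guesswork: the layer name is simply the string passed as out_layer. The number before the underscore appears to identify a block of consecutive layers, and the number after it appears to index the layer within that block. In the excerpt, conv6_1 is a 1x1 convolution with 256 outputs and stride 1, and conv6_2 is the 3x3 convolution with 512 outputs and stride 2 that follows it.

    Note that the layer names and numbers in the figure don't match those in ssd_pascal.py (the latter ends after conv9_2, which has a different stride than Conv9_2 in the figure), but the scheme should be the same, assuming that the authors were consistent.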
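
Under that reading of the naming scheme (which is an interpretation, not something the paper states), a layer name can be split into a block number and a position within the block. A minimal sketch, where `parse_layer_name` and `group_by_block` are hypothetical helpers written for this answer:

```python
import re
from collections import defaultdict

# Matches names such as "conv6_2" or "Conv4_3": block number, then index.
LAYER_NAME = re.compile(r"^conv(\d+)_(\d+)$", re.IGNORECASE)


def parse_layer_name(name):
    """Return (block, index) for a name like 'Conv4_3'."""
    match = LAYER_NAME.match(name)
    if match is None:
        raise ValueError(f"not a convN_M layer name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def group_by_block(names):
    """Collect the layer indices that appear for each block number."""
    blocks = defaultdict(list)
    for name in names:
        block, index = parse_layer_name(name)
        blocks[block].append(index)
    return {block: sorted(indices) for block, indices in sorted(blocks.items())}


print(parse_layer_name("Conv4_3"))                        # (4, 3)
print(group_by_block(["conv6_1", "conv6_2", "Conv8_2"]))  # {6: [1, 2], 8: [2]}
```

Read this way, Conv4_3 is the third convolution of block 4, and Conv8_2 is the second convolution of block 8.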

    As to your final question: In the SSD description page, where they write e.g.

    • The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
    • All strides in conv4_x are set to 1x1.

    I assume that "x" simply serves as a placeholder for the suffix _1, _2 and thus should be read as follows: conv5_1 and conv5_2 were removed, all strides in conv4_1 and conv4_2 are set to 1x1.
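
The placeholder reading can be sketched as a simple pattern match. The layer list below is hypothetical and only mirrors the names used in this answer; it is not the actual layer list of any model, and `matches_placeholder` and `select_layers` are helpers invented for illustration.

```python
import re


def matches_placeholder(pattern, layer_name):
    """Treat a trailing '_x' in the pattern as 'any numeric suffix'."""
    prefix, _, suffix = pattern.rpartition("_")
    if suffix != "x":
        # No placeholder: the pattern names exactly one layer.
        return pattern == layer_name
    return re.fullmatch(re.escape(prefix) + r"_\d+", layer_name) is not None


def select_layers(pattern, layer_names):
    """Return the layer names that the pattern stands for."""
    return [name for name in layer_names if matches_placeholder(pattern, name)]


# Hypothetical layer names, for illustration only.
layers = ["conv4_1", "conv4_2", "conv5_1", "conv5_2", "avgpool", "fc"]

print(select_layers("conv5_x", layers))  # ['conv5_1', 'conv5_2']
print(select_layers("conv4_x", layers))  # ['conv4_1', 'conv4_2']
print(select_layers("avgpool", layers))  # ['avgpool']
```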