deep-learning · pytorch · faster-rcnn

what is the meaning of 'per-layer learning rate' in Fast R-CNN paper?


I'm reading the Fast R-CNN paper.

In section 2.3 of the paper, under 'SGD hyper-parameters', it says that "All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001".


Is a 'per-layer learning rate' the same as a 'layer-specific learning rate', i.e. giving each layer its own learning rate? If so, I can't understand how the 'per-layer learning rate' and the 'global learning rate' can be applied at the same time.


I found this example of a layer-specific learning rate in PyTorch:

import torch.optim as optim

optim.SGD([
    {'params': model.some_layers.parameters()},            # falls back to the global lr (1e-3)
    {'params': model.other_layers.parameters(), 'lr': 1},  # group-specific lr overrides it
], lr=1e-3, momentum=0.9)

According to the paper, is this the correct approach?


Sorry for my English.


Solution

  • The 'per-layer' terminology in that paper is slightly ambiguous: they aren't referring to layer-specific learning rates.

    All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001.

    The statement refers to the Caffe framework, in which Fast R-CNN was originally written (github link).

    They mean that the learning rate multiplier (lr_mult) is set to 1 for the weights and 2 for the biases of each layer.

    Check any prototxt file in the repo, e.g. CaffeNet/train.prototxt:

      param {
        lr_mult: 1     # weights: multiplier on the global learning rate
        decay_mult: 1  # weights receive the full weight decay
      }
      param {
        lr_mult: 2     # biases: updated at twice the global learning rate
        decay_mult: 0  # no weight decay on biases
      }
    

    Thus, the effective learning rate of a parameter is base_lr * lr_mult, where the base learning rate of 0.001 is defined in solver.prototxt. So the weights are updated with an effective learning rate of 0.001 and the biases with 0.002. Likewise, decay_mult: 0 means the biases receive no weight decay.
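
    If you want to reproduce this behaviour in PyTorch, a minimal sketch would split the weights and biases into two parameter groups and scale each group's learning rate by the corresponding multiplier. (torchvision's vgg16 is only a stand-in here for the actual Fast R-CNN network, and the weight decay of 0.0005 is an assumption mirroring the repo's solver settings; adapt both to your model.)

      import torch
      import torchvision

      model = torchvision.models.vgg16()  # stand-in for the actual network

      # Split parameters the way Caffe's param blocks do.
      weights = [p for n, p in model.named_parameters() if not n.endswith('bias')]
      biases  = [p for n, p in model.named_parameters() if n.endswith('bias')]

      base_lr = 0.001  # the paper's global learning rate

      optimizer = torch.optim.SGD(
          [
              # lr_mult: 1, decay_mult: 1 -> effective lr 0.001, full weight decay
              {'params': weights, 'lr': base_lr * 1, 'weight_decay': 0.0005},
              # lr_mult: 2, decay_mult: 0 -> effective lr 0.002, no weight decay
              {'params': biases,  'lr': base_lr * 2, 'weight_decay': 0.0},
          ],
          lr=base_lr,   # default for any group without its own 'lr'
          momentum=0.9,
      )

    This keeps a single global learning rate while still giving biases twice the step size, which is exactly what the paper's wording describes.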