I'm reading the Fast R-CNN paper.
In section 2.3, under 'SGD hyper-parameters', it says that "All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001."
Is 'per-layer learning rate' the same as a 'layer-specific learning rate', i.e. a different learning rate for each layer? If so, I can't understand how the 'per-layer learning rate' and the 'global learning rate' can be applied at the same time.
I found this example of 'layer-specific learning rates' in PyTorch:
optim.SGD([
    {'params': model.some_layers.parameters()},
    {'params': model.other_layers.parameters(), 'lr': 1}
], lr=1e-3, momentum=0.9)
According to the paper, is this the correct approach?
Sorry for my English.
The 'per-layer' terminology in that paper is slightly ambiguous; it does not refer to layer-specific learning rates.
All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001.
The statement refers to the Caffe framework, in which Fast R-CNN was originally implemented (github link).
It means they set the learning rate multipliers (lr_mult) of the weights and biases to 1 and 2, respectively.
Check any prototxt file in the repo, e.g. CaffeNet/train.prototxt:
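# first param block: the layer's weights (lr_mult 1, weight decay applied)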
param {
lr_mult: 1
decay_mult: 1
}
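# second param block: the layer's biases (lr_mult 2, no weight decay)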
param {
lr_mult: 2
decay_mult: 0
}
Thus, the effective learning rate of each parameter is base_lr * lr_mult, and here the base learning rate is 0.001, which is defined in solver.prototxt. So the weights are updated with 0.001 * 1 = 0.001 and the biases with 0.001 * 2 = 0.002.
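If you want to reproduce that behaviour in PyTorch, parameter groups are indeed the right tool, but the groups should split weights from biases rather than give each layer its own rate. Here is a minimal sketch; the model is just a stand-in, and the name-based split assumes bias parameters have names ending in 'bias', which holds for standard nn modules:

import torch.nn as nn
import torch.optim as optim

# Stand-in model; the grouping below works for any nn.Module.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

base_lr = 0.001  # the "global" learning rate (base_lr in solver.prototxt)

# Split parameters into weights and biases, mirroring Caffe's lr_mult.
weights = [p for n, p in model.named_parameters() if not n.endswith('bias')]
biases = [p for n, p in model.named_parameters() if n.endswith('bias')]

optimizer = optim.SGD(
    [
        {'params': weights, 'lr': base_lr * 1},  # lr_mult: 1 for weights
        {'params': biases, 'lr': base_lr * 2},   # lr_mult: 2 for biases
    ],
    lr=base_lr,        # default for any group that does not set its own 'lr'
    momentum=0.9,
)

Any group that does not set its own 'lr' falls back to the global lr=base_lr, which is how the "global" and "per-layer" rates in the paper's wording coexist.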