Tags: python, machine-learning, pytorch, conv-neural-network, scheduler

Why doesn't my PyTorch scheduler seem to work properly?


I'm trying to train a MobileNetV3-Large with a simple PyTorch scheduler. This is the portion of the code responsible for training:

bench_val_loss = 1000
bench_acc = 0.0
epochs = 15
optimizer = optim.Adam(embeddingNet.parameters(), lr=1e-3)
loss_optimizer = torch.optim.Adam(loss_fn.parameters(), lr=1e-3)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3, threshold=0.02)

for epoch in range(1, epochs + 1):

    print(f'current lr: {scheduler.get_last_lr()}')
    loss = train(embeddingNet, loss_fn, device, train_dataloader, optimizer, loss_optimizer, epoch)
    val_loss, accuracy = test(train_dataset, val_dataset, embeddingNet, accuracy_calculator, loss_fn, epoch, val_dataloader)
    #val_loss = simpleTest(train_dataset, val_dataset, embeddingNet, accuracy_calculator, loss_fn, epoch, val_dataloader)

    torch.save(embeddingNet.state_dict(), 'my/path/mobileNetV3L_ArcFaceLAST.pth')

    if accuracy >= bench_acc:
        bench_val_loss = val_loss
        torch.save(embeddingNet.state_dict(), 'my/path/mobileNetV3L_ArcFaceBEST.pth')

    scheduler.step(accuracy)

    writer.add_scalars('Training vs. Validation Loss',
                       {'Training': loss, 'Validation': val_loss},
                       global_step=epoch + 1)

And here are the training logs from the first epochs:

Test set accuracy (Precision@1) = 0.17834772304046048
current lr: [0.001]
Epoch 3: Loss = 39.68284225463867
Epoch 3: valLoss = 39.9765007019043
100%|██████████| 962/962 [01:43<00:00,  9.28it/s]
100%|██████████| 370/370 [00:41<00:00,  8.92it/s]
Computing accuracy
Test set accuracy (Precision@1) = 0.31242593533096324
current lr: [0.001]
Epoch 4: Loss = 39.4412841796875
Epoch 4: valLoss = 39.67761562450512
100%|██████████| 962/962 [01:45<00:00,  9.11it/s]
100%|██████████| 370/370 [00:41<00:00,  8.86it/s]
Computing accuracy
Test set accuracy (Precision@1) = 0.3633824276282377
current lr: [0.001]
Epoch 5: Loss = 39.09823989868164
Epoch 5: valLoss = 39.54649614901156
100%|██████████| 962/962 [01:42<00:00,  9.37it/s]
100%|██████████| 370/370 [00:41<00:00,  8.87it/s]
Computing accuracy
Test set accuracy (Precision@1) = 0.44244117149145085
current lr: [0.001]
Epoch 6: Loss = 38.70449447631836
Epoch 6: valLoss = 39.1865906792718
100%|██████████| 962/962 [01:45<00:00,  9.15it/s]
100%|██████████| 370/370 [00:39<00:00,  9.25it/s]
Computing accuracy
Test set accuracy (Precision@1) = 0.5167597765363129
current lr: [0.0001]

I can't figure out why the scheduler decided to reduce the learning rate even though the accuracy was increasing faster than the threshold.

Where is the error?


Solution

  • When you use ReduceLROnPlateau with mode='min', the learning rate is reduced when the monitored quantity stops decreasing. You are passing accuracy to scheduler.step(), and accuracy is a quantity you want to increase, so from the scheduler's point of view every epoch of rising accuracy counts as "no improvement"; after patience=3 such epochs it cut the learning rate, which is exactly what your logs show. Use mode='max' instead.
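
A minimal sketch of the corrected setup, reusing the names from your code (only mode changes; factor, patience, and threshold stay as before):

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='max',      # reduce the LR when accuracy stops increasing
    factor=0.1,
    patience=3,
    threshold=0.02,  # with the default threshold_mode='rel', "improvement" means accuracy > best * 1.02
)

...

scheduler.step(accuracy)  # rising accuracy now counts as improvement

Alternatively, keep mode='min' and call scheduler.step(val_loss), so the scheduler monitors the validation loss instead of the accuracy.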