I have trained a ResNet model and saved its weights to a .pt file as shown below.
## This is file 1 ##
model = resnet50()
optimizer = Adam(model.parameters(), eps=1e-08, lr = 0.001, weight_decay=1e-4, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()
scheduler = lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
train_model(model, criterion, optimizer, scheduler, num_epochs=num_epochs)
torch.save(model.state_dict(), 'myresnet.pt')
loss, acc, y_pred, y_true = test_model(model, criterion)
I trained a model and achieved a validation accuracy of 95%.
Then, I tested the model on a separate test set, where it achieved an accuracy of 93%.
After these steps, I closed my code files.
Later, I created a new, empty script and loaded the saved weights of the model (the .pt file) for further use.
## This is file 2 ##
model = models.resnet50()
state_dict = torch.load('myresnet.pt')
loss, acc, y_pred, y_true = test_model(model, criterion)
After loading the .pt file and evaluating on the test set only, the test accuracy dropped sharply to 20.6%.
What I tried
Initially, I suspected that the .pt file was corrupted, so I reran my code multiple times, but the situation remained unchanged.
I then copied all the code from file 2 and appended it to file 1, which gave the expected accuracy.
Why did this happen? Does it have something to do with the dataloader?
Below is my dataloader code.
batch_size = 4
image_size = [32, 32]
random_seed = int(time.time()//1000)
def random_ratio_3d(): return [randrange(0, 100)/100, randrange(0, 100)/100, randrange(0, 100)/100]
tmp_mean, tmp_std = random_ratio_3d(), random_ratio_3d()
#data_train_path = 'data/train/'
data_test_path = 'data/test/'
#train_dataset = ImageFolder(data_train_path, Compose([Resize(image_size), ToTensor(), Normalize(mean=tmp_mean, std=tmp_std)]))
test_dataset = ImageFolder(data_test_path, Compose([Resize(image_size), ToTensor(), Normalize(mean=tmp_mean, std=tmp_std)]))
#train_idx, valid_idx = train_test_split(list(range(len(train_dataset))), test_size=0.2, random_state=random_seed)
datasets = {}
#datasets['train'] = Subset(train_dataset, train_idx)
#datasets['valid'] = Subset(train_dataset, valid_idx)
datasets['test'] = test_dataset
dataloaders, batch_num = {}, {}
num_workers = 6 # half of cpu core number
#dataloaders['train'] = DataLoader(datasets['train'], batch_size=batch_size, shuffle=True, num_workers=num_workers)
#dataloaders['valid'] = DataLoader(datasets['valid'],batch_size=batch_size, shuffle=True, num_workers=num_workers)
dataloaders['test'] = DataLoader(datasets['test'], batch_size=batch_size, shuffle=True, num_workers=num_workers)
#batch_num['train'], batch_num['valid'], batch_num['test'] = len(dataloaders['train']), len(dataloaders['valid']), len(dataloaders['test'])
batch_num['test'] = len(dataloaders['test'])
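One thing worth checking in the dataloader code above: `tmp_mean` and `tmp_std` are drawn at random each time the script starts, so a fresh script will normalize the test images with different statistics than the run that trained the model. Within a single run (file 2 appended to file 1) the same values are reused. The following is a minimal sketch, using only the standard library, of storing the statistics next to the checkpoint and reading them back later. The helper names `save_norm_stats` and `load_norm_stats` and the file name `norm_stats.json` are assumptions made for illustration, not part of the original code.

```python
import json
import os
import tempfile
from random import randrange


def random_ratio_3d():
    # Same idea as the helper in the question: three random values in [0, 0.99].
    return [randrange(0, 100) / 100 for _ in range(3)]


def save_norm_stats(path, mean, std):
    # Hypothetical helper: write the normalization statistics to a JSON file.
    with open(path, "w") as f:
        json.dump({"mean": mean, "std": std}, f)


def load_norm_stats(path):
    # Hypothetical helper: read the statistics back in a later script.
    with open(path) as f:
        stats = json.load(f)
    return stats["mean"], stats["std"]


# "File 1": draw the statistics once and store them alongside the weights.
stats_path = os.path.join(tempfile.mkdtemp(), "norm_stats.json")
train_mean, train_std = random_ratio_3d(), random_ratio_3d()
save_norm_stats(stats_path, train_mean, train_std)

# "File 2": reuse the stored statistics instead of drawing new random ones.
test_mean, test_std = load_norm_stats(stats_path)
print(test_mean == train_mean, test_std == train_std)
```

The loaded `test_mean` and `test_std` could then be passed to `Normalize(mean=..., std=...)` in the second script so that both scripts preprocess images identically.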
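When using Adam, you should save the optimizer state along with the model state. Adam is an adaptive learning-rate method: it keeps running estimates of the gradient moments for each parameter and uses them to scale that parameter's updates, so this state is needed to resume training where it left off.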
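To illustrate what that per-parameter state is, here is a simplified sketch of the Adam update for a single scalar parameter in plain Python. It is not PyTorch's implementation (weight decay is left out, for example), and the names `adam_step` and `fresh_state` are made up for this example. It compares resuming with the saved state against resuming with a fresh one.

```python
import math


def adam_step(param, grad, state, lr=0.001, betas=(0.9, 0.999), eps=1e-08):
    """One simplified Adam update for a scalar parameter; mutates `state`."""
    beta1, beta2 = betas
    state["step"] += 1
    # Running estimates of the first and second moments of the gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction depends on the step count, which is also part of the state.
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def fresh_state():
    return {"step": 0, "m": 0.0, "v": 0.0}


grads = [0.5, 0.4, 0.3, 0.2]  # made-up gradients for illustration

# Uninterrupted training over all four steps.
param, state = 1.0, fresh_state()
for g in grads:
    param = adam_step(param, g, state)
continuous = param

# Train for two steps, then "checkpoint" the parameter and the optimizer state.
param, state = 1.0, fresh_state()
for g in grads[:2]:
    param = adam_step(param, g, state)
saved_param, saved_state = param, dict(state)

# Resume with the saved optimizer state.
param, state = saved_param, dict(saved_state)
for g in grads[2:]:
    param = adam_step(param, g, state)
resumed_with_state = param

# Resume with only the parameter, using a fresh optimizer state.
param, state = saved_param, fresh_state()
for g in grads[2:]:
    param = adam_step(param, g, state)
resumed_without_state = param

print(resumed_with_state == continuous)
print(resumed_without_state == continuous)
```

Resuming with the saved state reproduces the uninterrupted run, while resuming with a fresh state takes different steps because the moment estimates and step count start over.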
Can you try the following code?
## This is file 1 ##
model = resnet50()
optimizer = Adam(model.parameters(), eps=1e-08, lr = 0.001, weight_decay=1e-4, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()
scheduler = lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
train_model(model, criterion, optimizer, scheduler, num_epochs=num_epochs)
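# torch.save(model.state_dict(), 'myresnet.pt')
# Save a full checkpoint instead of only the model weights.
torch.save({
    'epoch': num_epochs,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,  # assumes the last training loss is available here
}, 'myresnet.pt')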
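# Load using this
checkpoint = torch.load('myresnet.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()  # for evaluation
# model.train() # to continue training
loss, acc, y_pred, y_true = test_model(model, criterion)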
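For a quick sanity check of the save/load round trip, here is a small self-contained sketch. It assumes a tiny `nn.Linear` model and random data in place of the ResNet and the real dataset, and the file name `checkpoint.pt` is chosen only for this example. It saves a checkpoint after one Adam step, restores it into a new model and optimizer, and compares the outputs.

```python
import os
import tempfile

import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(0)

# Stand-in model and data (assumptions; the question uses resnet50 and ImageFolder).
model = nn.Linear(4, 2)
optimizer = Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# One training step so that the optimizer has per-parameter state to save.
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Save model and optimizer state together.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss.item(),
}, path)

# "Second script": build fresh objects, then load the saved state into them.
restored = nn.Linear(4, 2)
restored_optimizer = Adam(restored.parameters(), lr=0.001)
checkpoint = torch.load(path)
restored.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
restored.eval()  # switch to evaluation mode before testing

with torch.no_grad():
    same_outputs = torch.equal(model(x), restored(x))
print(same_outputs)
```

Note that the restored model's outputs depend only on the model state dict being loaded with `load_state_dict`; the optimizer state comes into play if training continues from the checkpoint.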
Here's the tutorial this is based on, and here's an answer explaining why the optimizer state should be saved, especially when using the Adam optimizer.