I'm trying to fine-tune a PyTorch classification model to classify plant disease images. I have properly initialized the CUDA device and moved the model and the training, validation, and test data to it. However, during training the process uses 100% CPU and 0% GPU. Why is this happening?
import numpy as np
import torch
import torch.optim as optim
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, patience):
    train_losses, val_losses = [], []
    train_accuracies, val_accuracies = [], []
    best_val_loss = np.inf
    patience_counter = 0
    lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

    for epoch in range(num_epochs):
        model.train()
        running_loss, running_corrects, total = 0.0, 0, 0
        for inputs, labels in tqdm(train_loader):
            # Move the batch to the GPU (if available) before the forward pass
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
            _, preds = torch.max(outputs, 1)
            running_corrects += torch.sum(preds == labels.data)
            total += labels.size(0)

        epoch_loss = running_loss / total
        epoch_acc = running_corrects.double() / total
        train_losses.append(epoch_loss)
        train_accuracies.append(epoch_acc.item())

        val_loss, val_acc = evaluate_model(model, val_loader, criterion)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)

        print(f"Epoch {epoch}/{num_epochs-1}, Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # Checkpoint on improvement, otherwise count towards early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping")
                break

        lr_scheduler.step()

    return train_losses, train_accuracies, val_losses, val_accuracies
Here is my notebook: EfficientNet with Augmentation
Edit: I resized the dataset to 256x256 and reran the code. When I check my GPU usage, it surprisingly goes up and down periodically. As @kale-kundert said, there is a performance bottleneck in loading the data and applying the augmentation pipeline.
One issue I see is that you aren't using multiple processes to load training examples. If your program is spending way more time loading training examples (done by the CPU) than actually training on them (done by the GPU), that could explain the low GPU utilization.
To be more specific, each time your dataset loads an example, it has to parse a file and apply a bunch of data augmentations. It's hard to be sure by just reading the code, but both of those steps could be pretty expensive.
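If you want to confirm this, here's a rough timing sketch (assuming `model`, `train_loader`, `criterion`, `optimizer`, and `device` from your notebook are in scope) that splits one epoch's wall-clock time into "waiting on the DataLoader" versus "forward/backward on the GPU":

```python
import time
import torch

# Rough measurement: how much of an epoch is spent waiting on the DataLoader
# versus running the training step on the GPU.
data_time, compute_time = 0.0, 0.0
end = time.perf_counter()
for inputs, labels in train_loader:
    data_time += time.perf_counter() - end          # blocked on loading + augmentation
    start = time.perf_counter()
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()                    # CUDA is asynchronous; sync so the timing is real
    compute_time += time.perf_counter() - start
    end = time.perf_counter()

print(f"data loading: {data_time:.1f}s, training step: {compute_time:.1f}s")
```

If the first number dominates, the GPU is mostly sitting idle waiting for batches, which would match the utilization pattern you're seeing.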
If this is actually the problem, here are two possible ways to fix it. The first is to use multiple dataloader processes. This is really easy to do; just pass the `num_workers` argument to the `DataLoader` constructor. The only downside of this approach is that you need a lot of CPUs to make the most of it, and cloud providers might not give you very many. The second is to preload the entire dataset. This is only an option for relatively small datasets, but if that applies to you, and if you can cache the results so you don't have to redo the full preloading every time, this would probably be the fastest approach.