Tags: python, deep-learning, pytorch, gpu, kaggle

My PyTorch Model in Kaggle uses 100% CPU and 0% GPU During Training


I'm trying to fine-tune a PyTorch model to classify plant disease images. I have initialized the CUDA device and moved the model and the training, validation, and test data to that device. However, during training the process uses 100% CPU and 0% GPU. Why is this happening?

import numpy as np
import torch
import torch.optim as optim
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, patience):
    train_losses, val_losses = [], []
    train_accuracies, val_accuracies = [], []
    best_val_loss = np.inf
    patience_counter = 0
    lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

    for epoch in range(num_epochs):
        model.train()
        running_loss, running_corrects, total = 0.0, 0, 0

        for inputs, labels in tqdm(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)
            _, preds = torch.max(outputs, 1)
            running_corrects += torch.sum(preds == labels.data)
            total += labels.size(0)

        epoch_loss = running_loss / total
        epoch_acc = running_corrects.double() / total
        train_losses.append(epoch_loss)
        train_accuracies.append(epoch_acc.item())

        val_loss, val_acc = evaluate_model(model, val_loader, criterion)
        val_losses.append(val_loss)
        val_accuracies.append(val_acc)
        
        print(f"Epoch {epoch}/{num_epochs-1}, Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping")
                break

        lr_scheduler.step()

    return train_losses, train_accuracies, val_losses, val_accuracies
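
For reference, here is a quick sanity check, a minimal sketch that assumes `model`, `train_loader`, and `device` are defined as above, to confirm that the parameters and a batch really end up on the GPU:

print(torch.cuda.is_available())          # should print True on a GPU kernel
print(next(model.parameters()).device)    # expect cuda:0, not cpu

inputs, labels = next(iter(train_loader))
print(inputs.to(device).device)           # expect cuda:0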

Here is my notebook: EfficientNet with Augmentation

Edit: I resized the dataset to 256×256 and reran the code. Surprisingly, when I checked my GPU usage it was periodically going up and down. As @kale-kundert said, there is a performance bottleneck in loading the data and applying the augmentation pipeline.

[Screenshot: GPU utilization periodically rising and falling during training]


Solution

  • One issue I see is that you aren't using multiple processes to load training examples. If your program is spending way more time loading training examples (done by the CPU) than actually training on them (done by the GPU), that could explain the low GPU utilization.

    To be more specific, each time your dataset loads an example, it has to parse a file and apply a bunch of data augmentations. It's hard to be sure just by reading the code, but both of those steps could be pretty expensive; a rough way to measure this is sketched after this answer.

    If this is actually the problem, here are two possible ways to fix it (both sketched below). The first is to use multiple data-loader processes. This is really easy to do: just pass the num_workers argument to the DataLoader constructor. The only downside is that you need a lot of CPUs to make the most of it, and cloud providers might not give you very many. The second is to preload the entire dataset. This is only an option for relatively small datasets, but if that applies to you, and if you can cache the result so you don't have to redo the full preloading every time, it would probably be the fastest approach.
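
    A quick way to check where the time goes is to time the two phases separately for a few batches. This is a rough sketch, not code from the notebook; it assumes `train_loader`, `model`, `criterion`, `optimizer`, and `device` are defined as in the question:

    import time
    import torch

    data_time, compute_time = 0.0, 0.0
    end = time.perf_counter()
    for step, (inputs, labels) in enumerate(train_loader):
        data_time += time.perf_counter() - end        # time spent waiting on the loader

        t0 = time.perf_counter()
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                      # CUDA calls are async; sync before timing
        compute_time += time.perf_counter() - t0

        end = time.perf_counter()
        if step == 50:                                # a few dozen batches is enough
            break

    print(f"data loading: {data_time:.1f}s, GPU compute: {compute_time:.1f}s")

    If the data-loading time dominates, the fixes below should close most of the gap.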
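
    Here is roughly what both fixes look like. The worker count and batch size are placeholders, and `train_dataset` stands for whatever Dataset (with its augmentation pipeline) the notebook already builds:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Fix 1: parallel loading. Match num_workers to the CPUs the kernel actually
    # provides (os.cpu_count()); pin_memory speeds up host-to-GPU copies.
    train_loader = DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        persistent_workers=True,   # keep workers alive between epochs
    )

    # Fix 2: preload (only if the resized dataset fits in RAM). Decode every image
    # once and train from in-memory tensors; note that random augmentations would
    # then have to be applied to the tensors inside the training loop instead.
    all_images = torch.stack([train_dataset[i][0] for i in range(len(train_dataset))])
    all_labels = torch.tensor([train_dataset[i][1] for i in range(len(train_dataset))])
    train_loader = DataLoader(TensorDataset(all_images, all_labels),
                              batch_size=32, shuffle=True)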