Tags: python, machine-learning, deep-learning, pytorch

Why am I facing "CUDA error: device-side assert triggered" while training an LSTM model?


I am learning PyTorch and deep learning, and I am working on a dataset containing 4 features. My problem is multiclass classification with 9 possible outputs, labeled 1 to 9.

The features are: Gene, which is categorical; Variation, which is categorical; and Text, which is textual data. My LSTM model has two embedding layers for the categorical data and one for the textual data, plus one LSTM with num_layers=1 (for testing only).

I have converted my textual data to a numerical representation and encoded the categorical data using LabelEncoder().
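
For reference, the encoding step is roughly this (a simplified sketch; the DataFrame and column names are illustrative, not my exact code):

from sklearn.preprocessing import LabelEncoder

# Sketch of the encoding step; df, "Gene" and "Variation" are placeholder names
gene_encoder = LabelEncoder()
variation_encoder = LabelEncoder()
df["Gene"] = gene_encoder.fit_transform(df["Gene"])
df["Variation"] = variation_encoder.fit_transform(df["Variation"])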

I am using a DataLoader to load the data in batches, with a collate_fn() that truncates (because the texts are too long) and pads each batch.

Since my problem is multiclass classification, I am using torch.nn.CrossEntropyLoss(weight=class_weights) as the loss function and Adam as the optimizer.

As I said, the texts are too long, so my collate_fn() takes a batch as input (each sample in the batch is already converted to its numerical representation), checks whether each text is longer than 1500 tokens, truncates it if so, and then performs padding.
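
Roughly, the collate function does this (a simplified sketch, not my exact code):

import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 1500  # truncation length

def collate_fn(batch):
    # Each sample is already numericalized: (text_ids, gene_id, variation_id, label)
    texts, genes, variations, labels = zip(*batch)
    texts = [t[:MAX_LEN] for t in texts]  # truncate texts longer than 1500 tokens
    texts = pad_sequence(texts, batch_first=True, padding_value=0)  # pad to the batch max length
    return texts, torch.stack(genes), torch.stack(variations), torch.stack(labels)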

I have an RTX 3050 with 4 GB of VRAM, which is why I decided to truncate: earlier I was getting a CUDA out-of-memory error on the very first forward pass, i.e. in:

outputs = model(text_input.long(), gene_input.long(), variance_input.long())

I trained my model for only 1 epoch. Training goes well (I mean, no errors), but during validation I faced the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 58
     55 print(type(labels))
     57 outputs = model(text_input.long(), gene_input.long(), variance_input.long())
---> 58 print(outputs)
     59 print(outputs.shape)
     60 print(type(outputs))

File u:\nlp_project\Personalized-Medicine-Redefining-Cancer-Treatment\venv\lib\site-packages\torch\_tensor.py:568, in Tensor.__repr__(self, tensor_contents)
    564     return handle_torch_function(
    565         Tensor.__repr__, (self,), self, tensor_contents=tensor_contents
    566     )
    567 # All strings are unicode in Python 3.
--> 568 return torch._tensor_str._str(self, tensor_contents=tensor_contents)

File u:\nlp_project\Personalized-Medicine-Redefining-Cancer-Treatment\venv\lib\site-packages\torch\_tensor_str.py:704, in _str(self, tensor_contents)
    702 with torch.no_grad(), torch.utils._python_dispatch._disable_current_modes():
    703     guard = torch._C._DisableFuncTorch()
--> 704     return _str_intern(self, tensor_contents=tensor_contents)

File u:\nlp_project\Personalized-Medicine-Redefining-Cancer-Treatment\venv\lib\site-packages\torch\_tensor_str.py:621, in _str_intern(inp, tensor_contents)
    619                     tensor_str = _tensor_str(self.to_dense(), indent)
    620                 else:
--> 621                     tensor_str = _tensor_str(self, indent)
...
    151         return

RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

As you can see, the error is raised at print(outputs). It is not consistent during validation: sometimes I hit it early, sometimes after completing some percentage of the validation loop, but it always occurs on statements that use the outputs variable.

I am sharing my model and training code below:

MODEL:

import torch
import torch.nn as nn
import torch.optim as optim

class MultiClassLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, gene_size, variance_size, gene_emb_dim, variance_emb_dim):
        super(MultiClassLSTM, self).__init__()

        # Text feature embedding + LSTM
        self.text_embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=hidden_dim, num_layers=1, batch_first=True)

        # Categorical feature embeddings
        self.gene_embedding = nn.Embedding(gene_size, gene_emb_dim)
        self.variance_embedding = nn.Embedding(variance_size, variance_emb_dim)
        # Fully connected layer for classification
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim + gene_emb_dim + variance_emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, text_input, gene_input, variance_input):
        # Process text input through embedding and LSTM
        text_embedded = self.text_embedding(text_input)
        lstm_out, _ = self.lstm(text_embedded)
        lstm_out = lstm_out[:, -1, :]  # Take the output at the last time step

        # Process categorical inputs through embeddings
        gene_embedded = self.gene_embedding(gene_input).squeeze(1)
        variance_embedded = self.variance_embedding(variance_input).squeeze(1)

        # Concatenate all features
        combined = torch.cat((lstm_out, gene_embedded, variance_embedded), dim=1)

        # Classification output
        output = self.fc(combined)
        return output


import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Model initialization
model = MultiClassLSTM(vocab_size, embed_dim, hidden_dim, num_classes, gene_size, variance_size, gene_emb_dim, variance_emb_dim)


y_full_np = np.concatenate([y_train, y_test, y_val])  # Full dataset labels
unique_classes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
class_weights = compute_class_weight(class_weight="balanced", classes=unique_classes, y=y_full_np)
class_weights = torch.tensor(class_weights, dtype=torch.float32, device=device)

# Define loss function with class weights
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

optimizer = optim.Adam(model.parameters(), lr=0.001)

optimizer.zero_grad()

TRAINING:

import os
import gc
from tqdm import tqdm

num_epochs = 1
train_losses = []
val_losses = []
os.environ["TORCH_USE_CUDA_DSA"] = "1"
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:2024"

model.to(device)
for epoch in range(num_epochs):
    # torch.cuda.empty_cache()
    model.train()  # Set model to training mode
    total_train_loss = 0

    for batch in tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs} [Training]"):
        text_input, gene_input, variance_input, labels = batch

        # Move to device (if using GPU)
        text_input = text_input.to(device)
        gene_input = gene_input.to(device)
        variance_input = variance_input.to(device)
        labels = labels.to(device)  # Labels should be integer class indices

        # print(text_input.device, gene_input.device, variance_input.device, labels.device)

        optimizer.zero_grad()  # Clear previous gradients
        
        outputs = model(text_input.long(), gene_input.long(), variance_input.long())

        # Compute Log Loss
        loss = criterion(outputs, labels)

        # Backward pass
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

    # Compute average training loss
    avg_train_loss = total_train_loss / len(train_dataloader)
    train_losses.append(avg_train_loss)

    # ================== Validation Phase ==================
    model.eval()  # Set model to evaluation mode
    total_val_loss = []

    with torch.no_grad():  # No gradient calculation during validation
        for batch in tqdm(validation_dataloader, desc=f"Epoch {epoch+1}/{num_epochs} [Validation]"):
            text_input, gene_input, variance_input, labels = batch
            text_input = text_input.to(device)
            gene_input = gene_input.to(device)
            variance_input = variance_input.to(device)
            labels = labels.to(device)
            print(labels)
            print(labels.shape)
            print(type(labels))

            outputs = model(text_input.long(), gene_input.long(), variance_input.long())
            print(outputs)
            print(outputs.shape)
            print(type(outputs))
            loss = criterion(outputs, labels)
            print(loss)          
            total_val_loss.append(loss.item())
            gc.collect()
            torch.cuda.empty_cache()
            print("----------------")

    avg_val_loss = sum(total_val_loss) / len(validation_dataloader)
    val_losses.append(avg_val_loss)

    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

# Store losses for future use
torch.save({'train_loss': train_losses, 'val_loss': val_losses}, 'losses.pth')

I added some print statements to check whether the shape or dtype was causing the problem (I have since deleted that code), and I also tested whether the outputs contained NaN or inf values caused by the learning rate, but that didn't help. I saw some similar problems on the PyTorch forums as well, but didn't understand the answers.

Thanks in advance. I hope to hear from you soon.


Solution

  • So, a couple of things. First, you should understand the error: a device-side assert is triggered either when you are producing NaN or inf values (which are obviously a problem), or when you are passing invalid target values, e.g. class indices that are out of bounds or tensors with incorrect shapes. Also note that CUDA runs asynchronously, so the error often surfaces at a later, unrelated line (like your print(outputs)) rather than at the operation that actually failed. In your case, check the targets first: you said your labels run from 1 to 9, but nn.CrossEntropyLoss over 9 classes expects indices 0 to 8, so the labels must be shifted into that range.
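
    A quick sanity check you can drop in right before the loss call (a sketch; num_classes is 9 here):

    # Targets for CrossEntropyLoss must lie in [0, num_classes - 1].
    # If your raw labels run 1..9, shift them once: labels = labels - 1
    assert labels.min() >= 0 and labels.max() < num_classes, \
        f"labels out of range: min={labels.min().item()}, max={labels.max().item()}"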

    Some other things to check, in no particular order. First, whether the outputs contain NaN or inf values:

    if torch.isnan(outputs).any() or torch.isinf(outputs).any():
        print("Found NaN or Inf")
    

    If this print is triggered, try playing with the hyperparameters: lower the learning rate, clip gradients, and so on.
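
    For gradient clipping, a standard pattern (a sketch, not specific to your model) is:

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm before stepping
    optimizer.step()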

    Failing all that, try running the model on the CPU; you may get a more insightful error message, because CPU kernels raise the error at the exact line that fails instead of asynchronously.
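
    Two ways to get a readable trace (a sketch; the environment variable must be set before CUDA is initialized):

    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make GPU errors surface at the failing line

    device = torch.device("cpu")  # or simply move everything to CPU
    model.to(device)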