
LSTM: calculating MSELoss in a for loop returns NaN after the backward pass


I am new to LSTMs and have run into a problem. I'm trying to predict a variable from 7 features over time steps of 4. I am working with PyTorch.

Data

From my initial data frame (traindf), I created tensors for every feature and the target (Y) by:

import torch

# 'test' is the index at which the data frame is split into train and test
featureX_train = torch.tensor(traindf.featureX[:test].values).view(-1, 4, 1)
Y_train = torch.tensor(traindf.Y[:test].values).view(-1, 4, 1)
...
featureX_test = torch.tensor(traindf.featureX[test:].values).view(-1, 4, 1)
Y_test = torch.tensor(traindf.Y[test:].values).view(-1, 4, 1)

I concatenated all the feature tensors into one X_train and one X_test (a sketch of that step follows the shape check below). All tensors are float32:

print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape) 
torch.Size([24436, 4, 7]) torch.Size([24436, 4, 1])
torch.Size([6109, 4, 7]) torch.Size([6109, 4, 1])
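The concatenation itself looks roughly like this (the per-feature tensor names here are placeholders):

# Stack the seven (N, 4, 1) feature tensors along the last
# dimension to get a single (N, 4, 7) input tensor
X_train = torch.cat(
    [feature1_train, feature2_train, feature3_train, feature4_train,
     feature5_train, feature6_train, feature7_train],
    dim=2)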

Eventually, I have train and test datasets:

from torch.utils.data import TensorDataset

train_dataset = TensorDataset(X_train, Y_train)
test_dataset = TensorDataset(X_test, Y_test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
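A quick check of what one batch from the loader looks like (shapes only):

X, Y = next(iter(train_loader))
print(X.shape, Y.shape)
# torch.Size([32, 4, 7]) torch.Size([32, 4, 1])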

Preview of my data:

print(train_dataset[0])
print(test_dataset[0])
(tensor([[ 7909.0000,  8094.0000,  9119.0000,  8666.0000, 17599.0000, 13657.0000,
         10158.0000],
        [ 7909.0000,  8073.0000,  9119.0000,  8636.0000, 17609.0000, 13975.0000,
         10109.0000],
        [ 7939.5000,  8083.5000,  9166.5000,  8659.5000, 18124.5000, 13971.0000,
         10142.0000],
        [ 7951.0000,  8064.0000,  9201.0000,  8663.0000, 17985.0000, 13967.0000,
         10076.0000]]), tensor([[41.],
        [41.],
        [41.],
        [41.]]))
(tensor([[ 8411.0000,  8530.0000,  9439.0000,  9101.0000, 17368.0000, 14174.0000,
         11111.0000],
        [ 8460.0000,  8651.5000,  9579.5000,  9355.5000, 17402.0000, 14509.0000,
         11474.5000],
        [ 8436.0000,  8617.0000,  9579.0000,  9343.0000, 17318.0000, 14288.0000,
         11404.0000],
        [ 8519.0000,  8655.0000,  9580.0000,  9348.0000, 17566.0000, 14640.0000,
         11404.0000]]), tensor([[59.],
        [59.],
        [59.],
        [59.]]))

Applying the LSTM model

My LSTM model:

import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # batch_first=True so the LSTM expects (batch, seq, feature) input,
        # matching the (N, 4, 7) shape of X_train
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x, _ = self.lstm(x)               # (batch, 4, hidden_size)
        # x = self.linear(x[:, -1, :])    # would predict from the last time step only
        x = self.linear(x)                # prediction at every time step: (batch, 4, 1)
        return x

model = LSTMModel(input_size=7, hidden_size=32, output_size=1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
  
model.train()

When I try:

for X, Y in train_loader:
    optimizer.zero_grad()
    
    Y_pred = model(X)
    
    loss = loss_fn(Y_pred, Y)   # forward pass only, no backward/step

print(loss)   # loss of the last batch

I get (correctly, I assume) tensor(1318.9419, grad_fn=<MseLossBackward0>)

However, when I run:

for X, Y in train_loader:
    optimizer.zero_grad()
    
    Y_pred = model(X)

    loss = loss_fn(Y_pred, Y)
    
    # Now apply backward pass
    loss.backward()
    
    optimizer.step()

print(loss)

I get: tensor(nan, grad_fn=<MseLossBackward0>)

Tried normalizing

I have tried normalizing the data:

mean = X.mean()
std = X.std()
X_normalized = (X - mean) / std

Y_pred = model(X_normalized)

But it yields the same result. Why do I get 'nan' after applying loss.backward() in such a loop, and how can I fix this? Thanks in advance!
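(As an aside: the statistics above are computed per batch and over all features at once. Computed once over the whole training set and per feature, a sketch would look like the following. This alone cannot fix the NaNs, though, since the mean and std of data containing NaN are themselves NaN.)

mean = X_train.mean(dim=(0, 1), keepdim=True)   # per-feature statistics, shape (1, 1, 7)
std = X_train.std(dim=(0, 1), keepdim=True)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std                  # reuse the training-set statistics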


Solution

  • My X_train contained a few NaN values. Any NaN in the input propagates through the forward pass, so the loss and gradients become NaN, and after the first optimizer.step() the weights themselves turn NaN; this is also why normalizing didn't help, since the mean and std of data containing NaN are NaN as well. By removing the samples with NaN values, I solved the issue:

    # True for every sample (matrix) that contains at least one NaN
    mask = torch.isnan(X_train).any(dim=1).any(dim=1)
    X_train = X_train[~mask]
    
    # Drop the same samples from Y_train so the sizes still match
    Y_train = Y_train[~mask]
    
    # Recreate the TensorDataset for the training set
    train_dataset = TensorDataset(X_train, Y_train)
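    A quick way to confirm the diagnosis (run this before the masking above):
    
    print(torch.isnan(X_train).any())   # tensor(True) if any NaN is present
    print(mask.sum())                   # number of samples that get dropped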