python · pytorch · pytorch-dataloader

PyTorch DataLoader adding a batch dimension


I think this question has already been asked a few times, but I have yet to find a good answer here.

So I have a PyTorch Dataset built from two NumPy arrays.

The following are the dimensions.

features = [10000, 450, 28] NumPy array. dim_0 = number of samples, dim_1 = time series, dim_2 = features. Basically I have data that is 450 frames long, where each frame contains 28 features, and I have 10000 samples.

label = [10000, 450] NumPy array. dim_0 = number of samples, dim_1 = the label for each frame.

The assignment is that I need to do a classification for each frame.
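For reference, dummy arrays with these shapes can be created like this (the random values are just placeholders for my real data, and the binary labels are only an example):

import numpy as np

features = np.random.randn(10000, 450, 28).astype(np.float32)  # [samples, frames, features]
label = np.random.randint(0, 2, size=(10000, 450))             # one class label per frame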

I created a custom PyTorch Dataset and DataLoader with the following code.

import torch
from torch.utils.data import DataLoader

label_length = label.size
label = torch.from_numpy(label)
features = torch.from_numpy(features)

train_dataset = Dataset(label, features, label_length)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

As expected, train_dataloader.dataset.data returns a tensor of size [10000, 450, 28]. Great! Now I just need to take batches of the 10000 samples and loop! So I run the code below - assume that the optimizer/loss function are all set.

train_loss = 0
EPOCHS = 3
for epoch_idx in range(EPOCHS):
    for i, data in enumerate(train_dataloader):
        inputs, labels = data
        print(inputs.size())
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

But I get this error:

ValueError: LSTM: Expected input to be 2D or 3D, got 4D instead

When I checked the dimensions of inputs, they were [64 x 10000 x 450 x 28]

Why does the DataLoader add this batch dimension? (I understand from the documentation that it is supposed to, but I thought it would take 64 samples out of the 10000, create a batch, and loop over each batch.)

I think I am making a mistake somewhere but cannot pinpoint what I am doing wrong...
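For example, if I slice the features tensor myself, I get the batch shape I was expecting:

print(features[:64].size())  # torch.Size([64, 450, 28])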

EDIT: This is my simple Dataset class

class Dataset(torch.utils.data.Dataset):
    def __init__(self, label, data, length):
        self.labels = label
        self.data = data
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # need to create tensor
        #data = torch.from_numpy(self.data)
        #labels = torch.from_numpy(self.labels).type(torch.LongTensor)
        data = self.data
        labels = self.labels
        return data, labels

Solution

  • The DataLoader adds a batch dimension; that is one of its purposes, and most PyTorch functions/layers expect a batched input anyway.

    The issue is your __getitem__ method, which should return only ONE sample, not the whole dataset. You need to use the idx argument and return something like data[idx, :, :] and labels[idx, :] instead of the whole data and labels tensors. Otherwise each batch will contain 64 copies of the whole dataset instead of 64 samples, which is of course not what you want (a corrected sketch is shown below).

    By the way, you probably do not need the length argument, as it is already contained in the shape of your labels or data (it is the first dimension, 10000).
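    A minimal sketch of the corrected class, keeping the names and argument order from the question and dropping the now-unneeded length argument:

    import torch
    from torch.utils.data import DataLoader

    class Dataset(torch.utils.data.Dataset):
        def __init__(self, label, data):
            self.labels = label
            self.data = data

        def __len__(self):
            # the number of samples is the first dimension (10000)
            return self.data.shape[0]

        def __getitem__(self, idx):
            # return ONE sample: [450, 28] features and [450] labels;
            # the DataLoader stacks 64 of these into [64, 450, 28] and [64, 450]
            return self.data[idx], self.labels[idx]

    With this change, the training loop from the question receives inputs of size [64, 450, 28], the 3D batched shape the model expects, assuming its LSTM uses batch_first=True:

    train_dataset = Dataset(label, features)
    train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

    inputs, labels = next(iter(train_dataloader))
    print(inputs.size())  # torch.Size([64, 450, 28])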