I'm trying to train my network model and I need a dataloader for that. The dataloader is returning different results in the first iteration over the dataset compared to the second iteration. How is this possible?
Code:
from Pointcloud.Modules.FileDataset import FileDataset
from Pointcloud.Modules import Config as config
dm = FileDataset(config.DATA_DIR, split_name=config.SPLIT_NAME, split=config.SPLIT)
train_dl = dm.train_dataloader(config.BATCH_SIZE, config.NUM_WORKERS)
for i in range(2):
    sum = 0
    for j, minibatch in enumerate(train_dl):
        if j < 8:  # Only showing the first 8 iterations to show the problem.
            print(j, minibatch.batch.unique(return_counts=True))
        sum += 1
    print(f"Iteration {i}\nNumber of iterations: {sum}\nNumber of batches: {len(train_dl.dataset) / config.BATCH_SIZE}")
Output:
train bs: 25008
val bs: 8335
test bs: 8339
0 (tensor([0], device='cuda:0'), tensor([756], device='cuda:0'))
1 (tensor([0], device='cuda:0'), tensor([756], device='cuda:0'))
2 (tensor([0], device='cuda:0'), tensor([774], device='cuda:0'))
3 (tensor([0], device='cuda:0'), tensor([787], device='cuda:0'))
4 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([39, 60, 44, 54, 52, 50, 51, 48, 36, 54, 61, 51, 51, 56, 55, 48],
device='cuda:0'))
5 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([47, 54, 56, 50, 51, 26, 75, 48, 43, 52, 55, 53, 35, 38, 51, 43],
device='cuda:0'))
6 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([47, 62, 53, 50, 54, 53, 52, 48, 57, 50, 53, 43, 53, 47, 59, 56],
device='cuda:0'))
7 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([52, 39, 43, 49, 52, 40, 49, 52, 52, 72, 57, 36, 28, 53, 50, 52],
device='cuda:0'))
Iteration 0
Number of iterations: 1563
Number of batches: 1563.0
0 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([43, 46, 50, 52, 53, 49, 63, 50, 36, 49, 53, 51, 40, 51, 56, 46],
device='cuda:0'))
1 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([50, 64, 44, 55, 54, 53, 53, 50, 49, 44, 52, 51, 47, 38, 54, 45],
device='cuda:0'))
2 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([44, 52, 49, 53, 49, 48, 52, 51, 44, 51, 42, 49, 53, 61, 46, 50],
device='cuda:0'))
3 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([59, 50, 40, 47, 33, 54, 59, 63, 53, 46, 44, 40, 49, 52, 53, 36],
device='cuda:0'))
4 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([54, 53, 49, 55, 54, 46, 28, 43, 45, 46, 51, 60, 49, 48, 54, 48],
device='cuda:0'))
5 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([54, 44, 59, 50, 45, 50, 35, 40, 54, 47, 39, 52, 52, 39, 50, 55],
device='cuda:0'))
6 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([49, 51, 42, 40, 51, 46, 48, 52, 45, 52, 38, 43, 50, 44, 49, 41],
device='cuda:0'))
7 (tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
device='cuda:0'), tensor([49, 54, 61, 52, 48, 50, 36, 47, 49, 56, 41, 43, 52, 52, 54, 39],
device='cuda:0'))
Iteration 1
Number of iterations: 1563
Number of batches: 1563.0
The dataset is a list of PyTorch Geometric Data objects; the first five elements are:
[Data(x=[59, 8], edge_index=[2, 357], y=[1, 3]),
Data(x=[50, 8], edge_index=[2, 292], y=[1, 3]),
Data(x=[52, 8], edge_index=[2, 324], y=[1, 3]),
Data(x=[56, 8], edge_index=[2, 362], y=[1, 3]),
Data(x=[44, 8], edge_index=[2, 256], y=[1, 3])]
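For context on the output above: PyG's DataLoader collates a list of Data objects into a single Batch whose `batch` vector assigns each node to its source graph, which is what `minibatch.batch.unique(return_counts=True)` inspects. Using only plain PyTorch (the node counts are taken from the first three Data objects above), the expected shape of that output can be sketched as:

```python
import torch

# Node counts of the first three example graphs in the dataset above
num_nodes = [59, 50, 52]

# A collated Batch assigns every node the index of its source graph,
# i.e. 59 zeros, then 50 ones, then 52 twos
batch = torch.cat([torch.full((n,), i) for i, n in enumerate(num_nodes)])

print(batch.unique(return_counts=True))
# (tensor([0, 1, 2]), tensor([59, 50, 52]))
```

So a healthy minibatch of 16 graphs should show 16 distinct batch indices, as in iterations 4-7 above — not a single index 0 covering all ~756 nodes.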
The dataloader is created by the following function:
def train_dataloader(self, batch_size, num_workers):
    return tg_loader_DataLoader(
        dataset=self.train_ds,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        persistent_workers=True,
        drop_last=True,
    )
Can someone help me understand why the dataloader is not loading correctly? In the first pass it merges the graphs of a minibatch, assigning batch index 0 to every node. This produces nonsense data that crashes the forward pass through the model, and it also loses training data, since some samples are overwritten :(
As described in the problem above, the tensors in my dataset were stored on the GPU. As the second warning on this page explains, it is recommended to keep your dataset on the CPU and let the DataLoader move it to the GPU.
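A minimal sketch of the recommended pattern, using plain PyTorch tensors as a hypothetical stand-in for the PyG Data objects (the tensors `x` and `y` and all loader settings here are illustrative, not taken from the original code): keep the dataset on the CPU so worker processes can handle it safely, and move each minibatch to the GPU inside the training loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: the real dataset holds PyG Data objects,
# but the principle is the same -- the stored tensors stay on the CPU.
x = torch.randn(100, 8)             # created on the CPU, never moved to CUDA
y = torch.randint(0, 3, (100,))

loader = DataLoader(
    TensorDataset(x, y),
    batch_size=16,
    shuffle=True,
    num_workers=2,           # workers can now safely serve CPU tensors
    pin_memory=True,         # speeds up the host-to-device copy below
    persistent_workers=True,
    drop_last=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for xb, yb in loader:
    # Move each minibatch to the GPU here, instead of storing
    # the whole dataset on the device up front.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
```

With this layout, `persistent_workers=True` and `num_workers > 0` no longer interact badly with device-resident tensors, and both passes over the loader yield consistently collated batches.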