If we use a combination of the Dataset and DataLoader classes (as shown below), I have to explicitly load the data onto the GPU using .to() or .cuda(). Is there a way to instruct the DataLoader to do it automatically/implicitly?
Code to understand/reproduce the scenario:
from torch.utils.data import Dataset, DataLoader
import numpy as np

class DemoData(Dataset):
    def __init__(self, limit):
        super(DemoData, self).__init__()
        self.data = np.arange(limit)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return (self.data[idx], self.data[idx]*100)

demo = DemoData(100)
loader = DataLoader(demo, batch_size=50, shuffle=True)

for i, (i1, i2) in enumerate(loader):
    print('Batch Index: {}'.format(i))
    print('Shape of data item 1: {}; shape of data item 2: {}'.format(i1.shape, i2.shape))
    # i1, i2 = i1.to('cuda:0'), i2.to('cuda:0')
    print('Device of data item 1: {}; device of data item 2: {}\n'.format(i1.device, i2.device))
This will output the following; note that without an explicit device-transfer instruction, the data is loaded onto the CPU:
Batch Index: 0
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
Batch Index: 1
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
A possible solution is in this PyTorch GitHub issue (still open at the time this question was posted), but I am unable to make it work when the dataloader has to return multiple data items!
You can modify the collate_fn to handle several items at once:
import torch
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda:0')  # or whatever device/cpu you like

# the new collate function is quite generic
loader = DataLoader(demo, batch_size=50, shuffle=True,
                    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
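With this in place, the loop from the question reports both items of each batch already on the GPU; a quick check, assuming a CUDA device is available:

for i, (i1, i2) in enumerate(loader):
    print('Batch Index: {}'.format(i))
    print('Device of data item 1: {}; device of data item 2: {}'.format(i1.device, i2.device))
    # prints cuda:0 for both items; no explicit .to() call needed in the loop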
Note that if you want to have multiple workers for the dataloader, you'll need to add torch.multiprocessing.set_start_method('spawn') after your if __name__ == '__main__' guard (see this issue).
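A minimal sketch of that multi-worker setup, reusing the DemoData dataset from the question; to_device_collate is a name introduced here for illustration, written as a module-level function rather than a lambda so the spawned worker processes can pickle it, and num_workers=2 is an arbitrary choice:

import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda:0')

def to_device_collate(batch):
    # collate in the worker process, then move every item of the batch to the GPU
    return tuple(x.to(device) for x in default_collate(batch))

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    loader = DataLoader(DemoData(100), batch_size=50, shuffle=True,
                        num_workers=2, collate_fn=to_device_collate)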
Having said that, it seems like using pin_memory=True in your DataLoader would be much more efficient. Have you tried this option? See memory pinning for more information.
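A sketch of that alternative; pinned host memory also lets the host-to-device copy run asynchronously via non_blocking=True:

loader = DataLoader(demo, batch_size=50, shuffle=True, pin_memory=True)

for i1, i2 in loader:
    # with pinned memory, the copy can overlap with GPU computation
    i1 = i1.to('cuda:0', non_blocking=True)
    i2 = i2.to('cuda:0', non_blocking=True)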
Update (Feb 8th, 2021)
This post made me look at my "data-to-model" time spent during training.
I compared three alternatives:
1. DataLoader works on CPU and only after the batch is retrieved is the data moved to the GPU.
2. Same as (1), but with pin_memory=True in the DataLoader.
3. Using a custom collate_fn to move the data to the GPU.

From my limited experimentation it seems like the second option performs best (but not by a big margin).
The third option required fussing about the start_method of the data loader processes, and it seems to incur an overhead at the beginning of each epoch.
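For reference, a minimal sketch of how such a comparison can be timed; time_epoch is a hypothetical helper, not the original experiment, and torch.cuda.synchronize() is needed because GPU copies complete asynchronously:

import time
import torch

def time_epoch(loader, device=torch.device('cuda:0')):
    torch.cuda.synchronize()
    start = time.time()
    for i1, i2 in loader:
        if i1.device != device:  # skip the copy if the collate_fn already moved the batch
            i1, i2 = i1.to(device), i2.to(device)
    torch.cuda.synchronize()
    return time.time() - start

print('plain:  {:.4f}s'.format(time_epoch(DataLoader(demo, batch_size=50))))
print('pinned: {:.4f}s'.format(time_epoch(DataLoader(demo, batch_size=50, pin_memory=True))))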