python-3.x, pytorch, gpu, pytorch-dataloader

PyTorch: while loading batched data using DataLoader, how to transfer the data to GPU automatically


When using a combination of the Dataset and DataLoader classes (as shown below), I have to explicitly move the data to the GPU using .to() or .cuda(). Is there a way to instruct the DataLoader to do it automatically/implicitly?

Code to understand/reproduce the scenario:

from torch.utils.data import Dataset, DataLoader
import numpy as np

class DemoData(Dataset):
    def __init__(self, limit):
        super(DemoData, self).__init__()
        self.data = np.arange(limit)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # return two related items per index to mimic a multi-item loader
        return (self.data[idx], self.data[idx]*100)

demo = DemoData(100)

loader = DataLoader(demo, batch_size=50, shuffle=True)

for i, (i1, i2) in enumerate(loader):
    print('Batch Index: {}'.format(i))
    print('Shape of data item 1: {}; shape of data item 2: {}'.format(i1.shape, i2.shape))
    # explicit device transfer that I would like to avoid:
    # i1, i2 = i1.to('cuda:0'), i2.to('cuda:0')
    print('Device of data item 1: {}; device of data item 2: {}\n'.format(i1.device, i2.device))

This will output the following; note that without an explicit device-transfer instruction, the data stays on the CPU:

Batch Index: 0
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu

Batch Index: 1
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu

A possible solution is discussed in this PyTorch GitHub issue (still open at the time this question was posted), but I am unable to make it work when the DataLoader has to return multiple data items!


Solution

  • You can modify the collate_fn to handle several items at once:

    import torch
    from torch.utils.data.dataloader import default_collate
    
    device = torch.device('cuda:0')  # or whatever device/cpu you like
    
    # the new collate function is quite generic:
    # collate on CPU as usual, then move each resulting tensor to the device
    loader = DataLoader(demo, batch_size=50, shuffle=True,
                        collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
    

    Note that if you want to have multiple workers for the DataLoader, you'll need to add

    torch.multiprocessing.set_start_method('spawn')
    

    under your if __name__ == '__main__' guard (see this issue); a sketch of that setup follows.
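
    For example, a minimal sketch of that setup, assuming the DemoData dataset from the question. It uses a module-level function instead of the lambda above, because spawned worker processes have to pickle the collate_fn:

    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.dataloader import default_collate

    device = torch.device('cuda:0')

    def collate_to_device(batch):
        # collate on CPU first, then move every tensor of the batch to the GPU
        return tuple(x.to(device) for x in default_collate(batch))

    if __name__ == '__main__':
        # the start method must be set before any worker processes are created
        torch.multiprocessing.set_start_method('spawn')
        loader = DataLoader(demo, batch_size=50, shuffle=True,
                            num_workers=2, collate_fn=collate_to_device)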

    Having said that, it seems like using pin_memory=True in your DataLoader would be much more efficient. Have you tried this option?
    See memory pinning for more information.
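
    For reference, a minimal sketch of the pin_memory route (option 2 in the update below), again assuming the DemoData dataset from the question:

    loader = DataLoader(demo, batch_size=50, shuffle=True, pin_memory=True)

    for i1, i2 in loader:
        # non_blocking=True lets the host-to-device copy overlap with computation,
        # which works because the batch already sits in pinned (page-locked) memory
        i1 = i1.to('cuda:0', non_blocking=True)
        i2 = i2.to('cuda:0', non_blocking=True)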


    Update (Feb 8th, 2021)
    This post made me look at my "data-to-model" time spent during training. I compared three alternatives:

    1. DataLoader works on the CPU, and only after the batch is retrieved is the data moved to the GPU.
    2. Same as (1) but with pin_memory=True in DataLoader.
    3. The proposed method of using collate_fn to move data to GPU.

    From my limited experimentation, the second option seems to perform best (though not by a large margin).
    The third option required fussing with the start_method of the data-loader processes, and it seems to incur an overhead at the beginning of each epoch.
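
    A rough way to reproduce such a comparison (a hypothetical timing harness, not the exact benchmark behind these observations) is to time one full pass over each loader variant:

    import time
    import torch

    def time_epoch(loader, device='cuda:0', non_blocking=False, move=True):
        # crude measurement of the "data-to-model" path for one epoch
        torch.cuda.synchronize()
        start = time.perf_counter()
        for i1, i2 in loader:
            if move:  # skip when the collate_fn already moved the batch to the GPU
                i1 = i1.to(device, non_blocking=non_blocking)
                i2 = i2.to(device, non_blocking=non_blocking)
        torch.cuda.synchronize()
        return time.perf_counter() - start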