pytorch, parallel-processing, distributed-computing

How does model.to(rank) work if rank is an integer? (DistributedDataParallel)


I was looking at the basic implementation of DDP:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

Just wondering how PyTorch knows which GPU to put the model on based only on the rank? Usually we pass a torch.device() object to a model's .to() method. How does PyTorch interpret it when .to() is given an integer?


Solution

  • By default, if an integer i is passed as the argument to torch.Tensor.to (and likewise nn.Module.to), it is interpreted as the CUDA device with index i. Here is a test:

    >>> torch.rand(0).to(0).device
    device(type='cuda', index=0)
    
    >>> torch.rand(0, device=0).device
    device(type='cuda', index=0)
    

    Which means .to(0) is the same as .to('cuda:0'). Likewise, .to(torch.device('cuda')) and .cuda() also land on cuda:0, since both default to the current CUDA device, which is device 0 unless you change it.
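
    For the model.to(rank) case in the question, the same rule applies to nn.Module.to. Here is a minimal sketch of the equivalent spellings (assuming a CUDA build with at least rank + 1 GPUs; move_to_rank is just an illustrative helper, not part of the tutorial):

        import torch
        import torch.nn as nn

        def move_to_rank(rank: int) -> None:
            # nn.Module.to accepts the same device specifications as Tensor.to:
            # an integer index, a device string, or a torch.device object.
            # Module.to moves the parameters in place and returns the module itself.
            model = nn.Linear(10, 10)

            model.to(rank)                        # integer -> cuda:<rank>
            model.to(f"cuda:{rank}")              # equivalent device string
            model.to(torch.device("cuda", rank))  # equivalent explicit device object

            # All parameters now live on cuda:<rank>.
            print(next(model.parameters()).device)  # device(type='cuda', index=<rank>)

        if __name__ == "__main__":
            if torch.cuda.is_available():
                move_to_rank(0)

    In the DDP snippet, device_ids=[rank] then tells DistributedDataParallel that this process's replica lives on that same device, so the integer passed to .to() and the one in device_ids should match.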