The following PyTorch code for single-node multi-GPU training with DDP:
https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py
gives the following error when run in a Kaggle environment with two T4 GPU accelerators:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'main' on <module '__main__' (built-in)>
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'main' on <module '__main__' (built-in)>
---------------------------------------------------------------------------
ProcessExitedException Traceback (most recent call last)
Cell In[11], line 104
95 if __name__ == "__main__":
96 # import argparse
97 # parser = argparse.ArgumentParser(description='simple distributed training job')
(...)
100 # parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
101 # args = parser.parse_args()
103 world_size = torch.cuda.device_count()
--> 104 mp.spawn(main, args=(world_size, 5, 10, 32), nprocs=world_size)
Any information is appreciated.
The error happens because mp.spawn launches its workers with the spawn start method, which re-imports the __main__ module in each child process. In a notebook, __main__ is the interactive session rather than an importable file, so the children cannot find the main function and fail with AttributeError: Can't get attribute 'main'.
To make the DDP code work from a notebook, write it out as a real script by putting:
%%writefile ddp.py
at the top of the cell that contains the DDP code.
Then, to run the code and train the model, call the script from another cell:
!python -W ignore ddp.py
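For example, the two notebook cells could look roughly like the sketch below. This is a minimal, hypothetical stand-in rather than the original multigpu.py: the synthetic dataset, the one-layer linear model, MASTER_PORT 12355, and the hyperparameters (10 epochs, batch size 32) are placeholder assumptions chosen only to keep the sketch self-contained and runnable on two GPUs.

Cell 1 (writes the training script to disk):

%%writefile ddp.py
# Minimal DDP training sketch; hyperparameters, port, model and data are placeholders.
import os
import torch
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def ddp_setup(rank: int, world_size: int):
    # Assumed rendezvous settings for a single machine; port 12355 is arbitrary.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def main(rank: int, world_size: int, total_epochs: int, batch_size: int):
    ddp_setup(rank, world_size)
    # Synthetic regression data so the sketch has no external dependencies.
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
                        sampler=DistributedSampler(dataset))
    model = DDP(torch.nn.Linear(20, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for source, targets in loader:
            source, targets = source.to(rank), targets.to(rank)
            optimizer.zero_grad()
            loss = F.mse_loss(model(source), targets)
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} | loss {loss.item():.4f}")
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, 10, 32), nprocs=world_size)

Cell 2 (launches the script):

!python -W ignore ddp.py

Note that the if __name__ == "__main__": guard is still required: each spawned worker re-imports ddp.py, and the guard prevents the children from recursively calling mp.spawn themselves.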