Tags: python, pytorch, distributed-computing

PyTorch distributed from two ec2 instances hangs


# env_vars.sh on rank 0 machine
#!/bin/bash

export MASTER_PORT=23456
export MASTER_ADDR=... # same as below, private ip of machine 0
export WORLD_SIZE=2
export GLOO_SOCKET_IFNAME=enX0
export RANK=0 

# env_vars.sh on rank 1 machine
#!/bin/bash

export MASTER_PORT=23456
export MASTER_ADDR=... # same as above
export WORLD_SIZE=2
export GLOO_SOCKET_IFNAME=enX0
export RANK=1

# on rank 0 machine
$ ifconfig
enX0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet ...  netmask 255.255.240.0  broadcast ...
        inet6 ...  prefixlen 64  scopeid 0x20<link>
        ether ...  txqueuelen 1000  (Ethernet)
        RX packets 543929  bytes 577263126 (550.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 203942  bytes 21681067 (20.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 12  bytes 1020 (1020.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 1020 (1020.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
$ conda activate pytorch_env
$ . env_vars.sh
$ python
>>> import torch.distributed
>>> torch.distributed.init_process_group('gloo')

# Do the same on the rank 1 machine

After 30 seconds or so, machine 0 outputs the following, and machine 1 just continues to hang.

[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1699449045860/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/miniconda3/envs/pytorch_env/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/ec2-user/miniconda3/envs/pytorch_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1155, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/ec2-user/miniconda3/envs/pytorch_env/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1293, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1699449045860/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

I can connect to the rank 0 machine from the rank 1 machine:

# rank 0 machine
nc -lk 23456
# rank 1 machine
telnet … 23456 # use private ip address of rank 0 machine
Trying ...
Connected to …
Escape character is '^]'.
ping
# rank 0 machine
ping
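If nc/telnet aren't available, the same reachability check can be scripted with Python's stdlib; the host and port below are placeholders for MASTER_ADDR and MASTER_PORT:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical private IP of the rank 0 machine):
# print(can_connect("172.31.0.10", 23456))
```

Note this only proves the one configured port is open, which, as it turned out, is not sufficient for Gloo.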

If I run all the same commands in two shells on the rank 0 machine (changing one of them to export RANK=1), init_process_group completes as expected.

A user posted here about the same error; they said they solved it by resetting GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME. Trying something similar on my machines didn't help.
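For reference, "resetting" those variables presumably amounts to unsetting the interface overrides so Gloo falls back to auto-detecting the network interface (this did not fix things in my case):

```shell
# Remove the explicit interface pins; Gloo/TensorPipe then pick one themselves.
unset GLOO_SOCKET_IFNAME
unset TP_SOCKET_IFNAME
```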


Solution

  • I solved this problem by allowing All Traffic between my nodes in their security groups. Initially I was only opening MASTER_PORT, and that was not enough: beyond the rendezvous on MASTER_PORT, the Gloo ranks connect to each other on additional, dynamically chosen ports.
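My understanding of why MASTER_PORT alone isn't enough: MASTER_PORT is only the rendezvous (TCPStore) port. During connectFullMesh, each Gloo rank binds its own listening socket on an OS-assigned ephemeral port and advertises it to peers through the store, so the actual node-to-node traffic arrives on ports you never configured. A stdlib-only sketch of that binding pattern:

```python
import socket

MASTER_PORT = 23456  # the only port I had opened initially

# Binding to port 0 asks the OS for any free ephemeral port; Gloo's TCP
# transport does effectively this for its per-pair listeners.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 0))
listener.listen()
ephemeral_port = listener.getsockname()[1]
print(ephemeral_port)  # almost certainly not MASTER_PORT
listener.close()
```

So a firewall rule for MASTER_PORT only lets the rendezvous succeed while the mesh connections are silently dropped, which matches the symptom of a 30-second hang followed by connectFullMesh failing. A narrower alternative to "All Traffic" would be allowing all traffic only from the peer node's security group.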