When trying to run an example Python file via torch.distributed.run on 2 nodes with 2 GPUs each on a cluster using a SLURM script, I encounter the following errors:
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:16773 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [clara06.url.de]:16773 (errno: 97 - Address family not supported by protocol).
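Just as a sketch (hostname and port copied from the warnings above), the address families the master host resolves to can be checked like this; if only AF_INET entries come back, the failing attempt is the IPv6 one:

import socket

# list which address families the rendezvous host resolves to
# (hostname/port are the values from the warning above; adjust as needed)
for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        "clara06.url.de", 16773, proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr)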
This is the SLURM script:
#!/bin/bash
#SBATCH --job-name=distribution-test # name
#SBATCH --nodes=2 # nodes
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=4 # number of cores per tasks
#SBATCH --partition=clara
#SBATCH --gres=gpu:v100:2 # number of gpus
#SBATCH --time 0:15:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
module load Python
pip install --user -r requirements.txt
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
GPUS_PER_NODE=2
LOGLEVEL=INFO python -m torch.distributed.run --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
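For debugging it can also help to see what the launcher actually exports into each worker. A small snippet like the following (my addition, not part of the original test file) dropped at the top of the script prints the rendezvous-related variables that torch.distributed.run / torchrun sets per rank:

import os

# torchrun exports these into every worker process it spawns
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var, '<unset>')}")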
and this is the Python code that should be running:
import fcntl
import os
import socket
import torch
import torch.distributed as dist
def printflock(*msgs):
    """solves multi-process interleaved print problem"""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
hostname = socket.gethostname()

gpu = f"[{hostname}-{local_rank}]"

try:
    # test distributed
    dist.init_process_group("nccl")
    dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)
    dist.barrier()

    # test cuda is available and can allocate memory
    torch.cuda.is_available()
    torch.ones(1).cuda(local_rank)

    # global rank
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    printflock(f"{gpu} is OK (global rank: {rank}/{world_size})")

    dist.barrier()
    if rank == 0:
        printflock(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}")

except Exception:
    printflock(f"{gpu} is broken")
    raise
I have tried different launch commands, like these:
LOGLEVEL=INFO python -m torch.distributed.run --master_addr $MASTER_ADDR --master_port $MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
LOGLEVEL=INFO torchrun --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
LOGLEVEL=INFO python -m torch.distributed.launch --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
All resulting in the same error.
I have also tried specifying the IP address explicitly instead of the MASTER_ADDR, with no success:
IP_ADDRESS=$(srun hostname --ip-address | head -n 1)
In /etc/resolv.conf the hostnames are clearly mapped to IPv4.

It turned out that "Address family not supported" errors are related to IPv4 vs. IPv6: since my cluster does not provide an IPv6 connection between the nodes, these errors occurred. They can be understood as warnings, though; the connection via IPv4 was still established. I did not find a way to disable the IPv6 attempts, but since they are effectively just informational, I ignored them.
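To convince myself that the IPv4 path really works despite the warnings, a minimal rendezvous check along these lines can be run (a sketch only; hostname and port are the values from the log, and it mirrors the TCPStore example from the PyTorch docs). Start it with ROLE=server on the master node and ROLE=client on the second node:

import datetime
import os
import torch.distributed as dist

# hostname/port taken from the warnings above; adjust for your own job
HOST, PORT = "clara06.url.de", 16773
is_server = os.environ.get("ROLE", "client") == "server"

# the server side binds the store, the client connects to it over IPv4
store = dist.TCPStore(HOST, PORT, 2, is_server,
                      timeout=datetime.timedelta(seconds=60))
if is_server:
    store.set("ping", "pong")
else:
    print("got", store.get("ping"))
print("TCP rendezvous reachable")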