python, pytorch, artificial-intelligence, slurm, multi-gpu

PyTorch Distributed Run with SLURM results in "Address family not found"


When trying to run an example Python file via torch.distributed.run on 2 nodes with 2 GPUs each on a cluster, using a SLURM script, I encounter the following error:

[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:16773 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [clara06.url.de]:16773 (errno: 97 - Address family not supported by protocol).
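
The messages refer to errno 97 (EAFNOSUPPORT): the c10d store tries to bind the IPv6 wildcard [::] and to reach the master over IPv6, and falls back to IPv4 when that fails. A quick way to check which address families a node actually provides (just a sketch; clara06.url.de is the master host taken from the log above):

# Does this node have any global IPv6 address configured?
ip -6 addr show scope global

# Which address families does the master hostname resolve to?
getent ahostsv4 clara06.url.de   # IPv4 records
getent ahostsv6 clara06.url.de   # IPv6 records (likely none on this cluster)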

This is the SLURM script:

#!/bin/bash
#SBATCH --job-name=distribution-test        # name
#SBATCH --nodes=2                           # nodes
#SBATCH --ntasks-per-node=1                 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=4                   # number of cores per tasks
#SBATCH --partition=clara
#SBATCH --gres=gpu:v100:2                   # number of gpus
#SBATCH --time 0:15:00                      # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out                  # output file name

module load Python
pip install --user -r requirements.txt
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
GPUS_PER_NODE=2

LOGLEVEL=INFO python -m torch.distributed.run --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py
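
For reference, the same rendezvous arguments are often shown wrapped in srun, so that one launcher process is started on each allocated node rather than only on the batch host. This is a sketch using the variables defined above, not the exact command from my runs:

srun --ntasks-per-node=1 torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$GPUS_PER_NODE \
    --rdzv_id=$SLURM_JOBID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    torch-distributed-gpu-test.py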

This is the Python code that should be run (torch-distributed-gpu-test.py):

import fcntl
import os
import socket

import torch
import torch.distributed as dist


def printflock(*msgs):
    """solves multi-process interleaved print problem"""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
hostname = socket.gethostname()

gpu = f"[{hostname}-{local_rank}]"

try:
    # test distributed
    dist.init_process_group("nccl")
    dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)
    dist.barrier()

    # test cuda is available and can allocate memory
    torch.cuda.is_available()
    torch.ones(1).cuda(local_rank)

    # global rank
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    printflock(f"{gpu} is OK (global rank: {rank}/{world_size})")

    dist.barrier()
    if rank == 0:
        printflock(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}")

except Exception:
    printflock(f"{gpu} is broken")
    raise

I have tried different launch commands, such as:

LOGLEVEL=INFO python -m torch.distributed.run --master_addr $MASTER_ADDR --master_port $MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py
LOGLEVEL=INFO torchrun --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py
LOGLEVEL=INFO python -m torch.distributed.launch --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES  torch-distributed-gpu-test.py

All resulting in the same error.

I have also tried specifying the IP address explicitly instead of the hostname for MASTER_ADDR:

IP_ADDRESS=$(srun hostname --ip-address | head -n 1)
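
That address would then replace the hostname in the rendezvous endpoint, roughly like this (a sketch of the substitution only; all other arguments as above):

LOGLEVEL=INFO python -m torch.distributed.run --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$IP_ADDRESS:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py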

Solution

  • "Address family not found" errors are related to IPv4 vs. IPv6. As my cluster did not provide IPv6 connectivity between the nodes, these errors occurred.

    However, they can be understood as warnings: the connection via IPv4 was still established.

    I did not find a way to disable the IPv6 attempts, but since these messages are essentially informational, I ignored them. A sketch of how the traffic could at least be pinned to an IPv4 interface is shown below.
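
This is only a sketch and not something I verified on this cluster: NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME are standard environment variables that restrict which network interface the collectives use, and eth0 is an assumed interface name that has to be replaced with whatever `ip addr` shows on the compute nodes. It will not necessarily remove the c10d warnings, which come from the rendezvous store itself, but it ensures the actual NCCL traffic uses the intended IPv4 interface.

# Assumption: eth0 is the nodes' IPv4-capable interface (check with `ip addr`).
export NCCL_SOCKET_IFNAME=eth0   # restrict NCCL collectives to this interface
export GLOO_SOCKET_IFNAME=eth0   # same for the gloo backend, if it is used
export NCCL_DEBUG=INFO           # print which interface/transport NCCL actually picks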