pythondockerartificial-intelligenceamd-gpugfx

torch.distributed.elastic.multiprocessing.errors.ChildFailedError / "HIP error: invalid device function" following AMD AI training docs


I'm new to AI development and I'm trying to train a model. The server has two AMD Radeon RX 7600 XT GPUs and a Ryzen 9 5900XT 16-core CPU. I already ran into several problems when I wrote the training code from scratch (one of them was an OOM on the GPU, and I couldn't use both GPUs because of compatibility issues between torch and ROCm), so I changed my approach and followed the official AMD documentation for AI training. After completing the setup described in that guide and running the command that starts the training, this is the traceback:

Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to /workspace/notebooks/result/logs/log_1744287800.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in <module>
[rank1]:     sys.exit(recipe_main())
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]:     sys.exit(recipe_main(conf))
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank1]:     recipe.setup(cfg=cfg)
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank1]:     self._model = self._setup_model(
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank1]:     m.rope_init()
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank1]:     ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank1]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank1]:     return func(*args, **kwargs)
[rank1]: RuntimeError: HIP error: invalid device function
[rank1]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank1]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in <module>
[rank0]:     sys.exit(recipe_main())
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank0]:     sys.exit(recipe_main(conf))
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank0]:     recipe.setup(cfg=cfg)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank0]:     self._model = self._setup_model(
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank0]:     m.rope_init()
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank0]:     ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

[rank0]:[W410 12:23:26.761619117 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0410 12:23:27.490000 7605 site-packages/torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 7738) of binary: /opt/conda/envs/py_3.10/bin/python3
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
    parser.run(args)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
    args.func(args)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED

The whole thing runs inside a Docker container, as per the documentation. I thought the problem might be that my gfx architecture is not supported, so I tried to set an environment variable permanently in .bashrc:

echo "export PYTORCH_ROCM_ARCH=gfx1102" >> ~/.bashrc
source ~/.bashrc

Then I cloned the PyTorch git repository locally and ran:

cd /workspace/pytorch
pip install -r requirements.txt
python setup.py install

So ultimately I retried the training with:

tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml

but it gave me the same error. There is an official AMD page with a table of compatible GPU prerequisites, and my GPU isn't in that table. However, I found someone online who followed this guide even though their GPU wasn't in the table either, and they said they somehow pulled it off... So I know it should be possible, but I can't figure out how.
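
For context, the gfx target the GPUs actually report can be checked from inside the container with something like the following (a hedged sketch; rocminfo and rocm_agent_enumerator ship with ROCm, though exact availability may depend on the image):

# List the gfx targets the ROCm runtime sees
rocminfo | grep -i gfx
# Or print just the gfx targets of the detected agents
rocm_agent_enumerator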

P.S. The server runs Ubuntu Linux; more information:

PRETTY_NAME="Ubuntu 22.04.5 LTS"

NAME="Ubuntu"

VERSION_ID="22.04"

VERSION="22.04.5 LTS (Jammy Jellyfish)".

Solution

  • Referring to the discussion that took place in the comments under my own question, I'm sharing the complete configuration here for clarity. But first I'll summarize what we discussed in the comments.
    In the comments, we discussed a GitHub issue where other people had the same problem.
    I followed some of the solutions users posted there, and here are the crucial adjustments that finally resolved the HIP error.

    In the /workspace folder inside the Docker container, I first updated ~/.bashrc with new environment variables.
    With nano ~/.bashrc I changed this variable:
    export PYTORCH_ROCM_ARCH=gfx1102
    to this, as someone suggested in the GitHub issue:
    export PYTORCH_ROCM_ARCH=gfx1031
    Then I added another crucial environment variable:
    export HSA_OVERRIDE_GFX_VERSION=10.3.1
    and of course ran source ~/.bashrc.
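
    A quick way to confirm the overrides are actually visible to the shell that will rebuild PyTorch (a hedged sanity check, not part of the guide):

    # Should print the two variables with the values set above
    env | grep -E 'PYTORCH_ROCM_ARCH|HSA_OVERRIDE_GFX_VERSION'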

    After this I had to recompile PyTorch with these new environment variables, so I proceeded as follows:

    git clone --recursive https://github.com/pytorch/pytorch.git
    cd pytorch
    python setup.py install
    
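    A hedged sanity check after the rebuild (not from the AMD guide): confirm that the freshly built torch is a ROCm/HIP build and was compiled for the intended gfx target:

    # torch.version.hip is set on ROCm builds; get_arch_list() shows the compiled gfx targets
    python -c "import torch; print(torch.__version__, torch.version.hip); print(torch.cuda.get_arch_list())"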

    Here I skipped pip install -r requirements.txt because I had already installed PyTorch by following the AMD documentation:

    pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
    

    If this causes problems, you can just run pip install -r requirements.txt inside the pytorch folder cloned earlier, and after that run python setup.py install again, as shown below.
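
    Put together, that fallback looks like this (assuming the repository was cloned to /workspace/pytorch, as in the question):

    # Fallback: install the repo's build requirements, then rebuild
    cd /workspace/pytorch
    pip install -r requirements.txt
    python setup.py install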

    Right after that I ran:

    tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml
    

    Finally the HIP error is gone and the training is running, or at least I think so, because the last log line I see during the training is:
    INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
    and I don't actually know whether it is stuck or just needs some time to continue (I'll open another question if I end up stuck on this; see also the monitoring note after the config below).
    In any case, the HIP error has been resolved, so here is my custom configuration YAML:

    output_dir: /workspace/notebooks/result/ # /tmp may be deleted by your system. Change it to your preference.
    
    # Tokenizer
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /workspace/notebooks/modello-preaddestrato/original/tokenizer.model
      max_seq_len: null
    
    # Dataset
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: /workspace/notebooks/datasets/dataset.json
      packed: False  # True increases speed
      conversation_column: conversations
      conversation_style: chatml
    seed: 42
    shuffle: True
      
    
    # Model Arguments
    model:
      _component_: torchtune.models.llama3_1.llama3_1_8b
    
    checkpointer:
      _component_: torchtune.training.FullModelHFCheckpointer
      checkpoint_dir: /workspace/notebooks/modello-preaddestrato/
      checkpoint_files: [
        model-00001-of-00004.safetensors,
        model-00002-of-00004.safetensors,
        model-00003-of-00004.safetensors,
        model-00004-of-00004.safetensors
      ]
      recipe_checkpoint: null
      output_dir: ${output_dir}
      model_type: LLAMA3
    resume_from_checkpoint: False
    
    # Fine-tuning arguments
    batch_size: 2
    epochs: 2
    
    optimizer:
      _component_: torch.optim.AdamW
      lr: 2e-5
      fused: True
    loss:
      _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
    max_steps_per_epoch: null
    clip_grad_norm: null
    compile: False  # torch.compile the model + loss, True increases speed + decreases memory
    optimizer_in_bwd: False  # True saves memory. Requires gradient_accumulation_steps=1
    gradient_accumulation_steps: 4  # Use to increase effective batch size
    
    # Training env
    device: cuda
    
    # Memory management
    enable_activation_checkpointing: True  # True reduces memory
    enable_activation_offloading: False  # True reduces memory
    custom_sharded_layers: ['tok_embeddings', 'output']  # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.
    
    # Reduced precision
    dtype: bf16
    
    # Logging
    metric_logger:
      _component_: torchtune.training.metric_logging.DiskLogger
      log_dir: ${output_dir}/logs
    log_every_n_steps: 1
    log_peak_memory_stats: True
    
    
    # Profiler (disabled)
    profiler:
      _component_: torchtune.training.setup_torch_profiler
      enabled: False
    
      #Output directory of trace artifacts
      output_dir: ${output_dir}/profiling_outputs
    
      #`torch.profiler.ProfilerActivity` types to trace
      cpu: True
      cuda: True
    
      #trace options passed to `torch.profiler.profile`
      profile_memory: False
      with_stack: False
      record_shapes: True
      with_flops: False
    
      # `torch.profiler.schedule` options:
      # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
      wait_steps: 5
      warmup_steps: 3
      active_steps: 2
      num_cycles: 1
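
    On the question of whether Rank 0 is stuck or simply still loading the checkpoint: a hedged suggestion (not from the AMD guide) is to watch GPU memory and utilization from a second shell inside the container while the shards load; if VRAM usage keeps growing, it is most likely still loading:

    # rocm-smi ships with ROCm; refresh every 5 seconds
    watch -n 5 rocm-smi
    # The DiskLogger configured above writes to ${output_dir}/logs,
    # so tailing the newest log file is another way to see progress:
    tail -f /workspace/notebooks/result/logs/log_*.txt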