I'm new to AI development and I'm trying to train a model. The server has two AMD GPUs, Radeon RX 7600 XT, and the CPU is a Ryzen 9 5900XT 16-core. I already had several problems when I wrote the training code from scratch (one of them was an OOM on the GPU; I couldn't use both GPUs because of compatibility issues between torch and ROCm). So I changed my approach and followed the official AMD documentation for AI training. After doing all the setup explained in that guide and running the command that starts the training, this is the traceback:
Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to /workspace/notebooks/result/logs/log_1744287800.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in <module>
[rank1]: sys.exit(recipe_main())
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]: sys.exit(recipe_main(conf))
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank1]: recipe.setup(cfg=cfg)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank1]: self._model = self._setup_model(
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank1]: m.rope_init()
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank1]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank1]: return func(*args, **kwargs)
[rank1]: RuntimeError: HIP error: invalid device function
[rank1]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank1]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in <module>
[rank0]: sys.exit(recipe_main())
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank0]: sys.exit(recipe_main(conf))
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank0]: recipe.setup(cfg=cfg)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank0]: self._model = self._setup_model(
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank0]: m.rope_init()
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank0]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]:[W410 12:23:26.761619117 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0410 12:23:27.490000 7605 site-packages/torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 7738) of binary: /opt/conda/envs/py_3.10/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/bin/tune", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
The whole thing runs inside a Docker container, as per the documentation. Since I thought the problem was that my gfx target is not supported, I tried to set an environment variable permanently in .bashrc:
echo "export PYTORCH_ROCM_ARCH=gfx1102" >> ~/.bashrc
source ~/.bashrc
Then I cloned the PyTorch repository locally and rebuilt it:
cd /workspace/pytorch
pip install -r requirements.txt
python setup.py install
So ultimately I retried the training with
tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml
but it gave me the same error.
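In case it's relevant, the gfx target the cards report and the architectures the installed torch wheel was built for can be checked from inside the container like this (assuming rocminfo and rocm-smi are available in the AMD Docker image):
# ISA name the ROCm runtime reports for each GPU (should be gfx1102 for an RX 7600 XT)
rocminfo | grep -i gfx
# Confirm both cards are visible to ROCm
rocm-smi
# gfx architectures the installed torch build ships kernels for
python -c "import torch; print(torch.cuda.get_arch_list())"
If the wheel's arch list doesn't include the card's gfx target, the "invalid device function" error is expected, because there is simply no compiled kernel for that ISA.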
There is an official AMD page with a table of GPUs that meet the prerequisites for this guide, and my GPU isn't in that table. But I found someone online who followed the guide even though their GPU wasn't in the table either, and they said they somehow pulled it off.
So I know there should be a way to do it, but I can't figure out how.
P.S. The server runs Ubuntu. More information:
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)".
Referring to the discussion that took place in the comments under my own question, I'm sharing the complete configuration here for clarity. But first, a summary of what we discussed in the comments.
In the comments we discussed a GitHub issue where other people had the same problem. I followed some of the solutions posted there by other users, and here are the crucial adjustments that finally solved the HIP error.
In the /workspace folder inside the Docker container, I first updated ~/.bashrc (with nano ~/.bashrc) with new environment variables. I changed this variable:
export PYTORCH_ROCM_ARCH=gfx1102
into this:
export PYTORCH_ROCM_ARCH=gfx1031
as someone suggested in the GitHub issue. Then I added a crucial environment variable:
export HSA_OVERRIDE_GFX_VERSION=10.3.1
and, obviously, ran source ~/.bashrc.
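Before recompiling anything, it's worth double-checking that both values are actually exported in the shell you will build from (a trivial check, but easy to miss when working inside Docker):
# Confirm both variables are set in the current shell
env | grep -E "PYTORCH_ROCM_ARCH|HSA_OVERRIDE_GFX_VERSION"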
After this I had to recompile PyTorch with these new environment variables, so I proceeded like this:
git clone --recursive https://github.com/pytorch/pytorch.git
cd pytorch
python setup.py install
Here I skipped pip install -r requirements.txt because I already had PyTorch installed, following the AMD documentation:
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
If this causes problems, you can simply run pip install -r requirements.txt inside the cloned pytorch folder and then run python setup.py install.
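After the build finished, a quick sanity check (my own addition, not part of the AMD guide) to confirm that the rebuilt torch targets the right architecture and still sees both GPUs:
# HIP version, number of visible GPUs, and the gfx targets baked into this build (should now include gfx1031)
python -c "import torch; print(torch.version.hip); print(torch.cuda.device_count()); print(torch.cuda.get_arch_list())"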
Right after that I ran:
tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml
Finally the HIP error is gone and the training is running, or at least I think so, because the last log I see during training is:
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
and I don't actually know whether it's blocked or whether it just takes some time to continue (I'll open another question if I end up getting stuck on this).
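In the meantime, one way to tell whether it's really stuck or just slowly loading the 8B checkpoint is to watch GPU memory and utilization from a second shell in the container (again assuming rocm-smi is available there):
# Refresh rocm-smi every second; VRAM usage that keeps climbing means the load is still progressing
watch -n 1 rocm-smi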
But anyway, the HIP error has been resolved, so here is the custom configuration YAML:
output_dir: /workspace/notebooks/result/ # /tmp may be deleted by your system. Change it to your preference.

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /workspace/notebooks/modello-preaddestrato/original/tokenizer.model
  max_seq_len: null

# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: /workspace/notebooks/datasets/dataset.json
  packed: False # True increases speed
  conversation_column: conversations
  conversation_style: chatml
seed: 42
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/notebooks/modello-preaddestrato/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 2
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  fused: True
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
clip_grad_norm: null
compile: False # torch.compile the model + loss, True increases speed + decreases memory
optimizer_in_bwd: False # True saves memory. Requires gradient_accumulation_steps=1
gradient_accumulation_steps: 4 # Use to increase effective batch size

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
custom_sharded_layers: ['tok_embeddings', 'output'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
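As a side note, if you adapt this config it may be worth validating it before launching a full two-GPU run; torchtune ships a tune validate subcommand for that (the path below is just my own setup; check tune validate --help for the exact invocation on your version):
# Check that the config is well-formed before starting a distributed run
tune validate /workspace/notebooks/my_custom_config_distributed.yaml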