Tags: python, pytorch, torchtune, llama3

Fine-tuning the Llama 3 model with torchtune gives an error


I'm trying to fine-tune the Llama 3 model with torchtune.

These are the steps I've already done:

1. pip install torch
2. pip install torchtune
3. tune download meta-llama/Meta-Llama-3-8B --output-dir llama3 --hf-token ***(my token)***
4. tune run lora_finetune_single_device --config llama3/8B_lora_single_device device="cpu"
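
For reference, step 3 downloads the Hugging Face repo snapshot into ./llama3, and the Meta-format checkpoint plus tokenizer that the recipe needs should land in an original/ subfolder. A quick way to check (the exact file listing is an assumption and may differ slightly):

ls llama3/original
# expected to contain something like: consolidated.00.pth  params.json  tokenizer.model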

Then this error happens:

INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3-8B/original/
  checkpoint_files:
  - consolidated.00.pth
  model_type: LLAMA3
  output_dir: /tmp/Meta-Llama-3-8B/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: true
device: cpu
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 64
log_every_n_steps: null
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /tmp/lora_finetune_output
model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  apply_lora_to_mlp: false
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /tmp/lora_finetune_output
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B/original/tokenizer.model

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 2762364121. Local seed is seed + rank = 2762364121 + 0
Writing logs to /tmp/lora_finetune_output/log_1717420025.txt
Traceback (most recent call last):
  File "/home/ggpt/.local/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "<frozen runpy>", line 286, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ggpt/.local/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 510, in <module>
    sys.exit(recipe_main())
             ^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 504, in recipe_main
    recipe.setup(cfg=cfg)
  File "/home/ggpt/.local/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 182, in setup
    checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/recipes/lora_finetune_single_device.py", line 135, in load_checkpoint
    self._checkpointer = config.instantiate(
                         ^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(config, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/utils/_checkpointing/_checkpointer.py", line 517, in __init__
    self._checkpoint_path = get_path(self._checkpoint_dir, checkpoint_files[0])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggpt/.local/lib/python3.12/site-packages/torchtune/utils/_checkpointing/_checkpointer_utils.py", line 44, in get_path
    raise ValueError(f"{input_dir} is not a valid directory.")
ValueError: /tmp/Meta-Llama-3-8B/original is not a valid directory.

Should I copy the original folder from the llama3 download path to the /tmp folder? It's about a 16 GB model. Can I give the already-downloaded model path to tune?


Solution

  • Try running it with an additional parameter, checkpointer.checkpoint_dir. Its value should be the path to the original folder of the downloaded Llama model (in this case llama3/original, since the model was downloaded with --output-dir llama3). There is no need to copy anything to /tmp.

    More info here: Llama3 in torchtune

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>
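
    For this particular setup, assuming the command is run from the directory containing the llama3 folder created in step 3 and keeping the CPU override from step 4, that would look roughly like:

    tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=llama3/original \
    tokenizer.path=llama3/original/tokenizer.model \
    checkpointer.output_dir=llama3/original \
    device="cpu"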