I have been trying to fine-tune a causal LM by retraining its `lm_head` layer. Training with DeepSpeed ZeRO stage 3 works fine, but I have a problem saving the fine-tuned model and loading it back. I think the problem is that the `unwrapped_model.save_pretrained()` function automatically ignores the frozen parameters during saving. Here is my code and the error messages:
```python
# finetuning
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, PreTrainedModel

accelerator = Accelerator(log_with="tensorboard", project_dir=project_dir)
model: torch.nn.Module = AutoModelForCausalLM.from_pretrained(
    "the path to LM model", trust_remote_code=True
)
model.half()
model.train()

# freeze all parameters except lm_head
for param in model.parameters():
    param.requires_grad = False
for param in model.lm_head.parameters():
    param.requires_grad = True

...

# save finetuned model
if step == 5000 and accelerator.is_main_process:
    unwrapped_model: PreTrainedModel = accelerator.unwrap_model(model)
    save_fn = accelerator.save
    unwrapped_model.save_pretrained(
        "mycogagent",
        is_main_process=accelerator.is_main_process,
        save_function=save_fn,
    )
```
The code above prints a warning:

```
Removed shared tensor {a long list of parameter names in the original LM model except the parameter name of lm_head} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
```

The saved model only takes about 7 MB of disk space, whereas I expected it to be over 30 GB. It looks like only the unfrozen parameters were saved to disk.
To verify this, I tried to load it back with the following code:

```python
model = AutoModelForCausalLM.from_pretrained("mycogagent", trust_remote_code=True)
```

But this fails with a size-mismatch error:

```
RuntimeError: Error(s) in loading state_dict for CogAgentForCausalLM:
    size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
    You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```
I also tried following the instruction in the error message by adding `ignore_mismatched_sizes=True`, but that does not work either: the call below prints a list of warnings and then the program gets stuck.
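For reference, this is what the attempted call looks like:

```python
model = AutoModelForCausalLM.from_pretrained(
    "mycogagent",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,  # suggested by the error message; does not help here
)
```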
```
Some weights of CogAgentForCausalLM were not initialized from the model checkpoint at mycogagent and are newly initialized: [a list of parameter names]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of CogAgentForCausalLM were not initialized from the model checkpoint at mycogagent and are newly initialized because the shapes did not match:
- model.embed_tokens.weight: found shape torch.Size([0]) in the checkpoint and torch.Size([32000, 4096]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```
The warning messages clearly suggest that loading the fine-tuned model was unsuccessful, and the hang looks like a separate issue. All in all, my question is: how do I save the full model instead of only the fine-tuned parameters? What is the proper convention for saving and loading fine-tuned Hugging Face models?
Expected behavior:
The `save_pretrained` function should save all tensors in the Hugging Face transformer model, even if their `requires_grad` attribute is `False`.
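To illustrate, here is a quick sanity check outside DeepSpeed (the tiny checkpoint and the output directory are just placeholders; any small causal LM behaves the same): frozen parameters still show up in `state_dict()` and in the files written by `save_pretrained`, so `requires_grad` alone cannot explain the missing weights.

```python
# Sanity check without DeepSpeed/ZeRO-3: freezing parameters does not
# remove them from the state dict or from the saved checkpoint.
from transformers import AutoModelForCausalLM

m = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
for p in m.parameters():
    p.requires_grad = False

print(len(m.state_dict()))       # full parameter count, frozen or not
m.save_pretrained("tiny-check")  # writes every tensor despite requires_grad=False
```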
--- UPDATE ---
I just located the cause of my problem. The state dict is fine during the entire saving process, but the `id_tensor_storage(tensor)` function (in `site-packages/transformers/pytorch_utils.py`) does not get the correct pointer to the tensor: its output is always `(device(type='cuda', index=0), 0, 0)`. In practice, the `unique_id` should be the memory address of the tensor's storage instead of 0. Thus, the source of this issue must lie in the `accelerator.unwrap_model` function.
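A small diagnostic that illustrates this, assuming a transformers version where `id_tensor_storage` lives in `transformers.pytorch_utils` (as in my install): under ZeRO-3 every parameter the local rank sees is an empty placeholder, so all of them map to the same storage key and the shared-tensor check in `save_pretrained` removes them.

```python
# Diagnostic sketch: print the storage key used to detect shared tensors.
# Under ZeRO-3 the local parameters are empty placeholders (shape torch.Size([0])),
# so every one of them yields the same (device, 0, 0) key.
from transformers.pytorch_utils import id_tensor_storage

unwrapped_model = accelerator.unwrap_model(model)
for name, param in unwrapped_model.named_parameters():
    print(name, tuple(param.shape), id_tensor_storage(param))
```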
I think I found the solution. The problem is that under ZeRO-3 we have to call `accelerator.get_state_dict(model)` before saving. Directly saving the model itself won't work because its parameters are partitioned across different GPUs; calling `accelerator.get_state_dict(model)` forces DeepSpeed to collect the values of ALL parameters. There is an example [here][1].
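A sketch of what the corrected save step could look like, following the pattern from the Accelerate documentation (note that, unlike my snippet above, the gather has to run on every process, not only the main one, because collecting the ZeRO-3 shards is a collective operation):

```python
if step == 5000:
    # Must run on every rank: gathering the ZeRO-3 partitioned parameters
    # is a collective operation, so don't guard this with is_main_process.
    state_dict = accelerator.get_state_dict(model)

    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        "mycogagent",
        is_main_process=accelerator.is_main_process,  # only the main process writes files
        save_function=accelerator.save,
        state_dict=state_dict,  # full, gathered weights instead of the local ZeRO-3 shards
    )
```

With the full state dict passed in, `save_pretrained` should write the complete checkpoint, and `from_pretrained("mycogagent", trust_remote_code=True)` should load it back without the size-mismatch error.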