pytorch · nlp · large-language-model · huggingface-trainer

How to get custom columns in the model's forward() function when training with the Hugging Face Trainer?


I am using the Hugging Face Trainer to train a custom model subclassing a Llama LLM. After tokenization, my dataset has the fields 'input_ids', 'labels', and so on, and I additionally add two custom columns, 'interact_ids' and 'candidate_ids'. But I can't get these custom fields in the forward() function of my model, class LLMWithCustomLayer(LlamaForCausalLM).
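
For reference, the tokenization step that produces these columns looks roughly like this (a simplified sketch; raw_dataset, the 'text' column, and tokenize_fn are placeholder names):

    def tokenize_fn(example):
        enc = tokenizer(example["text"], truncation=True)
        enc["labels"] = enc["input_ids"].copy()
        # the two custom columns in question
        enc["interact_ids"] = example["interact_ids"]
        enc["candidate_ids"] = example["candidate_ids"]
        return enc

    train_dataset = raw_dataset.map(tokenize_fn)

My model's forward() looks like this: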

    def forward(
            self,
            input_ids: torch.LongTensor = None,
            attention_mask: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.LongTensor] = None,
            past_key_values: Optional[List[torch.FloatTensor]] = None,
            inputs_embeds: Optional[torch.FloatTensor] = None,
            labels: Optional[torch.LongTensor] = None,
            use_cache: Optional[bool] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
            interact_ids = None,
            candidate_ids = None,
        ):
            print('interact_ids, candidate_ids', interact_ids, candidate_ids)  # these print as None

            interact_embs = []
            candidate_embs = []
            for i in range(interact_ids.shape[0]):
                # O_i = F_i(e_i)
                interact_embs.append(self.item_emb_proj(self.get_item_emb(interact_ids[i])))
                # O_i = F_i(e_i)
                candidate_embs.append(self.item_emb_proj(self.get_item_emb(candidate_ids[i])))
            # replace the [CandidateEmb] and [HistoryEmb] placeholder tokens
            inputs_embeds = self.replace_hist_candi_token(input_ids, inputs_embeds, interact_embs, candidate_embs)
    
            return super().forward(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_values=past_key_values,
                inputs_embeds=inputs_embeds,
                use_cache=use_cache,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
                labels=labels
            )

I am new to LLM fine-tuning. Can anyone help me? I would be very grateful.


Solution

  • You need to modify the data collator to pass interact_ids and candidate_ids to your model: the standard collators only batch the usual tokenizer fields, and the Trainer strips columns it doesn't recognize by default (more on remove_unused_columns after the collator example below).

    Modify the data collator:

    import torch
    from transformers import DataCollatorWithPadding

    class CustomDataCollator(DataCollatorWithPadding):
        def __call__(self, features):
            # pop the custom fields first; tokenizer.pad() only knows standard keys
            interact_ids = [f.pop("interact_ids") for f in features]
            candidate_ids = [f.pop("candidate_ids") for f in features]
            batch = super().__call__(features)
            batch["interact_ids"] = torch.tensor(interact_ids)
            batch["candidate_ids"] = torch.tensor(candidate_ids)
            return batch
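
    If the custom columns are being dropped before they even reach the collator, you can also tell the Trainer to keep all dataset columns. remove_unused_columns is an existing TrainingArguments option; the other arguments shown are placeholders:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",              # placeholder
        remove_unused_columns=False,   # keep every dataset column in the batch
    )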
    

    Then pass the collator to the Trainer:

    trainer = Trainer(
        model=LLMWithCustomLayer.from_pretrained("your-llama-model"),
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=CustomDataCollator(tokenizer)
    )
    

    Now, your forward() method will receive interact_ids and candidate_ids.
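
    As a quick sanity check, you can call the collator on a couple of toy features (made-up values, assuming your tokenizer has a pad token set):

    collator = CustomDataCollator(tokenizer)
    batch = collator([
        {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1],
         "interact_ids": [10, 11], "candidate_ids": [20, 21]},
        {"input_ids": [4, 5], "attention_mask": [1, 1],
         "interact_ids": [12, 13], "candidate_ids": [22, 23]},
    ])
    print(batch["interact_ids"].shape)  # torch.Size([2, 2])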

    Hope it helps!