python large-language-model llama attention-model fine-tuning

Llama_cookbook: why are labels not shifted for CausalLM?


I'm studying the llama_cookbook repo, in particular their finetuning example. This example uses the LlamaForCausalLM model and the samsum_dataset (input: dialog, output: summary). Looking at how they process the dataset, in particular at the "labels" part:

Code:

prompt = tokenizer.encode(tokenizer.bos_token + sample["prompt"], add_special_tokens=False)
summary = tokenizer.encode(sample["summary"] + tokenizer.eos_token, add_special_tokens=False)

sample = {
    "input_ids": prompt + summary,
    "attention_mask" : [1] * (len(prompt) + len(summary)),
    "labels": [-100] * len(prompt) + summary,
    }

I also see this when printing actual samples from the DataLoader they create (note: to reproduce the behavior you'll have to change the dataset path in the package internals to knkarthick/samsum, or use a different dataset):

batch = next(iter(train_dataloader))
print(batch['input_ids'][0][35:40])
print(batch['labels'][0][35:40])

Output:
tensor([19791,   512,    32, 36645, 41778])
tensor([ -100,  -100,    32, 36645, 41778])

Why is the summary part of labels identical to input_ids? I thought that for CausalLM we must have labels[i] = input_ids[i + 1] for every label we want to predict.


Solution

  • The CausalLM paradigm says that (ideally) labels[i] = generated_text(input_ids[0:i]), i.e. the model predicts the i-th token from all of the preceding tokens input_ids[0:i]. Looked at over the whole sequence, each labels[i] is therefore simply input_ids[i], so there is no error in the cookbook: the one-position shift between inputs and targets is applied inside the model when it computes the loss, not in the dataset. Your confusion probably comes from 0-based indexing, since the slice input_ids[0:i] does not include the i-th element itself.

    To calculate the loss (usually with CrossEntropyLoss) one disregards the prompt part and only compares the desired and the predicted tokens of the summary. This is why the labels vector is filled with -100 over the prompt: -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so those positions are skipped, effectively serving as a mask.

    For efficiency the model does not generate the summary token by token during training. A single forward pass over the whole sequence (teacher forcing) produces a logit vector for every position at once, and the loss then compares the prediction at each position with the corresponding label; because logits and labels cover the same sequence, they line up one-to-one. A minimal sketch of that loss is shown below.
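
    Here is a minimal sketch of that shifted loss, assuming the standard Hugging Face convention (the real computation happens inside the model's forward pass when you pass labels; the function name below is just for illustration). It shows why the labels can stay aligned with input_ids: both the one-position shift and the -100 masking are handled by the loss itself.

    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # logits: (batch, seq_len, vocab_size) from a single forward pass
        # labels: (batch, seq_len), identical to input_ids except for the -100 mask
        shift_logits = logits[:, :-1, :]   # prediction made at position i ...
        shift_labels = labels[:, 1:]       # ... is scored against the token at position i + 1
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,             # prompt positions (-100) contribute nothing to the loss
        )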