Huggingface GPT2 loss understanding

(Also posted here https://discuss.huggingface.co/t/newbie-understanding-gpt2-loss/33590)

I am stuck on understanding the GPT2 loss. I want to give the model a label containing the target it will generate, so that I can see the loss go to zero.

I have an input text, input_text = "Welcome to New York". The current model predicts the next word as City. The loss will never be zero if I give input_text as the label. How do I simulate giving the label "Welcome to New York City" so that the internal neural net (irrespective of the model) gives a loss of zero or close to it?

To explain more what I mean, here is the snippet.

Note: I have read in the forum and the documentation that the labels can be the same as the input text, that the model shifts the labels left by one position, and that no loss is calculated for the last token. But even then the loss should become zero, which it does not.

Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids....
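
To make the quoted shift concrete, here is a minimal pure-Python sketch of a shifted loss on a toy vocabulary. It is not the library's code: the function name shifted_nll, the toy distributions, and the four-token vocabulary are all made up for illustration, and it assumes the loss is the mean negative log-likelihood over the positions that have a target, with -100 labels skipped.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def shifted_nll(probs, labels):
    """Mean negative log-likelihood with labels shifted by one position.

    probs[i] is a toy next-token distribution produced at position i.
    It is scored against labels[i + 1], so the first label is never used
    and the last distribution has no target.
    """
    terms = []
    for i in range(len(labels) - 1):
        target = labels[i + 1]
        if target == IGNORE_INDEX:
            continue  # ignored positions contribute nothing
        terms.append(-math.log(probs[i][target]))
    return sum(terms) / len(terms) if terms else float("nan")


# Hypothetical toy data: vocabulary of 4 tokens, sequence of 3 tokens.
labels = [0, 1, 2]
probs = [
    [0.10, 0.70, 0.10, 0.10],  # after token 0, scored against labels[1] = 1
    [0.25, 0.25, 0.25, 0.25],  # after token 1, scored against labels[2] = 2
    [0.00, 0.00, 0.00, 1.00],  # after token 2, no target, never scored
]
loss = shifted_nll(probs, labels)
print(f"toy loss = {loss:.4f}")
```

In this sketch the loss only reaches zero when every scored distribution puts probability 1 on its target, which a pretrained language model does not do.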

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name,model_max_length=1024,padding_side='left')
tokenizer.pad_token = tokenizer.eos_token # == <|endoftext|> = 50256
model = GPT2LMHeadModel.from_pretrained(model_name)

batch_size=5
input_text  = "<|endoftext|> Welcome to New York"
target_text = "Welcome to New York City"

# encode the inputs
encoding = tokenizer(input_text,padding=True,max_length=batch_size,truncation=True,return_tensors="pt",)
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
# encode the targets
target_encoding = tokenizer(target_text,padding=True, max_length=batch_size, truncation=True,return_tensors="pt",)
labels = target_encoding.input_ids
# replace padding token id's of the labels by -100 so it's ignored by the loss
labels[labels == tokenizer.pad_token_id] = -100  # in our case there is no padding
print(f"input_ids={input_ids}")
print(f"attention_mask={attention_mask}") # all ones
print(f"labels ={labels}")
# forward pass
outputs = model(input_ids=input_ids,labels=labels) 
print(f"Model Loss {outputs.loss}")
# Test the model to check what it predicts next
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)
answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(f"Result '{answer}'")

Output

input_ids=tensor([[50256, 19134,   284,   968,  1971]]) # not sure what eostoken (50256) in input does to model
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]]) # 2254 = City, which is what the model should predict
Model Loss 8.248174667358398
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result '<|endoftext|> Welcome to New York City'
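
One way to read this output, assuming the internal shift quoted above from the documentation (the prediction made after input_ids[i] is compared with labels[i + 1]): because the labels here are already shifted by hand, they end up shifted twice. The sketch below lists which target each position is scored against. The helper scored_pairs is hypothetical, and the token names other than 50256, 1971 and 2254 are inferred from the order of the printed ids.

```python
# Token ids copied from the printed output. Names other than 50256, 1971
# and 2254 are inferred from token order, so treat them as assumptions.
NAMES = {50256: "<|endoftext|>", 19134: " Welcome", 14618: "Welcome",
         284: " to", 968: " New", 1971: " York", 2254: " City"}


def scored_pairs(input_ids, labels):
    """Return (last context token, target token) for each scored position.

    Assumes the prediction made after input_ids[i] is compared with
    labels[i + 1]; the prediction after the final input token is dropped.
    """
    return [(input_ids[i], labels[i + 1]) for i in range(len(labels) - 1)]


input_ids = [50256, 19134, 284, 968, 1971]
labels = [14618, 284, 968, 1971, 2254]

pairs = scored_pairs(input_ids, labels)
for context, target in pairs:
    print(f"after {NAMES[context]!r:>17} the loss expects {NAMES[target]!r}")
```

Under this reading, City is scored as the continuation of New, and the prediction made after York is never scored at all. That would account for a high loss even though generate still appends City.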

When I do it the usual way, with the labels equal to the input,

input_text  = "Welcome to New York"
target_text = input_text

I get a loss of about 3.26

input_ids=tensor([[14618,   284,   968,  1971]]) # 1971 = York
attention_mask=tensor([[1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971]])

Model Loss 3.2614505290985107
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'

Is it that

outputs = model(input_ids=input_ids, labels=labels)

is generating more than one token?

Update

Based on the answer by Jindfitch. I am putting this here because the SO moderators deleted it when I tried to add it as an answer.

"You can try to fine-tune the model to be absolutely sure that City will follow with 100% probability"

I trained GPT2 on this particular text (training only the last 2 layers and freezing the others), took the checkpoint with the lowest loss, and tested again. Sure enough, the loss was much lower: Model Loss 0.01076329406350851

For anyone else who would like to follow along, the training code is linked below.

Note: with such a small text and the way I have trained, I am not sure the setup is proper, as the training loss jumped around a bit (that is, it increased after some epochs, in this case Epoch 8).

2023-03-12 16:03:20,579 [INFO] Epoch 7 complete. Loss: 0.18975284695625305 saving ./test/gpt2-epoch-8-2023-03-12 16:02:19.289492
2023-03-12 16:03:20,985 [INFO] Epoch 9 of 10
2023-03-12 16:03:27,655 [INFO] Epoch 8 complete. Loss: 0.3775772750377655 saving ./test/gpt2-epoch-9-2023-03-12 16:02:19.289492
2023-03-12 16:03:27,655 [INFO] Epoch 10 of 10
2023-03-12 16:03:34,140 [INFO] Epoch 9 complete. Loss: 6.827305332990363e-05 saving ./test/gpt2-epoch-10-2023-03-12 16:02:19.289492

Training script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/gpt2_train_model.py

Training Output log https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/training/training_2023-03-12%2016%3A02%3A19.289492.log

Training data ("Welcome to New York City", with a space at the end) - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/data/small.txt

Eval script - https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/older/gpt2_loss_learn.py

I removed the token corresponding to 'City' from input_ids before passing them to generate:

# drop the last token from input_ids as well as attention_mask
input_ids = input_ids[:,:-1] # input_text  = "Welcome to New York"
attention_mask = attention_mask[:,:-1]
print(f"input_ids={input_ids}")
outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,max_new_tokens=1)

Eval Script Output

python3 ./older/gpt2_loss_learn.py 
input_ids=tensor([[14618,   284,   968,  1971,  2254]])
attention_mask=tensor([[1, 1, 1, 1, 1]])
labels =tensor([[14618,   284,   968,  1971,  2254]])
Model Loss 0.01076329406350851
input_ids=tensor([[14618,   284,   968,  1971]])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Result 'Welcome to New York City'

A much more illustrative example https://github.com/alexcpn/tranformer_learn/blob/gpt-loss-learn/LLM_Loss_Understanding.ipynb

Solution

The default loss function is negative log-likelihood. The actual model output is not the token City but a categorical distribution over the entire 50k vocabulary. Depending on the generation strategy, you either sample from these distributions or take the most probable token.

The token City, apparently the most probable one, gets some probability, and its contribution to the loss is minus the logarithm of this probability. The reported loss is the average of such terms over all scored positions, so a loss close to zero would mean every target token gets a probability close to one. However, the token distribution also considers many plausible but less likely follow-ups. A loss of 3.26 corresponds to an average (geometric mean) per-token probability of exp(-3.26), approximately 3.8%. It seems small, but in a 50k vocabulary, it is approximately 2000 times more probable than a random guess.
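
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the loss is reported in nats, uses 50257 as the exact GPT-2 vocabulary size behind the "50k" figure, and the helper name loss_to_prob is made up for illustration.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size (the "50k" mentioned above)


def loss_to_prob(loss):
    """Convert a mean negative log-likelihood (in nats) to a probability."""
    return math.exp(-loss)


p = loss_to_prob(3.26)          # loss reported in the question
uniform = 1.0 / VOCAB_SIZE      # probability of a uniformly random guess

print(f"probability for loss 3.26       = {p:.4f}")
print(f"times more likely than uniform  = {p / uniform:.0f}")
print(f"loss of a uniform random guess  = {math.log(VOCAB_SIZE):.2f}")
print(f"probability after fine-tuning   = {loss_to_prob(0.01076329406350851):.4f}")
```

The last line uses the loss reported after fine-tuning in the question's update. It shows how close to probability one the targets have to be before the loss gets near zero.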

You can try to fine-tune the model to be absolutely sure that City will follow with 100% probability, but it would probably break other language modeling capabilities.