nlp · huggingface-transformers · pre-trained-model

GPT-2 and other Hugging Face models: using the -100 label index for training instead of the pad token


I understand the -100 label id is used so that the predictions for these are not included when calculating the loss.

However, on Hugging Face they state, when replacing pad tokens: "complicated list comprehension here because pad_token_id alone is not good enough to know whether label should be excluded or not". Their implementation uses nn.CrossEntropyLoss(), which has an "ignore_index" argument.

Is there any benefit to changing the label id to -100, as opposed to passing ignore_index=pad_token_id to the loss? Or are the results the same?

The way it is written makes me think there is some benefit, but the description of "ignore_index" appears to achieve what is wanted.


Solution

  • The author of the tutorial you mentioned sets the labels to -100 and relies on ignore_index to save a few lines of code. You don't see the line where the author passes something to ignore_index because it has a default value: the default ignore_index for nn.CrossEntropyLoss is -100. Using this value instead of the respective pad token id allows you to write model-independent training code, since you don't have to pass the pad token id from the tokenizer down to the loss function.
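
    A minimal sketch of the equivalence, using toy logits rather than a real model (the tensors and pad_token_id value are made up for illustration): masking labels with -100 and using the default loss gives the same result as keeping the pad token id in the labels and passing it as ignore_index.

    ```python
    import torch
    import torch.nn as nn

    # Toy logits: 4 token positions over a 5-token vocabulary.
    torch.manual_seed(0)
    logits = torch.randn(4, 5)

    pad_token_id = 0  # hypothetical pad token id, for illustration only

    # Option A: replace pad positions in the labels with -100 and rely on
    # the default ignore_index of nn.CrossEntropyLoss, which is -100.
    labels_a = torch.tensor([2, 3, -100, -100])
    loss_a = nn.CrossEntropyLoss()(logits, labels_a)

    # Option B: keep the pad token id in the labels and tell the loss
    # to ignore that id explicitly.
    labels_b = torch.tensor([2, 3, pad_token_id, pad_token_id])
    loss_b = nn.CrossEntropyLoss(ignore_index=pad_token_id)(logits, labels_b)

    print(torch.allclose(loss_a, loss_b))  # the two losses match
    ```

    One caveat, which is what the "complicated list comprehension" comment hints at: option B silently excludes every position whose label happens to equal pad_token_id, including real (non-padding) occurrences of that token. The -100 convention avoids this, because -100 can never be a genuine vocabulary id.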