pytorch, huggingface-transformers, huggingface-tokenizers, huggingface, huggingface-datasets

Setting padding token as eos token when using DataCollatorForLanguageModeling from HuggingFace


In https://huggingface.co/learn/nlp-course/chapter7/6#preparing-the-dataset, there is

from transformers import DataCollatorForLanguageModeling

# GPT-2 has no dedicated padding token, so the tutorial reuses EOS for padding
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

The tutorial uses a pretrained GPT-2 model and its tokenizer to build a dataset for a causal language modeling pretraining task.

My question about the line above is that the padding token is set to the EOS token. As a result, even the original EOS tokens will be ignored by the model during training, since they will be perceived as padding tokens too.

This would prevent my model from learning to output an EOS token when its generation should end.

Why is this in the tutorial, and is it actually a correct way to do it?
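
To make my concern concrete, here is a small sketch (using the plain GPT-2 tokenizer rather than the tutorial's, purely as an illustration) showing that the collator does set the label of every EOS position to -100, including the "real" EOS at the end of each example:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # as in the tutorial

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# two examples of different lengths, so the shorter one gets padded
examples = [
    tokenizer("hello world" + tokenizer.eos_token),
    tokenizer("hi" + tokenizer.eos_token),
]
batch = data_collator(examples)

print(batch["input_ids"])  # padded with 50256, GPT-2's <|endoftext|> id
print(batch["labels"])     # every 50256 position is -100, even the real EOS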


Solution

  • TL;DR

    Ignoring the EOS symbol when training a normal language model is okay. So padding the sequence with EOS instead of a dedicated PAD symbol is okay too.


    In Long

    When using DataCollatorForLanguageModeling(tokenizer, mlm=False), the masked-language-modeling objective is off and we are doing causal language modeling, i.e. predicting the next word given the previous ones. Consider this:

    ['this', 'is', 'a', 'foobar', '.', 'EOS']
    

    Now we pad the sequence until it is 10 tokens long:

    ['this', 'is', 'a', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS']
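
    A quick sketch (assuming the GPT-2 tokenizer with pad_token = tokenizer.eos_token; the sentence is just illustrative) of what this padding looks like at the token-id level, where 50256 is GPT-2's <|endoftext|> id:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    encoded = tokenizer(
        "this is a foobar ." + tokenizer.eos_token,
        padding="max_length",
        max_length=10,
    )
    print(encoded["input_ids"])       # trailing slots are filled with 50256
    print(encoded["attention_mask"])  # 0 over the padded slots, 1 over the real EOS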
    

    When the model trains as a causal language model, it predicts the next word given the previous ones, i.e.

    >>> predict(next_token, given=["BOS"])
    'this'
    
    >>> predict(next_token, given=["BOS", "this"])
    'is'
    
    ...
    
    >>> predict(next_token, given=["BOS", "this", "is", "a", "foobar", "."])
    'EOS'
    

    In the most common inference routines, the model stops once the first EOS is predicted, or, with beam search, once every beam has produced its first EOS.
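
    A hedged sketch of that stopping criterion with Hugging Face's generate (assuming plain GPT-2; the prompt and max_new_tokens are just illustrative):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("this is a", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        eos_token_id=tokenizer.eos_token_id,  # stop as soon as EOS is generated
        pad_token_id=tokenizer.eos_token_id,  # silences the "no pad token" warning
    )
    print(tokenizer.decode(outputs[0]))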

    During training, the model will learn:

    ground_truth = [
     'this', 'is', 'a', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 
    ]
    
    ground_prediction = [
     'this', 'is', 'foobar', '.', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 
    ]
    

    And when you compute the perplexity (i.e. the loss), all the PAD positions are ignored; so when you treat EOS as PAD, you are essentially telling the model that even the first EOS does not count when computing the perplexity.
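
    A minimal illustration (not the exact Hugging Face internals, and with the usual one-position shift of the logits omitted for brevity) of why the -100 positions don't count: PyTorch's CrossEntropyLoss uses ignore_index=-100 by default, which is what the Trainer's LM loss, and hence the perplexity, relies on:

    import torch

    vocab_size = 50257                                  # GPT-2 vocabulary size
    logits = torch.randn(1, 10, vocab_size)             # (batch, seq_len, vocab)
    labels = torch.full((1, 10), -100)                  # start with everything masked
    labels[0, :6] = torch.randint(0, vocab_size, (6,))  # only the first 6 positions count

    loss_fct = torch.nn.CrossEntropyLoss()              # ignore_index defaults to -100
    loss = loss_fct(logits.view(-1, vocab_size), labels.view(-1))
    print(loss, torch.exp(loss))                        # perplexity = exp(mean loss)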

    Q: Is that the right thing to do to ignore even the first EOS token, when we use EOS as a padding token?

    A: It depends on your task and what you want 'EOS' to mean. For most natural language, we have punctuation before 'EOS', so whether it is treated as EOS or PAD doesn't really matter. For programming languages, we have '\n' and ';' or some other end-of-sequence operator, so EOS isn't strictly necessary either.

    Q: Then why do we bother to pad?

    A: That's actually a good question. We pad so that every sequence in a batch has the same length, which lets the dot-products in the transformer attention be computed "easily" as dense batched matrix operations.

    But there are many cases where padded tokens can be packed away efficiently, like in RNNs, see https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html (IIRC, not in the transformer architecture though).
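
    A small sketch of that RNN-side mechanism, with made-up shapes and lengths, just to show the packing/unpacking round trip:

    import torch
    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    padded = torch.randn(2, 5, 8)    # (batch, max_len, features), already padded
    lengths = torch.tensor([5, 3])   # true lengths of the two sequences

    packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
    rnn = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)
    packed_out, _ = rnn(packed)      # the GRU never sees the padded positions
    out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
    print(out.shape, out_lengths)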

    But I don't know how much of that has made it into the underlying PyTorch/JAX libraries for "efficient" transformers, which would allow us to avoid pre-padding inputs. From my experience with Hugging Face PyTorch models, if you don't pad the inputs in a batch, the model will most probably complain when you do a forward pass =(

    If only someone would fix that mathematically. Maybe someone has tried, but it isn't yet common enough to be used by most pre-trained transformer models.