seq2seq encoder-decoder

What are the differences between T5 and BART?


I have a question regarding T5 and BART. They seem very similar from a bird's-eye view, but I want to know precisely how they differ. As far as I know, they are both seq2seq models, and there are no differences in architecture (both use the encoder and decoder of the original Transformer). The difference lies in the training method and the form of the input, but I could not work it out in more detail.

Can anyone help me with a detailed explanation?


Solution

  • Similarity: Both models are encoder-decoder models, and both are suitable for most seq2seq tasks such as summarization, translation, question answering, and comprehension. Both were released in 2019: T5 by Google and BART by Facebook AI.

    Differences:

    1. Pretraining objective:

      • T5's pretraining objective randomly samples and then drops out 15% of the tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence; the sentinel IDs are special tokens added to the vocabulary and do not correspond to any word piece. The target then consists of all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence, plus a final sentinel token to mark the end of the target sequence (see Section 3.1.4, "Unsupervised Objective", in the T5 paper).

      • BART is trained by corrupting documents and then optimizing a reconstruction loss, the cross-entropy between the decoder's output and the original document. BART allows the application of any type of document corruption; in the extreme case, where all information about the source is lost, BART is equivalent to a language model. The pretraining process is explained in detail in Section 2.2, "Pre-training BART", of the BART paper linked below. (A toy side-by-side illustration of the two formats is given right after this list.)

    2. Activation functions:

      • T5 initially used ReLU; later variants such as T5 v1.1 switched to a gated GELU feed-forward layer. Take a look at the implementation for the precise details.

      • BART, following GPT, replaces the ReLU activations with GELUs.

    3. Parameter initialization:

      • BART initializes parameters from N(0, 0.02).
      • T5 initializes parameters from N(0, 1/sqrt(d_model)).
    4. Pretraining corpus:

      • T5 used C4 (the Colossal Clean Crawled Corpus).
      • BART was pretrained on roughly the same corpus as RoBERTa (news, books, stories, and web text) and was then fine-tuned and evaluated on tasks such as SQuAD (extractive question answering on Wikipedia paragraphs), MNLI (a bitext classification task to predict whether one sentence entails another), ELI5 (a long-form abstractive question answering dataset), XSum (summarization), ConvAI2 (dialogue response generation conditioned on context and a persona), and CNN/DM (summarization).
    5. Positional encoding:

      • T5 uses relative position embeddings.
      • BART uses learned absolute position embeddings. (The concrete settings behind points 2, 3, and 5 can be read off the Hugging Face configs; see the sketch after the summary below.)
    6. Finally, the two models use different tokenizers: T5 uses a SentencePiece unigram tokenizer, while BART reuses the byte-level BPE tokenizer of GPT-2/RoBERTa (see the comparison sketch at the end).
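
    To make point 1 concrete, here is a toy, hand-written illustration of the two input/target formats, reusing the example sentence from the T5 paper. This is only a sketch of the formats, not the actual preprocessing code of either model, and token boundaries are simplified:

    ```python
    # Toy illustration of the two pretraining objectives (not the real preprocessing code).
    original = "Thank you for inviting me to your party last week ."

    # T5 span corruption: ~15% of the tokens are dropped, each consecutive dropped span
    # is replaced by a unique sentinel token, and the target lists only the dropped
    # spans, delimited by the same sentinels plus a final end-of-target sentinel.
    t5_input  = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
    t5_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

    # BART text infilling (one of several corruption schemes it supports): each dropped
    # span is replaced by a single <mask> token, and the decoder has to reconstruct the
    # entire original document, not just the missing pieces.
    bart_input  = "Thank you <mask> me to your party <mask> week ."
    bart_target = original

    print(t5_input, "->", t5_target)
    print(bart_input, "->", bart_target)
    ```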

    In short: since both are standard encoder-decoder Transformers, the main difference from T5 is that BART is primarily a pretraining approach that learns to map corrupted documents back to the original.
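
    The concrete hyperparameters behind points 2, 3, and 5 can be read directly from the Hugging Face configs. A minimal sketch, assuming the `transformers` library is installed and using the default `t5-base` and `facebook/bart-base` checkpoints (attribute names may shift slightly between library versions):

    ```python
    from transformers import BartConfig, T5Config

    t5_cfg = T5Config.from_pretrained("t5-base")
    bart_cfg = BartConfig.from_pretrained("facebook/bart-base")

    # 2. Activation functions
    print(t5_cfg.feed_forward_proj)      # "relu" for original T5, "gated-gelu" for t5.1.1
    print(bart_cfg.activation_function)  # "gelu"

    # 3. Parameter initialization
    print(t5_cfg.initializer_factor)     # scaling factor for T5's fan-in-style init (1.0)
    print(bart_cfg.init_std)             # 0.02, i.e. weights drawn from N(0, 0.02)

    # 5. Positional information
    print(t5_cfg.relative_attention_num_buckets)  # T5: buckets for relative position biases
    print(bart_cfg.max_position_embeddings)       # BART: learned absolute positions (1024)
    ```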

    Also, you can find more information by reading the linked resources. If you want finer-grained technical differences, look into the models' implementations on Hugging Face as well.
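
    Regarding point 6, here is a quick tokenizer comparison (again only a sketch, same two checkpoints): T5 ships a SentencePiece unigram tokenizer, while BART reuses the byte-level BPE of GPT-2/RoBERTa, so the same text is split quite differently:

    ```python
    from transformers import AutoTokenizer

    t5_tok = AutoTokenizer.from_pretrained("t5-base")
    bart_tok = AutoTokenizer.from_pretrained("facebook/bart-base")

    text = "BART and T5 are both encoder-decoder transformers."
    print(t5_tok.tokenize(text))    # SentencePiece pieces (note the "▁" word-start marker)
    print(bart_tok.tokenize(text))  # byte-level BPE pieces (note the "Ġ" space marker)

    # T5's vocabulary also contains the <extra_id_N> sentinel tokens used by its
    # span-corruption objective; BART has a single <mask> token instead.
    print(t5_tok.additional_special_tokens[:3])
    print(bart_tok.mask_token)      # "<mask>"
    ```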

    Resources: