In Transformer models, sequences of variable lengths are typically padded to the maximum length in a batch. However, if my sequence lengths vary significantly, the batch may contain a substantial amount of padding (potentially over 50%).
I am curious about the following:
When PyTorch computes the Transformer, do padding tokens impact calculation speed negatively? Does the presence of the attention mask allow the model to effectively skip over padding tokens, resulting in only a minimal performance impact?
Overall, how effective is the attention mask? If I have a sparse attention mask with only 10% non-zero values, does the computation effectively reduce to approximately 10%?
Thank you for your insights!
Attention is computed on a tensor of shape (batch_size, sequence_length, embedding_dimension). The compute and memory requirements scale with the size of those dimensions.
For an input of fixed size, the percentage of padding does not impact performance. There is some minor overhead from applying a padding mask at all (i.e., not having a padding mask saves you one masked-fill operation), but between x% padding and y% padding you're not going to see a difference. The overall compute requirements are set by the tensor size.
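To make this concrete, here is a minimal sketch of masked scaled dot-product attention. The tensor sizes and the 50%-padding pattern (pad_mask) are illustrative assumptions, not anything from your setup; the point is that the matrix multiplications run over the full tensor and the mask adds only one extra masked_fill.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of masked scaled dot-product attention.
# The sizes and the 50% padding pattern below are illustrative assumptions.
batch, seq_len, d_model = 32, 128, 512
q = k = v = torch.randn(batch, seq_len, d_model)

# True where a position is padding (here, the second half of every sequence).
pad_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
pad_mask[:, seq_len // 2:] = True

scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (batch, seq_len, seq_len)
# The only extra cost of padding support is this single masked_fill:
scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
attn = F.softmax(scores, dim=-1)                    # padded keys get weight 0
out = attn @ v
# Both matmuls run on the full (batch, seq_len, seq_len) tensor,
# no matter how many positions are marked as padding.
```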
With respect to batching sequences, there can be added inefficiency when batching together sequences of wildly different lengths. Say you have 10 sequences of length 8 and 10 sequences of length 128. Now pad and batch those sequences into two batches. If you mix lengths evenly, you get two batches with a sequence length of 128. If you sort by length before batching, you get one batch with sequence length 8 and another with sequence length 128. The first case (two batches of sequence length 128) requires more compute overall than the second case (one batch of 8, one of 128), as the rough count below shows.
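A quick token count illustrates the gap. This is just the example above written out; the lengths and batch split are hypothetical.

```python
# Compare padded-token counts for the two batching strategies above.
lengths = [8] * 10 + [128] * 10
batch_size = 10

def padded_tokens(batch_lengths):
    # Each batch is padded up to its longest sequence.
    return max(batch_lengths) * len(batch_lengths)

# Mixed lengths: both batches end up padded to 128.
mixed = [lengths[0::2], lengths[1::2]]
# Sorted by length: one batch pads to 8, the other to 128.
by_length = [sorted(lengths)[:batch_size], sorted(lengths)[batch_size:]]

print(sum(padded_tokens(b) for b in mixed))      # 2560 tokens processed
print(sum(padded_tokens(b) for b in by_length))  # 1360 tokens processed
```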
That said, for a fixed input size, you aren't going to see a performance change from the percentage of padding. There is no way for the attention operation to "skip over" padding tokens: the conditional control flow that would require doesn't map well onto the way GPUs execute operations in parallel. The only effect of the padding mask is that it assigns zero attention weight to padding tokens.
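You can see this with nn.MultiheadAttention directly; the sizes below are arbitrary and this is only a sketch, but the key_padding_mask zeroes the attention weights on padded keys while the output keeps its full shape.

```python
import torch
from torch import nn

# Sketch: key_padding_mask only zeroes attention weights; it does not
# shrink the computation. Sizes here are arbitrary illustrative values.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(4, 10, 16)            # (batch, seq_len, embed_dim)
pad = torch.zeros(4, 10, dtype=torch.bool)
pad[:, 6:] = True                     # last 4 positions are padding

out, weights = mha(x, x, x, key_padding_mask=pad, need_weights=True)
print(weights[:, :, 6:].abs().max())  # 0: padded keys receive no weight
print(out.shape)                      # torch.Size([4, 10, 16]), full-size output
```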