conv-neural-network, transformer-model, self-attention

For an image or sequence, what properties do transformers use?


Today my teacher asked me a question: he said that a CNN exploits the translation invariance of images or matrices. So what properties does a Transformer use?


Solution

  • There are two main properties of transformers that make them so appealing compared to convolutions:

    1. A transformer is permutation equivariant: without positional information, permuting the input tokens permutes the outputs in the same way. This makes transformers very useful for set prediction. For sequences and images, where order does matter, positional encodings/embeddings are added.
    2. The receptive field of a transformer is the entire input (!), as opposed to the very limited receptive field of a single convolution layer.

    See sec. 3 and fig. 3 in:
    Shir Amir, Yossi Gandelsman, Shai Bagon and Tali Dekel, "Deep ViT Features as Dense Visual Descriptors" (arXiv 2021).
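
Both properties can be checked numerically. The following is only a minimal NumPy sketch, not code from the cited paper: it assumes a single attention head with no positional encoding, and the helper names `self_attention` and `conv1d_same`, the random weights, and the 3-tap kernel are all made up for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention, no positional encoding. x: (n_tokens, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution along the token axis, applied per channel."""
    pad = len(kernel) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        (padded[i:i + len(kernel)] * kernel[:, None]).sum(axis=0)
        for i in range(len(x))
    ])

rng = np.random.default_rng(0)
n, d = 8, 4  # illustrative sizes
x = rng.normal(size=(n, d))
# Hypothetical projection weights, scaled so the softmax is not too peaked.
w_q, w_k, w_v = (rng.normal(size=(d, d)) * d ** -0.5 for _ in range(3))
kernel = np.array([0.25, 0.5, 0.25])

# Property 1: permuting the input tokens permutes the output rows identically.
perm = rng.permutation(n)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
equivariant = np.allclose(out_perm, out[perm])

# Property 2: perturb only the last token and see which output rows change.
x_pert = x.copy()
x_pert[-1] += 1.0
attn_diff = np.abs(self_attention(x_pert, w_q, w_k, w_v) - out)
conv_diff = np.abs(conv1d_same(x_pert, kernel) - conv1d_same(x, kernel))
attn_changed = attn_diff.max(axis=1) > 1e-9
conv_changed = conv_diff.max(axis=1) > 1e-9

print("permutation equivariant:", equivariant)
print("attention rows affected:", int(attn_changed.sum()), "of", n)
print("convolution rows affected:", int(conv_changed.sum()), "of", n)
```

In this sketch, changing one token affects every attention output row, while the width-3 convolution only affects the rows whose window covers that token. Stacking convolution layers widens their receptive field gradually, whereas one attention layer already sees the whole input.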