python, python-3.x, pytorch

What is the difference between the various backends in torch.nn.attention.SDPBackend, and what do they mean?


In the PyTorch docs on SDPBackend there are a few enum values that can be used with the sdpa_kernel context manager:

ERROR: An error occurred when trying to determine the backend.
MATH: The math backend for scaled dot product attention.
FLASH_ATTENTION: The flash attention backend for scaled dot product attention.
EFFICIENT_ATTENTION: The efficient attention backend for scaled dot product attention.
CUDNN_ATTENTION: The cuDNN backend for scaled dot product attention.

What do they mean and how are they different?

What exactly is the EFFICIENT_ATTENTION backend? Also, I checked torch.backends.cuda.flash_sdp_enabled() on a machine without a GPU and it returned True, but isn't flash attention supposed to be GPU-only, since it relies on fast on-chip GPU memory? And is efficient attention just FlashAttention 2?
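
For context, the check I ran looks like this (a minimal sketch, assuming a recent PyTorch where these flags exist under torch.backends.cuda):

    import torch

    # Each of these reports whether the corresponding SDPA backend is enabled.
    # On my CPU-only machine, flash_sdp_enabled() still prints True.
    print(torch.backends.cuda.flash_sdp_enabled())
    print(torch.backends.cuda.mem_efficient_sdp_enabled())
    print(torch.backends.cuda.math_sdp_enabled())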


Solution

  • MATH is PyTorch's native C++ attention implementation.

  • FLASH_ATTENTION is the attention implementation from the FlashAttention paper.

  • EFFICIENT_ATTENTION is the memory-efficient attention implementation from Facebook's xformers library.

  • CUDNN_ATTENTION is the implementation from NVIDIA's cuDNN library.

You can read more about the differences here.
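
As for flash_sdp_enabled() being True without a GPU: the torch.backends.cuda.*_sdp_enabled() functions report a global toggle saying whether a backend is allowed to be selected (they default to True), not whether the current hardware can actually run it. Here is a minimal sketch of selecting a backend explicitly, assuming PyTorch 2.3+ where torch.nn.attention.sdpa_kernel is available:

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    q, k, v = (torch.randn(2, 8, 256, 64) for _ in range(3))

    # Restrict SDPA to an explicit set of backends; PyTorch uses the first
    # listed backend whose constraints (device, dtype, head dim, masks, ...)
    # the inputs satisfy, and falls back to the next one otherwise.
    with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]):
        out = F.scaled_dot_product_attention(q, k, v)

    print(out.shape)  # torch.Size([2, 8, 256, 64])

If none of the allowed backends can handle the inputs, scaled_dot_product_attention raises a RuntimeError instead of silently falling back.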