In the PyTorch docs on SDPBackend there are a few enum members available to be used with the context manager:
ERROR: An error occurred when trying to determine the backend.
MATH: The math backend for scaled dot product attention.
FLASH_ATTENTION: The flash attention backend for scaled dot product attention.
EFFICIENT_ATTENTION: The efficient attention backend for scaled dot product attention.
CUDNN_ATTENTION: The cuDNN backend for scaled dot product attention.
What do they mean and how are they different?
What exactly is the EFFICIENT_ATTENTION backend? Also, I checked torch.backends.cuda.flash_sdp_enabled() on a machine without a GPU and it returned True, but isn't flash attention supposed to be GPU-only, since it relies on fast on-chip GPU memory? Is efficient attention just FlashAttention 2?
MATH
is PyTorch's own C++ attention implementation, composed of standard operations
FLASH_ATTENTION
is the attention implementation from the FlashAttention paper
EFFICIENT_ATTENTION
is the memory-efficient attention implementation from the Facebook (Meta) xFormers library
CUDNN_ATTENTION
is the implementation from the NVIDIA cuDNN library
You can read more about the differences here.
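For context, here is a minimal sketch of forcing a specific backend and inspecting the global enable flags. It assumes PyTorch 2.3+ (where torch.nn.attention.sdpa_kernel replaces the older torch.backends.cuda.sdp_kernel), and the tensor shapes are placeholders:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(2, 8, 128, 64, device=device)
k = torch.randn(2, 8, 128, 64, device=device)
v = torch.randn(2, 8, 128, 64, device=device)

# Restrict scaled_dot_product_attention to a single backend inside this block.
# MATH is always available; SDPBackend.FLASH_ATTENTION,
# SDPBackend.EFFICIENT_ATTENTION or SDPBackend.CUDNN_ATTENTION can be
# requested the same way when the inputs and hardware support them.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)

# These flags report whether a backend is globally *allowed*, not whether the
# current device can actually run it, which is why flash_sdp_enabled() can
# return True on a CPU-only machine.
print(torch.backends.cuda.flash_sdp_enabled())
print(torch.backends.cuda.mem_efficient_sdp_enabled())
print(torch.backends.cuda.math_sdp_enabled())
```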