macos, performance, neural-network, inference, half-precision-float

Why do BF16 models have slower inference on Mac M-series chips compared to F16 models?


I read on https://github.com/huggingface/smollm/tree/main/smol_tools:

All models are quantized to 16-bit floating-point (F16) for efficient inference. Training was done on BF16, but in our tests, this format provides slower inference on Mac M-series chips.

Why do BF16 models have slower inference on Mac M-series chips compared to F16 models?


Solution

  • From https://redd.it/1glx8ul:

    bf16 requires avx512 instruction set (Tacx79)

    and as mentioned on knowledge.alteryx.com:

    Apple Silicon (M1, M2) chips use the ARM architecture and do not support AVX instructions

    unlike F16, which has had native hardware support on ARM for much longer. Without a hardware-accelerated BF16 path, inference code typically has to widen BF16 values to F32 in software around each operation, which is where the extra cost comes from (see the sketch after this list).
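
As a rough illustration (this sketch is mine, not from the sources above; the bit layouts themselves are standard), BF16 is simply the top 16 bits of an F32, so emulating it means shifting/widening to F32 around every operation, while F16 (IEEE 754 half precision) has its own layout and its own native arithmetic instructions on recent ARM cores:

    import struct

    def f32_to_bf16(x: float) -> int:
        """Truncate an F32 to BF16 by keeping its top 16 bits (round-toward-zero)."""
        (bits,) = struct.unpack("<I", struct.pack("<f", x))
        return bits >> 16

    def bf16_to_f32(b: int) -> float:
        """Widen a BF16 back to F32 by shifting it into the high half-word."""
        return struct.unpack("<f", struct.pack("<I", b << 16))[0]

    x = 3.1415926
    b = f32_to_bf16(x)
    print(f"F32 value : {x}")
    print(f"BF16 bits : {b:016b}  (1 sign, 8 exponent, 7 mantissa bits)")
    print(f"Round trip: {bf16_to_f32(b)}")  # ~3.140625: same range as F32, less precision
    # F16 (IEEE 754 half) uses 1 sign, 5 exponent, 10 mantissa bits instead.
    # On hardware without a BF16 unit, each BF16 multiply/add needs this kind of
    # widen-to-F32 step (plus a truncate on the way back), whereas F16 arithmetic
    # can run directly on the M-series vector units -- hence the slower BF16 inference.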