I read on https://github.com/huggingface/smollm/tree/main/smol_tools (mirror 1):
All models are quantized to 16-bit floating-point (F16) for efficient inference. Training was done on BF16, but in our tests, this format provides slower inference on Mac M-series chips.
Why do BF16 models have slower inference on Mac M-series chips compared to F16 models?
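To make the comparison concrete, here is a rough timing sketch I would use to reproduce the F16-vs-BF16 gap myself. This is my own sketch, not the smol_tools benchmark; the "mps" device choice, matrix size, and iteration count are assumptions, and it needs a reasonably recent PyTorch build:

```python
# Rough matmul timing sketch: compare float16 vs bfloat16 on Apple Silicon.
# Not the smol_tools benchmark; device, size, and iteration count are my own choices.
import time
import torch

def bench(dtype, device, n=2048, iters=20):
    a = torch.randn(n, n, device=device).to(dtype)
    b = torch.randn(n, n, device=device).to(dtype)
    for _ in range(3):              # warm-up iterations
        _ = a @ b
    if device == "mps":
        torch.mps.synchronize()     # MPS kernels run asynchronously
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = a @ b
    if device == "mps":
        torch.mps.synchronize()
    return (time.perf_counter() - t0) / iters

device = "mps" if torch.backends.mps.is_available() else "cpu"
for dtype in (torch.float16, torch.bfloat16):
    try:
        print(f"{dtype} on {device}: {bench(dtype, device) * 1e3:.1f} ms per matmul")
    except (RuntimeError, TypeError) as err:
        print(f"{dtype} on {device}: not supported ({err})")
```

A bare matmul is obviously not the same as end-to-end inference, but it isolates the data type from everything else.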
From https://redd.it/1glx8ul:
bf16 requires avx512 instruction set (Tacx79)
and as mentioned on knowledge.alteryx.com:
Apple Silicon (M1, M2) chips use the ARM architecture and do not support AVX instructions
unlike F16, which has been around for much longer and is supported natively on these chips.
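The AVX-512 point cannot be the whole story, though, since AVX is an x86 extension that does not exist on ARM at all; what seems more relevant is whether the chip exposes ARM's own FP16 and BF16 arithmetic features. A minimal check, assuming the Apple-documented sysctl keys hw.optional.arm.FEAT_FP16 and hw.optional.arm.FEAT_BF16:

```python
# Minimal sketch: ask macOS whether the CPU exposes hardware FP16 / BF16 arithmetic.
# hw.optional.arm.FEAT_FP16 and hw.optional.arm.FEAT_BF16 are Apple-documented
# sysctl keys for the corresponding ARM architecture features.
import subprocess

def has_arm_feature(name: str) -> bool:
    try:
        result = subprocess.run(
            ["sysctl", "-n", f"hw.optional.arm.{name}"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip() == "1"
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False  # key missing, or not running on macOS

for feature in ("FEAT_FP16", "FEAT_BF16"):
    print(f"{feature}: {'supported' if has_arm_feature(feature) else 'not reported'}")
```

That at least shows whether a given M-series chip has native BF16 arithmetic at all, which seems closer to the question above than AVX-512.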