x86 avx512

Relation between Avx512_fp16 and Avx512bw (on non-Intel machines)


I am writing a program that uses vfmadd231ph (from avx512_fp16) and vpbroadcastw (from avx512bw). The program detects the CPU features at run time and dispatches to code paths (including the one that requires avx512_fp16 and avx512bw).

My question: is avx512bw guaranteed to be present whenever avx512_fp16 is? I have seen this post, which quotes Intel's documentation: "The AVX512_FP16* ISA extensions require that AVX512BW feature be implemented ..."

So, for Intel machines, it seems fine to assume this.

What about AMD machines? I could not find any information on this. In general, AMD's documentation on AVX-512 is lacking compared to Intel's.

Clang seems to assume that this holds for every avx512_fp16 target, regardless of vendor, so I guess it is safe to go with this.

The reason I am asking is that I am using inline assembly. If this is not guaranteed, I will need two separate code paths, one for when avx512bw is present and one for when it is not, which I want to avoid.

Thanks

I was expecting a specification on AMD's part.

Edit 1: Also, as far as I know, no AMD CPU currently supports avx512_fp16. So I am really asking about future CPUs, if any such CPU appears.

Edit 2: To be more specific, I am asking whether anyone has more information, or knows of existing online documentation that I missed.


Solution

  • It's very unlikely that any vendor would make a CPU with any AVX-512 features that omitted AVX-512BW (other than Xeon Phi).

    It's part of -march=x86-64-v4 because every CPU with AVX-512F except Xeon Phi has had AVX-512BW, starting from Skylake-Xeon. It's also part of AVX10.1.

    This is doubly true for a CPU implementing AVX-512FP16, which as you noted doesn't have its own broadcast instruction and is designed around the assumption that CPUs with it will also have AVX-512BW. Or at least the 16-bit element-size parts of AVX-512BW.


    You can write your CPU-feature detection code to check for both FP16 and BW just in case someone runs it in an emulator or VM with a weird mix of features enabled. But you can just fall back to not using FP16 at all in that case because it's not a mix of features that any real-world CPU aiming for commercial success would have.

    There are lots of things CPU vendors could do, but which we don't have to optimize for because they'd make the CPU a pain to use, or have problems with some existing commercially important software. This is especially true in the x86 world where backwards compatibility with existing binaries is always a selling point, and only Xeon Phi has really tried selling an x86-based CPU that wasn't intended to efficiently run existing binaries. See for example "Do all CPUs which support AVX2 also support SSE4.2 and AVX?" - hypothetically you could have a CPU which didn't support legacy-SSE encodings of vector instructions, but practically it's not worth worrying about.

    In the very unlikely case that some future weird CPU does come out that you want to support, it'll probably need its own tuning choices anyway as well as supporting a weird feature mix. So you can wait until then to develop a version of your function for it. It's too unlikely to ever be needed to spend time writing anything now.
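The suggestion above, to check for both FP16 and BW and fall back to a non-FP16 path otherwise, could be sketched roughly as follows. This is only an illustration under some assumptions: it targets GCC or Clang (it relies on their `<cpuid.h>` helpers and GNU inline asm), and the names `kernel_path`, `choose_path`, and `detect_path` are hypothetical, not taken from the question. The CPUID bit positions are the ones documented by Intel for leaf 7, subleaf 0.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed toolchain) */
#endif

/* CPUID.(EAX=7,ECX=0) feature bits. */
#define BIT_AVX512F     (1u << 16)  /* EBX bit 16 */
#define BIT_AVX512BW    (1u << 30)  /* EBX bit 30 */
#define BIT_AVX512_FP16 (1u << 23)  /* EDX bit 23 */

/* XCR0 state the OS must enable for AVX-512:
   SSE (bit 1), AVX (bit 2), opmask (bit 5), ZMM_Hi256 (bit 6), Hi16_ZMM (bit 7). */
#define XCR0_AVX512_STATE 0xE6u

/* Hypothetical code-path identifiers for the dispatcher. */
typedef enum { PATH_GENERIC, PATH_AVX512_FP16_BW } kernel_path;

/* Pure decision function: pick the FP16 path only if F, BW and FP16 are all
   reported and the OS saves/restores the AVX-512 register state. Any odd mix
   (e.g. FP16 without BW in an emulator or VM) falls back to the generic path. */
kernel_path choose_path(uint32_t leaf7_ebx, uint32_t leaf7_edx, uint64_t xcr0)
{
    bool os_ok = (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
    bool f     = (leaf7_ebx & BIT_AVX512F) != 0;
    bool bw    = (leaf7_ebx & BIT_AVX512BW) != 0;
    bool fp16  = (leaf7_edx & BIT_AVX512_FP16) != 0;
    return (os_ok && f && bw && fp16) ? PATH_AVX512_FP16_BW : PATH_GENERIC;
}

/* Reads the real CPUID/XCR0 values on x86; always generic elsewhere. */
kernel_path detect_path(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* OSXSAVE (CPUID.1:ECX bit 27) must be set before XGETBV may be executed. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return PATH_GENERIC;

    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    uint64_t xcr0 = ((uint64_t)hi << 32) | lo;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return PATH_GENERIC;

    return choose_path(ebx, edx, xcr0);
#else
    return PATH_GENERIC;
#endif
}
```

A dispatcher would call `detect_path()` once at startup and select the inline-asm kernel using `vfmadd231ph` and `vpbroadcastw` only when it returns `PATH_AVX512_FP16_BW`. Keeping the decision in a pure function makes it easy to exercise unusual feature mixes without needing hardware that reports them.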