Tags: simd, intrinsics, avx512, hamming-weight

Population count in AVX512


I have been trying to use _mm256_popcnt_epi64 on a machine that supports AVX512, in code that has previously been optimized for AVX2.

Unfortunately, I ran into the issue that the function isn't found. The corresponding __m512i equivalent is found, however. Is the __m256i function deprecated?


Solution

  • _mm512_popcnt_epi64 is part of AVX512-VPOPCNTDQ. The 256-bit and 128-bit versions (_mm256_popcnt_epi64 / _mm_popcnt_epi64) also require AVX512VL, the feature that allows AVX512 instructions to be used with 128 or 256-bit vectors.

    Mainstream AVX512 CPUs all have AVX512-VL. Xeon Phi CPUs don't have AVX512-VL.

    (_mm512_popcnt_epi8 and epi16 are also new in Ice Lake, as part of AVX512-BITALG)

    Perhaps you forgot to enable the necessary compiler options (like GCC's -march=native to enable everything the machine you're compiling on can do), or you're compiling for a target that doesn't have both features. If so, the compiler won't have a definition for _mm256_popcnt_epi64 as an intrinsic, so in C it will assume it's an undeclared function and emit a call to it (which will of course not be found at link time), and/or it will warn or error (in C or C++) about a missing prototype.
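
    For example, a minimal sketch of the intended usage, assuming GCC or clang and a target that really has both features (the exact -march value is whichever one matches your hardware):

        // popcnt256.c -- build with e.g.
        //   gcc -O2 -mavx512vpopcntdq -mavx512vl popcnt256.c
        // or just -march=native / -march=icelake-client on a suitable machine.
        #include <immintrin.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            __m256i v = _mm256_set_epi64x(0xFF, 0x0F, 0x03, 0x01);
            __m256i c = _mm256_popcnt_epi64(v);   // needs AVX512-VPOPCNTDQ + AVX512VL

            uint64_t out[4];
            _mm256_storeu_si256((__m256i*)out, c);
            printf("%llu %llu %llu %llu\n",       // prints 8 4 2 1
                   (unsigned long long)out[3], (unsigned long long)out[2],
                   (unsigned long long)out[1], (unsigned long long)out[0]);
            return 0;
        }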

    Very few CPUs currently have AVX512-VPOPCNTDQ; see Wikipedia's AVX-512 feature vs. CPU matrix for which ones.
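
    Since so few CPUs have it, a portable binary usually wants a run-time check plus a fallback path. A minimal sketch using GCC/clang's __builtin_cpu_supports and the target function attribute (assuming those compilers; the feature-name strings are taken from their documented lists):

        #include <immintrin.h>

        // Only this function is allowed to use the new instructions; the rest of
        // the file can be compiled without any -mavx512* options.
        __attribute__((target("avx512vpopcntdq,avx512vl")))
        static __m256i popcnt64x4_avx512(__m256i v) {
            return _mm256_popcnt_epi64(v);
        }

        static int have_vpopcnt_256(void) {
            // Both feature bits are needed for the 256-bit form.
            return __builtin_cpu_supports("avx512vpopcntdq")
                && __builtin_cpu_supports("avx512vl");
        }

    Check have_vpopcnt_256() once at startup and dispatch to this or an AVX2 fallback accordingly.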


    Choosing between 256-bit and 512-bit vectors on Ice Lake is a tradeoff like on Skylake-X: while 512-bit vector uops are in flight, the vector ALUs on port 1 don't get used, and the max turbo clock speed may be lowered (see SIMD instructions lowering CPU frequency on Stack Overflow). So if you don't get much speedup from wider vectors (e.g. because of a memory bottleneck, or because your SIMD loops are only a tiny part of a larger program), it can hurt overall performance to use 512-bit vectors in one loop.

    But note that Ice Lake client CPUs aren't affected much, and I'm not sure whether vpopcnt instructions even count as "heavy"; they may not reduce max turbo much, if at all, on client CPUs. Most integer SIMD instructions don't count as heavy. See the discussion on LLVM's [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336). The vector ALU part of port 1 still shuts down while 512-bit uops are in flight, though.


    Other CPUs don't have hardware SIMD popcnt support at all, and no form of _mm512_popcnt_epi64 is available.

    Even if you only have AVX2, not AVX512 at all, SIMD popcnt is a win vs. scalar popcnt over non-tiny arrays on modern CPUs with fast vpshufb (_mm256_shuffle_epi8). https://github.com/WojciechMula/sse-popcount/ has AVX2 versions, and AVX512 versions that use vpternlogd for Harley-Seal accumulation to reduce the number of SIMD LUT lookups needed for popcounting.
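
    A minimal sketch of the basic vpshufb nibble-LUT approach (the idea behind that repo, not its tuned Harley-Seal code; assumes len is a multiple of 32, so a real version needs a scalar tail loop):

        #include <immintrin.h>
        #include <stdint.h>
        #include <stddef.h>

        // Popcount of a byte array using AVX2 only; build with -mavx2.
        uint64_t popcount_avx2(const uint8_t *data, size_t len)
        {
            // Popcount of every 4-bit value, repeated in both 128-bit lanes.
            const __m256i lut = _mm256_setr_epi8(
                0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
                0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
            const __m256i low_mask = _mm256_set1_epi8(0x0f);
            __m256i acc = _mm256_setzero_si256();

            for (size_t i = 0; i < len; i += 32) {
                __m256i v  = _mm256_loadu_si256((const __m256i*)(data + i));
                __m256i lo = _mm256_and_si256(v, low_mask);
                __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
                __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                              _mm256_shuffle_epi8(lut, hi));
                // vpsadbw vs. zero sums each group of 8 byte-counts into a qword.
                acc = _mm256_add_epi64(acc, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
            }

            // Horizontal sum of the four 64-bit accumulators.
            uint64_t tmp[4];
            _mm256_storeu_si256((__m256i*)tmp, acc);
            return tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }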

    The Stack Overflow question Counting 1 bits (population count) on large data using AVX-512 or AVX-2 also shows some code copied from that repo a couple of years ago.

    If you need separate counts per element (instead of one total over the whole array), use the standard vpshufb nibble-LUT unpack and then vpsadbw against a zero vector to horizontally sum the per-byte counts into 64-bit qword chunks, as sketched below.
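
    For example, a per-qword AVX2 emulation of vpopcntq might look like this (same nibble-LUT counting as above; just skip the final horizontal sum across elements):

        // Returns the popcount of each 64-bit element of v, as four qwords.
        static inline __m256i popcnt_epi64_avx2(__m256i v) {
            const __m256i lut = _mm256_setr_epi8(
                0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4,
                0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
            const __m256i low_mask = _mm256_set1_epi8(0x0f);
            __m256i lo  = _mm256_and_si256(v, low_mask);
            __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(v, 4), low_mask);
            __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                          _mm256_shuffle_epi8(lut, hi));
            // vpsadbw against zero sums the 8 byte-counts within each qword.
            return _mm256_sad_epu8(cnt, _mm256_setzero_si256());
        }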

    If you need positional popcount (separate sum for each bit-position), see https://github.com/mklarqvist/positional-popcount.