I have been trying to use _mm256_popcnt_epi64 on a machine that supports AVX512, in code that had previously been optimized for AVX2. Unfortunately, I ran into the issue that the function isn't found. The corresponding __m512i equivalent is found, however. Is the __m256i function deprecated?
_mm512_popcnt_epi64 is part of AVX512-VPOPCNTDQ. The 256- and 128-bit versions also require AVX512VL, the feature that allows AVX512 instructions to be used with 128- or 256-bit vectors. Mainstream AVX512 CPUs all have AVX512VL; Xeon Phi CPUs don't. (_mm512_popcnt_epi8 and the epi16 version are also new in Ice Lake, as part of AVX512-BITALG.)
Perhaps you forgot to enable the necessary compiler options (like GCC's -march=native to enable everything the machine you're compiling on can do), or you're compiling for a target that doesn't have both features. If so, the compiler won't have a definition for _mm256_popcnt_epi64 as an intrinsic, so in C it will assume it's an undeclared function and emit a call to it (which will of course not be found at link time), and/or it will warn or error (C or C++) about a missing prototype.
Very few CPUs currently have AVX512-VPOPCNTDQ (see Wikipedia's AVX512 feature vs. CPU matrix):
Knights Mill (final-generation Xeon Phi): only AVX512-VPOPCNTDQ, no AVX512VL and no BITALG. So only the __m512i versions are available with gcc -O3 -march=knm. You should definitely be using 512-bit vectors on Xeon Phi anyway, unless your data layout works perfectly for 256-bit and would take extra shuffling for 512-bit. But beware that it's slow for some AVX / AVX2 instructions that it doesn't have 512-bit versions of, like shuffles with elements smaller than 32 bits (no AVX512BW).
Ice Lake / Tiger Lake: have AVX512-VPOPCNTDQ, BITALG, and AVX512VL, so _mm256_popcnt_epi64 and _mm256_popcnt_epi8 are supported when compiling for those target microarchitectures, e.g. gcc -O3 -march=icelake-client (assuming your compiler's headers are correct).
GCC 8.3 and earlier have a bug where -march=icelake-client / icelake-server doesn't enable -mavx512vpopcntdq. (GCC 7 doesn't know about -march=icelake-client at all.) It's fixed in GCC 8.4, so either upgrade to the latest GCC 8, or better, upgrade to the latest stable GCC; a couple more years of development usually help GCC make better code with new ISA extensions like AVX-512, especially with mask registers. Or just manually use -march=icelake-client -mavx512vpopcntdq; that does work: https://godbolt.org/z/a7bhcjdhr
Choosing between 256 vs. 512-bit vectors on Ice Lake is a tradeoff like on Skylake-X: while 512-bit vector uops are in flight, the vector ALUs on port 1 don't get used, and max turbo clock speed may be lowered (see "SIMD instructions lowering CPU frequency" on Stack Overflow). So if you don't get much speedup from wider vectors (e.g. because of a memory bottleneck, or because your SIMD loops are only a tiny part of a larger program), using 512-bit vectors in one loop can hurt overall performance.
But note that Ice Lake client CPUs aren't affected much, and I'm not sure if vpopcnt instructions even count as "heavy", maybe not reducing max turbo as much, if at all, on client CPUs. Most integer SIMD instructions don't count. See discussion on LLVM [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336). The vector ALU part of port 1 still shuts down while 512-bit uops are in flight, though.
Other CPUs don't have hardware SIMD popcnt support at all, and no form of _mm512_popcnt_epi64 is available.
Even if you only have AVX2 and no AVX512 at all, SIMD popcount is a win vs. scalar popcnt over non-tiny arrays on modern CPUs with fast vpshufb (_mm256_shuffle_epi8). https://github.com/WojciechMula/sse-popcount/ has AVX2 versions, and AVX512 versions that use vpternlogd for Harley-Seal accumulation to reduce the number of SIMD LUT lookups needed for popcounting.
Also on Stack Overflow, Counting 1 bits (population count) on large data using AVX-512 or AVX-2 shows some code copied from that repo a couple of years ago.
If you need counts for separate elements, just use the standard unpack for vpshufb, then vpsadbw against a zero vector to hsum into 64-bit qword chunks.
If you need positional popcount (separate sum for each bit-position), see https://github.com/mklarqvist/positional-popcount.