I was curious to see whether or not MSVC used the compiler intrinsic __popcnt for bitset::count
.
Looking around, I found this to be the implementation for std::bitset::count
for VS2017:
size_t count() const _NOEXCEPT
{ // count number of set bits
const char *const _Bitsperbyte =
"\0\1\1\2\1\2\2\3\1\2\2\3\2\3\3\4"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\1\2\2\3\2\3\3\4\2\3\3\4\3\4\4\5"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\2\3\3\4\3\4\4\5\3\4\4\5\4\5\5\6"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\3\4\4\5\4\5\5\6\4\5\5\6\5\6\6\7"
"\4\5\5\6\5\6\6\7\5\6\6\7\6\7\7\x8";
const unsigned char *_Ptr = &reinterpret_cast<const unsigned char&>(_Array);
const unsigned char *const _End = _Ptr + sizeof (_Array);
size_t _Val = 0;
for ( ; _Ptr != _End; ++_Ptr)
_Val += _Bitsperbyte[*_Ptr];
return (_Val);
}
It looks like its using lookup tables to get the number of bits for any given byte, and then counts the number of 1's for every byte.
According to this answer here, GCC implements it like this (along the lines of what I was thinking):
/// Returns the number of bits which are set.
size_t
count() const { return this->_M_do_count(); }
size_t
_M_do_count() const
{
size_t __result = 0;
for (size_t __i = 0; __i < _Nw; __i++)
__result += __builtin_popcountl(_M_w[__i]);
return __result;
}
Although I didn't benchmark anything, I would bet good money that GCC's implementation would be quite a bit faster here.
Hence, are there any compelling reason why MSVC implemented std::bitset::count
like this? My guess is either MSVC has a catch-all "no compiler intrinsics in STL" policy, or there's a difference between the two platforms that I'm overlooking.
Internal implementation of __builtin_popcountl
in GCC is not better, it is something like below depending on architecture.
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
And only for SSE4a instruction set, supported only in AMD CPUs starting from the Year 2006, __builtin_popcountl
consists of one assembler instruction POPCNT
.
MSDN says
Each of these intrinsics generates the popcnt instruction. The size of the value that the popcnt instruction returns is the same as the size of its argument. In 32-bit mode there are no 64-bit general-purpose registers, hence no 64-bit popcnt.
To determine hardware support for the popcnt instruction, call the __cpuid intrinsic with InfoType=0x00000001 and check bit 23 of CPUInfo[2] (ECX). This bit is 1 if the instruction is supported, and 0 otherwise. If you run code that uses this intrinsic on hardware that does not support the popcnt instruction, the results are unpredictable.
I assume MSVC team did not want to use the intrinsic with conditions in favor of one common solution independent from CPUs and architectures.