This might be a silly question, but I searched around on Google and did not get a definite answer (maybe I should have searched more carefully).
Given a series of bytes, what is the fastest way to know the number of 1 bits in it?
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t nbytes = 256;
std::array<std::uint8_t, nbytes> bytes;
std::size_t table[256]{0, 1, 1, 2, 1, ...}; // number of 1 bits for each byte value from 0x00 to 0xff
std::size_t count1() {
    std::size_t res = 0;
    for (std::size_t i{0}; i < nbytes; ++i) { res += table[bytes[i]]; }
    return res;
}
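(As an aside, not part of the original question: the 256-entry table does not have to be written out by hand. A sketch of computing it at compile time with a `constexpr` function:)

```cpp
#include <array>
#include <cstddef>

// Build the 256-entry popcount lookup table at compile time.
constexpr std::array<std::size_t, 256> make_table() {
    std::array<std::size_t, 256> t{};
    for (std::size_t v = 0; v < 256; ++v) {
        std::size_t bits = 0;
        for (std::size_t x = v; x != 0; x >>= 1)
            bits += x & 1u;          // count the low bit, then shift it out
        t[v] = bits;
    }
    return t;
}

constexpr auto table = make_table();
static_assert(table[0x00] == 0 && table[0x01] == 1 && table[0xff] == 8);
```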
Is there another method that is widely trusted to be the fastest for this task?
"The fastest way" is a bit of an aspirational goal, but a fast way would be to use std::popcount:
#include <bit>
#include <cstddef>
#include <cstdint>
#include <span>

std::size_t count_one_bits_naive(std::span<const std::uint8_t> bytes) {
    std::size_t sum = 0;
    for (std::uint8_t b : bytes)
        sum += std::popcount(b);
    return sum;
}
Unfortunately, GCC does not optimize this well; it processes the array byte by byte, exactly as written, even when the architecture has a popcnt instruction that can process 64 bits at a time.
A better version would at least process 64 bits at a time:
#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <span>

std::size_t count_one_bits_bulk(std::span<const std::uint8_t> bytes) {
    std::size_t sum = 0;
    std::size_t i = 0;
    // process as many full 64-bit blocks as possible
    for (; i + sizeof(std::uint64_t) <= bytes.size(); i += sizeof(std::uint64_t)) {
        std::uint64_t piece;
        std::memcpy(&piece, bytes.data() + i, sizeof piece);
        sum += std::popcount(piece);
    }
    // process the trailing bytes (at most 7)
    for (; i < bytes.size(); ++i) {
        sum += std::popcount(bytes[i]);
    }
    return sum;
}
See also Compiler Explorer
Here are some benchmarks: https://quick-bench.com/q/negFVF-WaxxVfS9rejJL3NfSGCo

Unfortunately, quick-bench doesn't let you specify architecture flags, so this measures the software fallback __popcountdi2 instead of the popcnt instruction.
However, the benchmark demonstrates the principle that processing bytes in bulk should be faster.