c++binarycounting

What is the fastest way to compute number of 1 bits in an array of bytes?


I might be silly to ask this, but I search around in google and did not got an definite answer(maybe I should be more careful).

Given a series of bytes, what is the fastest way to know the number of 1 bits in it?

#include <array>

constexpr size_t nbytes = 256;
std::array<uint8_t, nbytes> bytes;

size_t table[256]{0, 1, 1, 2, 1, ...}; // define table from 0x00 to 0xff
size_t count1() {
    size_t res = 0;
    for (size_t i{0}; i < nbytes; ++i) { res += table[static_cast<int>(bytes[i])];}
    return res;
}

Is there any other method that is widely trusted to always be the fastest to cover this task?


Solution

  • The fastest way is a bit of an aspirational goal, but a fast way would be to use std::popcount:

    std::size_t count_one_bits_naive(std::span<const std::uint8_t> bytes) {
        std::size_t sum = 0;
        for (std::uint8_t b : bytes)
            sum += std::popcount(b);
        return sum;
    }
    

    Unfortunately, GCC does not optimize this well; it processes the array byte by byte as written, even when the the architecture has a popcnt instruction that can process 64 bits at a time.

    A better version would at least process 64 bits at a time:

    std::size_t count_one_bits_bulk(std::span<const uint8_t> bytes) {
        std::size_t sum = 0;
        std::size_t i = 0;
        // process as many blocks of 64 bits as possible
        for (; i + sizeof(uint64_t) < bytes.size(); i += sizeof(uint64_t)) {
            uint64_t piece;
            std::memcpy(&piece, bytes.data() + i, sizeof(uint64_t));
            sum += std::popcount(piece);
        }
        // process the trailing bytes (at most 7)
        for (; i < bytes.size(); ++i) {
            sum += std::popcount(bytes[i]);
        }
        return sum;
    }
    

    See also Compiler Explorer

    Here are some benchmarks https://quick-bench.com/q/negFVF-WaxxVfS9rejJL3NfSGCo: benchmark results, showing that the Naive implementations is 7.4x slower

    Unfortunately, quick-bench doesn't let you specify architecture flags, so this is measuring using the software fallback call __popcountdi2 instead of the popcnt instruction. However, the benchmark demonstrates the principle that processing bytes in bulk should be faster.