Tags: c, bit-manipulation, built-in

Count the 1-bits in an integer as fast as GCC's __builtin_popcount(int)


I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

int countBit1Fast(int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;   /* clears the lowest set bit */
    return c;
}

But a friend told me that __builtin_popcount(int) is a lot faster, though less portable. I gave it a try and it was MANY times faster! Why is it so fast? I want to count bits as fast as possible, but without being tied to a particular compiler.

EDIT: I may use it on PIC microcontrollers and maybe on non-Intel processors, so I need maximum portability.


Solution

  • I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

    I don't see why anyone would characterize your approach as "very fast". It's a bit clever, and it should be faster on average than naive alternatives. It also does not depend on the width of the representation of int, which is a plus. I observe that it has undefined behavior for negative arguments, but that's a common theme for bitwise operators and functions.

    Let's analyze, supposing a non-negative argument. Each n &= n - 1 clears the lowest set bit, so the loop body runs once per 1-bit; counting the decrement, the AND, the loop test, and the increment, that is about four operations per set bit, and a random 32-bit input averages 16 set bits, so on the order of 65 operations on average:

    int c = 0;
    for (; n; ++c)
        n &= n - 1;
    

    __builtin_popcount(), on the other hand, compiles to a single instruction regardless of input on platforms that provide one, as yours very likely does. Even on platforms without a dedicated population-count instruction, however, the job can be done faster (on average).
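
    For illustration only (this snippet is not from the original answer), the builtin is used like an ordinary function. With GCC or Clang targeting an x86-64 CPU that has POPCNT (e.g. compiled with -mpopcnt or -march=native), the call below compiles to a single popcnt instruction; without hardware support the compiler substitutes a software routine instead:

    #include <stdio.h>

    int main(void)
    {
        unsigned int x = 0xF0F0F0F0u;               /* 16 bits set */
        printf("%d\n", __builtin_popcount(x));      /* prints 16 */
        return 0;
    }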

    @dbush has presented one such mechanism that has similar advantages to the one you present. In particular, it does not depend on a pre-chosen integer width, and although it does depend on where in the representation the 1 bits reside, it does run faster for some arguments (smaller ones) than others. If I'm counting right, that one will average around 20 operations on random 32-bit inputs: five in each of four loop iterations (only 0.4% of random inputs would require fewer than four iterations). I'm counting one table read per iteration there, which I assume can be served from cache, but which is probably still not as fast as an arithmetic operation on values already held in registers.
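
    dbush's code is not reproduced in this excerpt, so what follows is only a sketch of a byte-at-a-time lookup-table popcount matching that description; the table-building macros and the names are mine, not dbush's:

    /* 256-entry table holding the popcount of every possible byte value. */
    #define B2(n)  (n), (n)+1, (n)+1, (n)+2
    #define B4(n)  B2(n), B2((n)+1), B2((n)+1), B2((n)+2)
    #define B6(n)  B4(n), B4((n)+1), B4((n)+1), B4((n)+2)
    static const unsigned char bits_set[256] = { B6(0), B6(1), B6(1), B6(2) };

    int countBitsByTable(unsigned int n)
    {
        int c = 0;
        while (n) {                      /* one iteration per remaining non-zero byte */
            c += bits_set[n & 0xffu];    /* mask, table read, add */
            n >>= 8;                     /* shift, then retest: ~5 operations per pass */
        }
        return c;
    }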

    One that is strictly computational would be:

    #include <stdint.h>

    /* SWAR: add neighbouring 1-bit counts, then 2-bit sums, 4-bit sums, ... */
    int countBit1Fast(uint32_t n) {
        n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
        n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
        n = (n & 0x0f0f0f0fu) + ((n >> 4) & 0x0f0f0f0fu);
        n = (n & 0x00ff00ffu) + ((n >> 8) & 0x00ff00ffu);
        n = (n & 0x0000ffffu) + ((n >>16) & 0x0000ffffu);
        return n;
    }
    

    That's pretty easy to count: five additions, five shifts, ten bitwise 'and' operations, and five loads of constants, for a total of 25 operations for every input (and it goes up only to 30 for 64-bit inputs, though those are then 64-bit operations instead of 32-bit ones). This version is, however, intrinsically dependent on a particular size of the input data type.
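
    Since the question's edit asks for maximum portability (PIC microcontrollers, possibly non-Intel processors), one common pattern, shown here only as a sketch and not taken from the original answer, is to use the builtin where the compiler provides it and to fall back to the portable loop from the question everywhere else:

    #include <stdint.h>

    #if defined(__GNUC__) || defined(__clang__)
    /* GCC and Clang provide __builtin_popcountl(); unsigned long is at
       least 32 bits wide, so the uint32_t argument is never truncated. */
    static inline int popcount32(uint32_t n)
    {
        return __builtin_popcountl((unsigned long)n);
    }
    #else
    /* Portable fallback: the bit-clearing loop from the question. */
    static inline int popcount32(uint32_t n)
    {
        int c = 0;
        for (; n; ++c)
            n &= n - 1;
        return c;
    }
    #endif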