Tags: c, bit-manipulation, built-in

Count the 1-bits in an integer as fast as GCC's __builtin_popcount(int)


I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

int countBit1Fast(int n)
{
    int c = 0;
    for (; n; ++c)
        n &= n - 1;   /* clears the lowest set bit */
    return c;
}

But a friend told me that __builtin_popcount(int) is a lot faster, though less portable. I gave it a try and it was MANY times faster! Why is it so fast? I want to count bits as fast as possible, but without being tied to a particular compiler.

EDIT: I may use it on PIC microcontrollers and maybe on non-Intel processors, so I need maximum portability.


Solution

  • I wrote an algorithm (taken from "The C Programming Language") that counts the number of 1-bits very fast:

    I don't see why anyone would characterize your approach as "very fast". It's a bit clever, and it should be faster on average than naive alternatives. It also does not depend on the width of the representation of int, which is a plus. I observe that it has undefined behavior for negative arguments, but that's a common theme for bitwise operators and functions.

    Let's analyze, supposing a non-negative argument. Each n &= n - 1 clears the lowest set bit, so the loop body runs once per 1-bit; counting the decrement, the AND, the loop test, and the increment, that is about four operations per set bit, and a random 32-bit input averages 16 set bits, so on the order of 65 operations on average:

    int c = 0;
    for (; n; ++c)
        n &= n - 1;
    

    __builtin_popcount(), on the other hand, compiles to a single instruction regardless of input on platforms that provide one, as yours very likely does. Even on platforms without a dedicated population-count instruction, however, the job can be done faster (on average).
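
    For illustration only (this snippet is not from the original answer), the builtin is used like an ordinary function. With GCC or Clang targeting an x86-64 CPU that has POPCNT (e.g. compiled with -mpopcnt or -march=native), the call below compiles to a single popcnt instruction; without hardware support the compiler substitutes a software routine instead:

    #include <stdio.h>

    int main(void)
    {
        unsigned int x = 0xF0F0F0F0u;               /* 16 bits set */
        printf("%d\n", __builtin_popcount(x));      /* prints 16 */
        return 0;
    }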

    @dbush has presented one such mechanism that has similar advantages to the one you present. In particular, it does not depend on a pre-chosen integer width, and although it does depend on where in the representation the 1 bits reside, it does run faster for some arguments (smaller ones) than others. If I'm counting right, that one will average around 20 operations on random 32-bit inputs: five in each of four loop iterations (only 0.4% of random inputs would require fewer than four iterations). I'm counting one table read per iteration there, which I assume can be served from cache, but which is probably still not as fast as an arithmetic operation on values already held in registers.
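
    dbush's code is not reproduced in this excerpt, so what follows is only a sketch of a byte-at-a-time lookup-table popcount matching that description; the table-building macros and the names are mine, not dbush's:

    /* 256-entry table holding the popcount of every possible byte value. */
    #define B2(n)  (n), (n)+1, (n)+1, (n)+2
    #define B4(n)  B2(n), B2((n)+1), B2((n)+1), B2((n)+2)
    #define B6(n)  B4(n), B4((n)+1), B4((n)+1), B4((n)+2)
    static const unsigned char bits_set[256] = { B6(0), B6(1), B6(1), B6(2) };

    int countBitsByTable(unsigned int n)
    {
        int c = 0;
        while (n) {                      /* one iteration per remaining non-zero byte */
            c += bits_set[n & 0xffu];    /* mask, table read, add */
            n >>= 8;                     /* shift, then retest: ~5 operations per pass */
        }
        return c;
    }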

    One that is strictly computational would be:

    #include <stdint.h>

    /* SWAR: add neighbouring 1-bit counts, then 2-bit sums, 4-bit sums, ... */
    int countBit1Fast(uint32_t n) {
        n = (n & 0x55555555u) + ((n >> 1) & 0x55555555u);
        n = (n & 0x33333333u) + ((n >> 2) & 0x33333333u);
        n = (n & 0x0f0f0f0fu) + ((n >> 4) & 0x0f0f0f0fu);
        n = (n & 0x00ff00ffu) + ((n >> 8) & 0x00ff00ffu);
        n = (n & 0x0000ffffu) + ((n >>16) & 0x0000ffffu);
        return n;
    }
    

    That's pretty easy to count: five additions, five shifts, ten bitwise 'and' operations, and five loads of constants, for a total of 25 operations for every input (and it goes up only to 30 for 64-bit inputs, though those are then 64-bit operations instead of 32-bit ones). This version is, however, intrinsically dependent on a particular size of the input data type.
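
    Since the question's edit asks for maximum portability (PIC microcontrollers, possibly non-Intel processors), one common pattern, shown here only as a sketch and not taken from the original answer, is to use the builtin where the compiler provides it and to fall back to the portable loop from the question everywhere else:

    #include <stdint.h>

    #if defined(__GNUC__) || defined(__clang__)
    /* GCC and Clang provide __builtin_popcountl(); unsigned long is at
       least 32 bits wide, so the uint32_t argument is never truncated. */
    static inline int popcount32(uint32_t n)
    {
        return __builtin_popcountl((unsigned long)n);
    }
    #else
    /* Portable fallback: the bit-clearing loop from the question. */
    static inline int popcount32(uint32_t n)
    {
        int c = 0;
        for (; n; ++c)
            n &= n - 1;
        return c;
    }
    #endif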