Why are these bitset operations faster than array operations?

I'm messing around with various ways to count primes in C, and I thought a simple way to save space in a simple sieve of Eratosthenes is to replace a bool *is_prime with a bitset. I used the C FAQ bitset implementation. What surprised me is that it also saved significant time (1.1 seconds to 0.7 seconds for a limit of 100_000_000). Below is my code for the sieve using an array (including some time-saving optimizations):

unsigned long array_eratosthenes(unsigned long limit) {
    if (limit < 2) {
        return 0;
    }
    if (limit == 2) {
        return 1;
    }
    bool *not_prime = calloc(limit + 1, sizeof(bool));
    for (unsigned long i = 4; i <= limit; i += 2) {
        not_prime[i] = true;
    }
    for (unsigned long i = 9; i <= limit; i += 3) {
        not_prime[i] = true;
    }

    unsigned long count = 2;
    for (unsigned long n = 5; n <= limit; n += 6) {
        if (!not_prime[n]) {
            ++count;
            for (unsigned long x = n * n; x <= limit; x += n) {
                not_prime[x] = true;
            }
        }
        unsigned long n_plus_2 = n + 2;
        if (n_plus_2 > limit) {
            break;
        }
        if (!not_prime[n_plus_2]) {
            ++count;
            for (unsigned long x = n_plus_2 * n_plus_2; x <= limit;
                 x += n_plus_2) {
                not_prime[x] = true;
            }
        }
    }

    free(not_prime);
    return count;
}

And the code for the sieve using a bitset:

#define BITMASK(b) (1 << ((b) % CHAR_BIT))
#define BITSLOT(b) ((b) / CHAR_BIT)
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITCLEAR(a, b) ((a)[BITSLOT(b)] &= ~BITMASK(b))
#define BITTEST(a, b) ((a)[BITSLOT(b)] & BITMASK(b))
#define BITNSLOTS(num_bits) (((num_bits) + CHAR_BIT - 1) / CHAR_BIT)

unsigned long bitset_eratosthenes(unsigned long limit) {
    if (limit < 2) {
        return 0;
    }
    if (limit == 2) {
        return 1;
    }
    char *not_prime = calloc(BITNSLOTS(limit + 1), sizeof(char));
    for (unsigned long i = 4; i <= limit; i += 2) {
        BITSET(not_prime, i);
    }
    for (unsigned long i = 9; i <= limit; i += 3) {
        BITSET(not_prime, i);
    }

    unsigned long count = 2;
    for (unsigned long n = 5; n <= limit; n += 6) {
        if (!BITTEST(not_prime, n)) {
            ++count;
            for (unsigned long x = n * n; x <= limit; x += n) {
                BITSET(not_prime, x);
            }
        }
        unsigned long n_plus_2 = n + 2;
        if (n_plus_2 > limit) {
            break;
        }
        if (!BITTEST(not_prime, n_plus_2)) {
            ++count;
            for (unsigned long x = n_plus_2 * n_plus_2; x <= limit;
                 x += n_plus_2) {
                BITSET(not_prime, x);
            }
        }
    }

    free(not_prime);
    return count;
}

The time saved from callocing a smaller array cannot be that substantial, and the bitset implementation likely performs the same number of array operations as the array implementation, with additional overhead from the bit operations. Does anyone know why the bitset implementation would be faster?

Solution

There is an easy speedup of the FAQ bitset implementation which avoids a lot of repeated computation of bit masks as 1<<N for 0<=N<=7. Namely replacing them by a constant array lookup table of set bits. It shaves about 1 cycle off the loop time of typically 10 cycles. I also moved the memory allocation outside the timing loop since it makes things far too erratic. Times shown below are averages of 5 excluding bad outliers (Windows is nothing like a realtime system and the odd point is wildly wrong in any given run). You should get something very similar in a single run but with one or two spikes. The exact curve shape will depend on how much cache you have in each of L1D, L2 and L3.

I have also added macros to allow you to choose other larger sizes for boolean and explore how doubling the size of a bool affects performance (be careful if your machine has less than 16GB ram). I used MSVC 17.1 C/C++ the curves were virtually the same on Intel ICX 2023 (slightly less fast).

As others have pointed out the biggest speedup you could make here is by only storing odd numbers (and with a small amount of effort you can also avoid storing all multiples of 3 too). The smaller your memory requirements are the further you can go before you exhaust cache memory and then ram.

This is a graph of how the original boolean (which on MSVC is in a char) compares with your original bitset implementation for array length `2^N`. I haven't shown the slightly faster LUT variant for clarity. It would be about 1 tick below the blue line.

This was on an Intel i5-12600 with caches L1D 80kB, L2 2MB, L3 20MB and 16GB ram.

You can clearly see the effect of the L1D cache on the bool array sizes under 80kB. Boolean has maximum advantage for array lengths 4096 to 65536. Bitset has maximum advantage over bool at around 270MB. I am mystified why the bitset timing curve rises a lot more steeply after that value.

What is clear is that for arrays up to 131072 the overheads of the bitset implementation outweigh the simplicity of the classic array of bool array. For array lengths 4 million and beyond bitset is the clear winner with boolean (yellow curve) much slower. Given that my L2 cache is 2MB this explains why it struggles beyond that point. The corresponding L2 cache limit for bitset (8 bits per byte) isn't reached until 16.8 million array elements (blue curve).

This is the sample code with minor improvements added. Default is ORIGINAL & CASE=0 which matches the OP's original code. CASE 1 is char, 2 is short and 3 is int. Comment out

#include <iostream>
#include "intrin.h"

// Crude timing test harness for original sieve code boolean v bitset
// Slightly faster LUT based bitmask added ORIGINAL & CASE (0) are OP's code
// provision added to make custom booleans with size char, short or int
// Code condensed somewhat to shorten it where possible.

#define ORIGINAL
#define CASE  (0)

#ifdef ORIGINAL
#define BITMASK(b) (1 << ((b) % CHAR_BIT))
#define BITSLOT(b) ((b) / CHAR_BIT)
#else
// hard coded micro optimised bits lookup table for CHAR_BIT == 8

const char bitmask[] = { '\1', '\2', '\4', '\10', '\20', '\40', '\100', '\200' };
#define BITMASK(b) bitmask[b & 0x7]
#define BITSLOT(b) ((b) >> 3)
#endif
#define BITSET(a, b) ((a)[BITSLOT(b)] |= BITMASK(b))
#define BITCLEAR(a, b) ((a)[BITSLOT(b)] &= ~BITMASK(b))
#define BITTEST(a, b) ((a)[BITSLOT(b)] & BITMASK(b))
#define BITNSLOTS(num_bits) (((num_bits) + CHAR_BIT - 1) / CHAR_BIT)

#if CASE==0
#define mybool  bool
#define mytrue  true
#define myfalse false
#else
#define mytrue  1
#define myfalse 0
#ifdef CASE==1
#define mybool char
#elif CASE==2
#define mybool short
#elif CASE==3
#define mybool int
#endif
#endif

unsigned long bitset_eratosthenes(mybool* n_prime, unsigned long limit) {
    if (limit < 2)  return 0;  
    if (limit == 2) return 1;
    
//    char* not_prime = (char*)calloc(BITNSLOTS(limit + 1), sizeof(char));
    char* not_prime = (char*)n_prime;
    for (unsigned long i = 4; i <= limit; i += 2) BITSET(not_prime, i);
    
    for (unsigned long i = 9; i <= limit; i += 3) BITSET(not_prime, i);

    unsigned long count = 2;
    for (unsigned long n = 5; n <= limit; n += 6) {
        if (!BITTEST(not_prime, n)) {
            ++count;
            for (unsigned long x = n * n; x <= limit; x += n) BITSET(not_prime, x);
        }
        unsigned long n_plus_2 = n + 2;
        if (n_plus_2 > limit) {
            break;
        }
        if (!BITTEST(not_prime, n_plus_2)) {
            ++count;
            for (unsigned long x = n_plus_2 * n_plus_2; x <= limit; x += n_plus_2) {
                BITSET(not_prime, x);
            }
        }
    }
//    free(not_prime);
    return count;
}
 
unsigned long array_eratosthenes(mybool* not_prime, unsigned long limit) {
    if (limit < 2) return 0;
    if (limit == 2) return 1;

//    mybool* not_prime = (mybool *) calloc(limit + 1, sizeof(mybool));
    for (unsigned long i = 4; i <= limit; i += 2) not_prime[i] = mytrue;
    
    for (unsigned long i = 9; i <= limit; i += 3) not_prime[i] = mytrue;

    unsigned long count = 2;
    for (unsigned long n = 5; n <= limit; n += 6) {
        if (!not_prime[n]) {
            ++count;
            for (unsigned long x = n * n; x <= limit; x += n) not_prime[x] = mytrue;
        }
        unsigned long n_plus_2 = n + 2;
        if (n_plus_2 > limit) break;

        if (!not_prime[n_plus_2]) {
            ++count;
            for (unsigned long x = n_plus_2 * n_plus_2; x <= limit; x += n_plus_2) {
                not_prime[x] = mytrue;
            }
        }
    }
 //   free(not_prime);
    return count;
}

void time_fun(unsigned long (*fun)(mybool *, unsigned long), unsigned long limit)
{
    unsigned long long start, end;
    unsigned long x;
    mybool* buffer = (mybool*)calloc(limit + 1, sizeof(mybool));
    if (limit > 0)
    {
        start = __rdtsc();
        x = (*fun)(buffer, limit);
        end = __rdtsc();
        printf(" %lu %lu %llu\n", limit, x, end - start);
    }
    else
        printf("   N  primes(N) time \n"); // titles
    free(not_prime);
}

int main()
{
    unsigned long i = 1UL;
    time_fun(NULL, 0);
    while (i) { time_fun(array_eratosthenes, i);  i <<= 1;  }
    i = 1UL;
    while (i) { time_fun(bitset_eratosthenes, i);  i <<= 1; }
}

Benchmarks on a single run after allocating a large array are not ideal but this sample test harness is still good enough to show the differences between the bool array and bitset.

So in essence for small arrays under 2 million elements using boolean is simpler and faster or as fast as the bitset method. For much larger arrays the bitset allows you to use available memory much more efficiently retaining very high throughput upto the limits of L3 cache size.