c++ performance assembly cpu avx

How to run bitwise OR on big vectors of u64 in the most performant manner?


As a thought experiment, I'm trying to figure out the most efficient way to run a bitwise OR over bitmasks of relatively large size (~1M u64 elements) on a single thread.

I'm running it on my laptop with an i7-8750H CPU at 2.2 GHz, which supports AVX2 but not AVX-512. As mentioned, I'm trying to figure out what I can squeeze out of a single thread. I'm also fairly sure that compressing my bitmasks into something like roaring bitmaps would yield additional speed-ups, since the CPU would have to ingest fewer bytes. But first I'd like to understand whether there's something obviously wrong with my current, rather simple implementation. I'm targeting x86-64.

I tried to estimate the speed ceiling of this operation. Let's start with the obviously very wrong assumption that all the data is already in registers. _mm256_or_si256 performs a bitwise OR on 4x u64s at once. I'll assume I can only execute one such operation per cycle, which is likely also wrong; the CPU can probably do 2-3. We're going to use vectors of 1M u64s for our tests. That gives 1M / 4 per cycle = 250K cycles. Under load my CPU's clock rate stays at ~3 GHz, so 250K cycles / 3 GHz ≈ 0.00008(3) s ≈ 83 µs. That's the naively fast ceiling we'll use as a reference value.

Let's turn this into code. I used ChatGPT to generate a big chunk of it; sorry in advance, I'm not a C++ developer and am fairly ignorant of low-level development.

#include <iostream>
#include <vector>
#include <random>
#include <chrono>
#include <cstdint>
#include <immintrin.h>

std::vector<uint64_t> generate_random_u64s(size_t amount) {
    std::vector<uint64_t> random_u64s;
    random_u64s.reserve(amount);
    std::random_device rd; // Non-deterministic seed source
    std::mt19937_64 eng(rd()); // Seed the 64-bit Mersenne Twister engine
    std::uniform_int_distribution<uint64_t> distr; // Uniform over the full uint64_t range

    for (size_t i = 0; i < amount; ++i) {
        random_u64s.push_back(distr(eng));
    }
    return random_u64s;
}

void avx_bitwise_or(
    const std::vector<uint64_t>& vector1, 
    const std::vector<uint64_t>& vector2, 
    std::vector<uint64_t>& result
) {
    size_t size = vector1.size();
    
    // Ensure result has enough space
    if (result.size() != size) {
        result.resize(size);
    }

    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        // Prefetch 256 elements (2 KiB) ahead into all cache levels
        _mm_prefetch(reinterpret_cast<const char*>(&vector1[i + 256]), _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char*>(&vector2[i + 256]), _MM_HINT_T0);

        // Load vectors using AVX
        __m256i vec1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector1[i]));
        __m256i vec2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector2[i]));

        // Perform bitwise OR operation
        __m256i vec_result = _mm256_or_si256(vec1, vec2);

        // Use non-temporal store to write result back to memory bypassing CPU cache
        _mm256_stream_si256(reinterpret_cast<__m256i*>(&result[i]), vec_result);
    }

    // Handle remaining elements that don't fit in AVX registers
    for (; i < size; ++i) {
        result[i] = vector1[i] | vector2[i];
    }
}

int main() {
    std::cout << "Starting" << std::endl;
    const size_t size = 1'000'000;
    auto vector1 = generate_random_u64s(size);
    auto vector2 = generate_random_u64s(size);

    auto result = std::vector<uint64_t>(size);

    auto start = std::chrono::high_resolution_clock::now();
    const int repetitions = 10000;

    for (int i = 0; i < repetitions; ++i) {
        avx_bitwise_or(vector1, vector2, result);
    }

    auto duration = std::chrono::high_resolution_clock::now() - start;

    uint32_t popcnt = 0;
    for (const auto& x : result) {
        popcnt += __builtin_popcountll(x); // Count the number of set bits (1s)
    }

    std::cout << "Popcnt is: " << popcnt << std::endl;
    std::cout << "Time elapsed is: " << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << " ms" << std::endl;

    return 0;
}

I build it with g++ -O3 -mavx2 -o experiment main.cpp. When I run it, it takes around 9 seconds for 10,000 iterations, which is 900 µs per iteration. That's ~10+ times slower than our naive back-of-the-envelope estimate. I tried unrolling the loop, but it didn't give me any tangible performance improvement; maybe I just did it wrong somehow.
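
For illustration, unrolling the inner loop of avx_bitwise_or by 2x would look roughly like this (a sketch of the idea, not necessarily the exact variant I benchmarked):

size_t i = 0;
for (; i + 8 <= size; i += 8) {
    // Two independent load/OR/store groups per iteration (8 u64s total),
    // using the same vector1/vector2/result/size as in avx_bitwise_or above.
    __m256i a0 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector1[i]));
    __m256i b0 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector2[i]));
    __m256i a1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector1[i + 4]));
    __m256i b1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector2[i + 4]));
    _mm256_stream_si256(reinterpret_cast<__m256i*>(&result[i]),     _mm256_or_si256(a0, b0));
    _mm256_stream_si256(reinterpret_cast<__m256i*>(&result[i + 4]), _mm256_or_si256(a1, b1));
}
// The scalar tail loop then handles the remaining 0-7 elements as before.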

Let's contrast that with code that avoids reading from memory:

uint64_t single_register_test (
    const std::vector<uint64_t>& vector1, 
    const std::vector<uint64_t>& vector2
) {
    size_t size = vector1.size();

    // Load vectors using AVX, use only first 4 elements and ignore everything else
    __m256i vec1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector1[0]));
    __m256i vec_result = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vector2[0]));

    // Perform bitwise OR operation on the same data over and over again
    for (size_t i = 0; i + 4 <= size; i += 4) {
        vec_result = _mm256_or_si256(vec1, vec_result);
    }

    // Get first u64 from the result
    auto result = _mm256_extract_epi64(vec_result, 0);

    return result;
}

// ...........

    auto start = std::chrono::high_resolution_clock::now();
    const int repetitions = 100'000;

    uint32_t popcnt_single = 0;
    for (int i = 0; i < repetitions; ++i) {
        auto x = single_register_test(vector1, vector2);
        popcnt_single += __builtin_popcountll(x);
    }
    std::cout << "Popcnt single is: " << popcnt_single << std::endl;
    auto duration = std::chrono::high_resolution_clock::now() - start;


It took ~9 seconds for 100K iterations over the same 1M x u64 vectors, which is 90 µs per iteration, very close to our 83 µs naive reference value.

So I wonder: what can I do to push the performance of the version that reads from memory further? Is there anything I can do at all?


Solution

  • Your arrays don't fit even in L3 cache, so your bottleneck is DRAM bandwidth, not computation.
    If your data fit in L1d cache, your analysis would be about right.

    At stock DRAM speeds, that CPU's theoretical max memory bandwidth is 41.8 GB/s (https://ark.intel.com/content/www/us/en/ark/products/134906/intel-core-i7-8750h-processor-9m-cache-up-to-4-10-ghz.html), so about 14 bytes per clock cycle at 3 GHz (total read + written; DRAM is half-duplex).

    And it's a desktop/laptop Skylake-family CPU, so you can realistically expect a single core in a single-threaded program to get close to that bandwidth (unlike on a many-core Xeon): maybe 80 to 90% of the theoretical max DRAM bandwidth.

    That 14 B/c theoretical peak DRAM bandwidth is far below the 2x 256-bit loads + 1x 256-bit store per clock cycle you're estimating to feed one vpor per cycle; that would be peak L1d bandwidth (96 B/cycle).
    Computation is not remotely close to being a bottleneck at this array size; it would only start to matter with arrays of about 8K, so that all three could fit in L1d cache.

    For smaller arrays, obviously don't use NT aka stream stores; they bypass cache and force eviction if the destination cache line was previously hot.
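
    In that case the only change needed in the loop is the store itself (a trivial sketch; everything else stays the same):

        // Plain store: keeps the freshly written line in cache instead of
        // bypassing it the way _mm256_stream_si256 does.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(&result[i]), vec_result);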

    For larger arrays, _mm256_stream_si256 should be faster than plain stores for this. With plain stores that miss in cache, the CPU has to read the old data into cache (a MESI Read For Ownership) so it can merge your store into the rest of the cache line; the RFO also gets exclusive ownership so the core is allowed to flip the line to Modified. NT stores just invalidate copies in other caches without doing that read.

    To go faster in your real use-case, you'd need to cache-block your problem (do more with chunks of your data while they're hot in cache), or increase the computational intensity by combining more work into one pass over the data, e.g. popcount the OR results as you compute them instead of storing them to a result array and reading it back. https://github.com/WojciechMula/sse-popcount/ has well-optimized popcount code for SSE4, AVX, and AVX-512 with/without VPOPCOUNT; you can adapt it to take 2 input arrays and OR them, instead of just loading from one array.
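
    For example, if what you ultimately need is the popcount of the OR (as your main() computes at the end), a fused single pass never has to write the result array at all. Below is only a minimal sketch of that idea, using plain __builtin_popcountll per 64-bit lane rather than the optimized Harley-Seal kernels from that repo; it reuses the same headers as your code (<immintrin.h>, <vector>, <cstdint>) and the same g++ -O3 -mavx2 build:

    uint64_t or_popcount(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
        uint64_t total = 0;
        size_t size = a.size();   // assumes a.size() == b.size()
        size_t i = 0;
        for (; i + 4 <= size; i += 4) {
            // OR four u64s at a time and popcount the lanes without ever storing them.
            __m256i v = _mm256_or_si256(
                _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&a[i])),
                _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&b[i])));
            total += __builtin_popcountll(_mm256_extract_epi64(v, 0));
            total += __builtin_popcountll(_mm256_extract_epi64(v, 1));
            total += __builtin_popcountll(_mm256_extract_epi64(v, 2));
            total += __builtin_popcountll(_mm256_extract_epi64(v, 3));
        }
        for (; i < size; ++i) {   // scalar tail
            total += __builtin_popcountll(a[i] | b[i]);
        }
        return total;
    }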


    "for this particular case it [popcount] was mostly used to prevent the optimizer from throwing out useless computations."

    Speaking of that, I'm surprised GCC fails to optimize away the loop in the no-load/store version. v |= x; is idempotent after the first iteration; Clang knows this and correctly removes the loop (and only ORs the scalar element that you extract; LLVM's shuffle optimizer can see through shuffle intrinsics). I tried some variations on the loop condition to make sure GCC wasn't getting tripped up by a possibly-infinite loop (<= size with an unsigned size_t), but that's not it (https://godbolt.org/z/zxz8f86M9). GCC's asm output has a dependency chain through vpor like your source does, so this is measuring vpor latency, not throughput, for the reg-reg version. (That's fine: with 2 loads per vpor you're not going to go faster except on Alder Lake and Zen 3 and later, and then only with data hot in L1d.)

    Your benchmark avoids many of the pitfalls mentioned in "Idiomatic way of performance evaluation?", e.g. using std::vector for the output means it gets zero-filled before the timed region, so you aren't timing page faults even on the first iteration of the repeat loop. And the repeat loop gives plenty of time for turbo clocks to warm up and for any startup effects to be amortized.