c++ optimization

Efficient calculation of the sum of each entry in two huge vectors


I have two large vectors containing a set of numbers. A typical case would have about 150,000 numbers in each vector, ranging from 0 to about 2,000,000. I want to compute, as fast as possible, a third vector that marks every value that can be formed as the sum of a number from the first vector and a number from the second vector.

My current code:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <vector>

namespace
{
    // This function is purely for benchmarking, the numbers are retrieved
    // from other sources in reality, and have the following properties:
    // - The numbers in the vector are sorted
    // - The numbers are guaranteed unique
    std::vector<uint32_t> populate(size_t pRange, size_t pValues)
    {
        std::vector<uint32_t> result;
        std::srand((unsigned)std::time(0));
        result.reserve(pValues);
        for (uint32_t i = 0; i < pValues; i++)
        {
            uint32_t randomValue = std::rand() % pRange;
            result.push_back(randomValue);
        }
        std::sort(result.begin(), result.end());
        return result;
    }
}

// The function to profile, currently complexity O(n^2)
void calculateVectorSums(std::vector<uint32_t> const & pVector1,
                         std::vector<uint32_t> const & pVector2,
                         std::vector<uint8_t> & pResult)
{
    pResult.resize(pVector1.back() + pVector2.back() + 1, 0);
    for (auto nr1 : pVector1)
    {
        for (auto nr2 : pVector2)
        {
            pResult[nr1+nr2] = 1;
        }
    }
}

int main(void)
{
    auto vector1 = populate(2000000, 150000);
    auto vector2 = populate(2000000, 150000);

    // (In the real benchmark this call is wrapped in a loop for averaging the profiling numbers)
    std::vector<uint8_t> output;
    calculateVectorSums(vector1, vector2, output);
}

The bulk of the time is spent in the pResult[nr1+nr2] = 1; line, as expected, and I suspect a large part of the cost comes from cache misses.

I tried to use a bitset as the target buffer, but this was about 2x slower due to the overhead of the bit operations.
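
For reference, a bitset-style variant looks roughly like this (an illustrative sketch using std::vector<bool> as a dynamic bitset; my actual attempt may have differed in details):

// Same loop structure as above, but every entry is a single bit, so each
// write becomes a read-modify-write on a packed word.
void calculateVectorSumsBitset(std::vector<uint32_t> const & pVector1,
                               std::vector<uint32_t> const & pVector2,
                               std::vector<bool> & pResult)
{
    pResult.assign(pVector1.back() + pVector2.back() + 1, false);
    for (auto nr1 : pVector1)
    {
        for (auto nr2 : pVector2)
        {
            pResult[nr1 + nr2] = true;
        }
    }
}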

I have a burning suspicion that reorganising the order in which the additions are done could lead to a faster overall execution. Perhaps there is also a hardware extension (a GPU is available) that could accelerate this calculation, but I have very little experience in that field.
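
To make the reordering idea a bit more concrete, I imagine something like the blocked variant below, which walks pVector1 in small chunks so that, for a given nr2, all writes land in a narrow window of pResult. I have not measured whether this actually helps, and the block size is an arbitrary guess:

// Blocked variant (sketch): the values in a block of 256 consecutive sorted
// entries of pVector1 span a fairly small range, so the writes for each nr2
// stay within a window that should fit in cache.
void calculateVectorSumsBlocked(std::vector<uint32_t> const & pVector1,
                                std::vector<uint32_t> const & pVector2,
                                std::vector<uint8_t> & pResult)
{
    pResult.assign(pVector1.back() + pVector2.back() + 1, 0);
    const size_t blockSize = 256; // untuned
    for (size_t begin = 0; begin < pVector1.size(); begin += blockSize)
    {
        const size_t end = std::min(begin + blockSize, pVector1.size());
        for (auto nr2 : pVector2)
        {
            for (size_t i = begin; i < end; i++)
            {
                pResult[pVector1[i] + nr2] = 1;
            }
        }
    }
}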


Solution

  • I tried to use a bitset as the target buffer, but this was about 2x slower due to the overhead of the bit operations.

    Here's a different bitwise approach: use a target buffer of uint64_t and, instead of setting bits one by one, shift the whole bitvector over (by the value of some element from the other vector) and OR it into the result. This not only gets rid of bit-by-bit writes, it also uses the inherent parallelism of a bitwise operation on a uint64_t to do a bunch of work at the same time (how much depends on the density).

    void calculateVectorSumsBitwise(std::vector<uint32_t> const& pVector1,
        std::vector<uint32_t> const& pVector2,
        std::vector<uint8_t>& pResult)
    {
        pResult.resize(pVector1.back() + pVector2.back() + 1, 0);
        std::vector<uint64_t> vec1;
        // build initial bitvector
        vec1.resize((pVector1.back() >> 6) + 1); // rounded up so that bit pVector1.back() fits
        for (auto nr1 : pVector1)
        {
            vec1[nr1 >> 6] |= uint64_t(1) << (nr1 & 63);
        }
    
        std::vector<uint64_t> tmp;
        tmp.resize(1 + ((pVector1.back() + pVector2.back() + 63) >> 6));
    
        // copy/OR bitvector to all places indicated by pVector2
        for (auto nr2 : pVector2)
        {
            size_t whole_chunk_offset = nr2 >> 6;
            size_t shift_amount = nr2 & 63;
            if (shift_amount) {
                // if the shift count is not zero, shift spreads across two chunks
                for (size_t i = 0; i < vec1.size(); i++) {
                    tmp[i + whole_chunk_offset] |= vec1[i] << shift_amount;
                    tmp[i + whole_chunk_offset + 1] |= vec1[i] >> (64 - shift_amount);
                }
            }
            else {
                // nice case: shift only by whole word
                for (size_t i = 0; i < vec1.size(); i++) {
                    tmp[i + whole_chunk_offset] |= vec1[i];
                }
            }
        }
    
        // decompress
        for (size_t i = 0; i <= pVector1.back() + pVector2.back(); i++)
        {
            pResult[i] = (tmp[i >> 6] & (uint64_t(1) << (i & 63))) != 0;
        }
    }
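
    A quick way to sanity-check the bitwise version against the original on a small input (throwaway snippet, assuming both functions and the populate helper from the question live in the same file):

    #include <cassert>

    void sanityCheck()
    {
        auto v1 = populate(10000, 500);
        auto v2 = populate(10000, 500);
        std::vector<uint8_t> expected, actual;
        calculateVectorSums(v1, v2, expected);
        calculateVectorSumsBitwise(v1, v2, actual);
        assert(expected == actual);
    }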
    

    On the same quickbench as the other answer (second bar): https://quick-bench.com/q/jvcQyP9OwpVtNgRIwmw4pv1HGP0

    quickbench result

    I might try to work out the FFT trick that I mentioned in the comments, but it's complicated.
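
    For the curious, here is a rough sketch of the idea (not a worked-out, tuned version, just an illustration): treat each input as a 0/1 indicator array, convolve the two arrays with an FFT in O(N log N), and every position where the convolution is nonzero is a reachable sum. Double precision should be comfortably enough here, since the convolution values are at most the vector length (~150,000):

    #include <cmath>
    #include <complex>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Plain iterative radix-2 FFT; invert == true computes the inverse transform.
    static void fft(std::vector<std::complex<double>>& a, bool invert)
    {
        const size_t n = a.size(); // must be a power of two
        // bit-reversal permutation
        for (size_t i = 1, j = 0; i < n; i++)
        {
            size_t bit = n >> 1;
            for (; j & bit; bit >>= 1)
                j ^= bit;
            j ^= bit;
            if (i < j)
                std::swap(a[i], a[j]);
        }
        const double pi = std::acos(-1.0);
        for (size_t len = 2; len <= n; len <<= 1)
        {
            const double angle = 2 * pi / double(len) * (invert ? -1 : 1);
            const std::complex<double> wlen(std::cos(angle), std::sin(angle));
            for (size_t i = 0; i < n; i += len)
            {
                std::complex<double> w(1);
                for (size_t j = 0; j < len / 2; j++)
                {
                    const auto u = a[i + j];
                    const auto v = a[i + j + len / 2] * w;
                    a[i + j] = u + v;
                    a[i + j + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
        if (invert)
            for (auto& x : a)
                x /= double(n);
    }

    // Mark every reachable sum via convolution of the two indicator arrays.
    void calculateVectorSumsFFT(std::vector<uint32_t> const& pVector1,
        std::vector<uint32_t> const& pVector2,
        std::vector<uint8_t>& pResult)
    {
        const size_t resultSize = pVector1.back() + pVector2.back() + 1;
        size_t n = 1;
        while (n < resultSize)
            n <<= 1;

        std::vector<std::complex<double>> a(n), b(n);
        for (auto nr1 : pVector1) a[nr1] = 1.0;
        for (auto nr2 : pVector2) b[nr2] = 1.0;

        fft(a, false);
        fft(b, false);
        for (size_t i = 0; i < n; i++)
            a[i] *= b[i];
        fft(a, true);

        pResult.assign(resultSize, 0);
        for (size_t i = 0; i < resultSize; i++)
            pResult[i] = a[i].real() > 0.5 ? 1 : 0;
    }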

    Fun benchmarking story: on my PC, this bitwise version looked ~100x as fast as the original approach, instead of ~2x as fast. I was surprised by the discrepancy between that and the result on quickbench until I remembered something about std::rand, and indeed: on my PC RAND_MAX = 0x7fff, so I was accidentally working with a tiny range of numbers. This bitwise approach benefits massively from a small range, so the benchmark was ruined. Fun!

    This is also why I didn't discover the bugs in the earlier version.
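
    If you want to reproduce the benchmark locally without tripping over RAND_MAX, a populate variant based on <random> covers the full range (a sketch that mirrors the helper from the question):

    #include <algorithm>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Like the original populate(), but the values are not capped at RAND_MAX.
    std::vector<uint32_t> populateFullRange(size_t pRange, size_t pValues)
    {
        std::vector<uint32_t> result;
        result.reserve(pValues);
        std::mt19937 engine(12345); // fixed seed so runs are comparable
        std::uniform_int_distribution<uint32_t> dist(0, uint32_t(pRange) - 1);
        for (size_t i = 0; i < pValues; i++)
        {
            result.push_back(dist(engine));
        }
        std::sort(result.begin(), result.end());
        return result;
    }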