I have two large vectors, each containing a set of numbers. A typical case has about 150,000 numbers in each vector, ranging from 0 to about 2,000,000. I want to calculate, as fast as possible, a third vector that marks all the sums that can be formed by taking one number from the first vector and one number from the second vector.
My current code:
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <vector>

namespace
{
    // This function is purely for benchmarking; the numbers are retrieved
    // from other sources in reality, and have the following properties:
    // - The numbers in the vector are sorted
    // - The numbers are guaranteed unique
    std::vector<uint32_t> populate(size_t pRange, size_t pValues)
    {
        std::vector<uint32_t> result;
        std::srand((unsigned)std::time(0));
        result.reserve(pValues);
        for (size_t i = 0; i < pValues; i++)
        {
            uint32_t randomValue = static_cast<uint32_t>(std::rand() % pRange);
            result.push_back(randomValue);
        }
        std::sort(result.begin(), result.end());
        return result;
    }
}
// The function to profile, currently complexity O(n^2)
void calculateVectorSums(std::vector<uint32_t> const & pVector1,
                         std::vector<uint32_t> const & pVector2,
                         std::vector<uint8_t> & pResult)
{
    pResult.resize(pVector1.back() + pVector2.back() + 1, 0);
    for (auto nr1 : pVector1)
    {
        for (auto nr2 : pVector2)
        {
            pResult[nr1+nr2] = 1;
        }
    }
}
int main(void)
{
    auto vector1 = populate(2000000, 150000);
    auto vector2 = populate(2000000, 150000);
    // Loop for averaging the profiling numbers
    std::vector<uint8_t> output;
    calculateVectorSums(vector1, vector2, output);
}
The bulk of the time is spent in the pResult[nr1+nr2] = 1; line, as expected, and I suspect a large part of the performance loss comes from cache misses.
I tried to use a bitset as target buffer, but this was about 2x slower due to the overhead of the bit operations.
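For reference, the bitset attempt was roughly along the lines of the sketch below (using std::vector<bool> as the packed target; treat it as illustrative rather than the exact code that was benchmarked):

// Sketch of the bitset-style attempt: the same double loop, but writing
// single bits into a packed std::vector<bool> instead of whole bytes.
void calculateVectorSumsBitset(std::vector<uint32_t> const & pVector1,
                               std::vector<uint32_t> const & pVector2,
                               std::vector<bool> & pResult)
{
    pResult.assign(pVector1.back() + pVector2.back() + 1, false);
    for (auto nr1 : pVector1)
    {
        for (auto nr2 : pVector2)
        {
            pResult[nr1 + nr2] = true; // read-modify-write of a single bit
        }
    }
}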
I have a burning suspicion that reorganising the order in which the additions are done could lead to faster overall execution, or perhaps there is a hardware extension (a GPU is available) that could accelerate this calculation, but I have very little experience in that field.
Here's a different bitwise approach, using a target buffer of uint64_t. Instead of setting the result bit-by-bit, it shifts the whole bitvector over (by a distance given by an element of the other vector) and ORs it in. This not only gets rid of bit-by-bit writes, but also uses the inherent parallelism of a bitwise operation on a uint64_t to do a bunch of work at the same time (an amount that depends on the density).
void calculateVectorSumsBitwise(std::vector<uint32_t> const& pVector1,
                                std::vector<uint32_t> const& pVector2,
                                std::vector<uint8_t>& pResult)
{
    pResult.resize(pVector1.back() + pVector2.back() + 1, 0);
    // Build the initial bitvector: one bit per possible value in pVector1.
    // The +1 makes room for bit index pVector1.back() itself.
    std::vector<uint64_t> vec1;
    vec1.resize((pVector1.back() >> 6) + 1);
    for (auto nr1 : pVector1)
    {
        vec1[nr1 >> 6] |= uint64_t(1) << (nr1 & 63);
    }
    // Target bitvector: big enough for the largest possible sum, plus one
    // extra chunk for the spill-over of the shifted writes below.
    std::vector<uint64_t> tmp;
    tmp.resize(1 + ((pVector1.back() + pVector2.back() + 63) >> 6));
    // Copy/OR the bitvector to all places indicated by pVector2.
    for (auto nr2 : pVector2)
    {
        size_t whole_chunk_offset = nr2 >> 6;
        size_t shift_amount = nr2 & 63;
        if (shift_amount) {
            // If the shift count is not zero, each chunk spreads across
            // two chunks of the destination.
            for (size_t i = 0; i < vec1.size(); i++) {
                tmp[i + whole_chunk_offset] |= vec1[i] << shift_amount;
                tmp[i + whole_chunk_offset + 1] |= vec1[i] >> (64 - shift_amount);
            }
        }
        else {
            // Nice case: shift by a whole number of chunks only.
            for (size_t i = 0; i < vec1.size(); i++) {
                tmp[i + whole_chunk_offset] |= vec1[i];
            }
        }
    }
    // Decompress the bits back into the byte-per-sum result format.
    for (size_t i = 0; i <= pVector1.back() + pVector2.back(); i++)
    {
        pResult[i] = (tmp[i >> 6] & (uint64_t(1) << (i & 63))) != 0;
    }
}
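A minimal harness for comparing the two versions, assuming the populate and calculateVectorSums from the question are in scope (a sketch only, not the exact quick-bench setup):

int main(void)
{
    auto vector1 = populate(2000000, 150000);
    auto vector2 = populate(2000000, 150000);
    std::vector<uint8_t> outputNaive;
    std::vector<uint8_t> outputBitwise;
    calculateVectorSums(vector1, vector2, outputNaive);
    calculateVectorSumsBitwise(vector1, vector2, outputBitwise);
    // Sanity check: both versions should mark exactly the same sums.
    return outputNaive == outputBitwise ? 0 : 1;
}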
On the same quickbench as the other answer (second bar): https://quick-bench.com/q/jvcQyP9OwpVtNgRIwmw4pv1HGP0
I might try to work out the FFT trick that I mentioned in the comments, but it's complicated.
Fun benchmarking story: on my PC, this bitwise version initially looked ~100x as fast as the original approach, instead of ~2x as fast. I was surprised by the discrepancy between that and the result on quick-bench until I remembered something about std::rand: on my PC, RAND_MAX is 0x7fff, so I was accidentally working with a tiny range of numbers. This bitwise approach benefits massively from a small range, so the benchmark was ruined. Fun!
This is also why I didn't discover the bugs in the earlier version.
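If you want the benchmark to actually cover the full 0..2,000,000 range regardless of RAND_MAX, one option (an aside, not part of the code above, with a hypothetical name) is to generate the values with <random> instead of std::rand, roughly like this:

#include <random>

// Sketch of a populate() variant that is not limited by RAND_MAX.
std::vector<uint32_t> populateFullRange(size_t pRange, size_t pValues)
{
    std::vector<uint32_t> result;
    result.reserve(pValues);
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<uint32_t> dist(0, static_cast<uint32_t>(pRange - 1));
    for (size_t i = 0; i < pValues; i++)
    {
        result.push_back(dist(gen));
    }
    std::sort(result.begin(), result.end());
    return result;
}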