I am trying to implement a multithreaded version of bitonic sort on the CPU, using C++. At the moment the best I can get with this implementation is a speedup of around 4.3 (sorting a 128 MB array), even when I am using 16 threads. The code I used is the following:
void compareAndSwap(std::vector<uint32_t>& paddedValues, unsigned int threadId,
                    unsigned int chunkSize, unsigned int mergeStep, unsigned int bitonicSequenceSize)
{
    unsigned int startIndex = threadId * chunkSize;
    unsigned int endIndex = (threadId + 1) * chunkSize;

    // Process the chunk assigned to this thread
    for (unsigned int currentIndex = startIndex; currentIndex < endIndex; currentIndex++)
    {
        // Find the element to compare with
        unsigned int compareIndex = currentIndex ^ mergeStep;

        // Only compare if the compareIndex is greater (to avoid duplicate swaps)
        if (compareIndex > currentIndex)
        {
            bool shouldSwap = false;

            // Determine if we should swap based on the current subarray's sorting direction
            if ((currentIndex & bitonicSequenceSize) == 0) // First half of subarray (ascending)
            {
                shouldSwap = (paddedValues[currentIndex] > paddedValues[compareIndex]);
            }
            else // Second half of subarray (descending)
            {
                shouldSwap = (paddedValues[currentIndex] < paddedValues[compareIndex]);
            }

            // Perform the swap if necessary
            if (shouldSwap)
            {
                std::swap(paddedValues[currentIndex], paddedValues[compareIndex]);
            }
        }
    }
}
void bitonicSort(uint32_t values[], unsigned int arrayLength, unsigned int numThreads, int sortOrder)
{
    // Step 1: Pad the array to the next power of 2
    unsigned int paddedLength = 1 << static_cast<int>(std::ceil(std::log2(arrayLength)));
    std::vector paddedValues(paddedLength, std::numeric_limits<uint32_t>::max());
    std::copy(values, values + arrayLength, paddedValues.begin());

    // Step 2: Determine chunk size for each thread
    unsigned int chunkSize = paddedLength / numThreads;

    // Step 3: Iteratively build and merge bitonic sequences
    // Outer loop: controls the size of bitonic sequences
    for (unsigned int bitonicSequenceSize = 2; bitonicSequenceSize <= paddedLength; bitonicSequenceSize *= 2)
    {
        // Middle loop: controls the size of sub-sequences being merged
        for (unsigned int mergeStep = bitonicSequenceSize / 2; mergeStep > 0; mergeStep /= 2)
        {
            // Step 4: Use multiple threads to compare and swap elements in parallel
            std::vector<std::thread> threads;
            threads.reserve(numThreads);

            // Thread creation loop
            for (unsigned int threadId = 0; threadId < numThreads; threadId++)
            {
                threads.emplace_back(compareAndSwap,
                                     std::ref(paddedValues),
                                     threadId,
                                     chunkSize,
                                     mergeStep,
                                     bitonicSequenceSize);
            }

            // Wait for all threads to complete this stage
            for (auto& thread : threads)
            {
                thread.join();
            }
        }
    }

    // Step 5: Copy back the sorted values
    std::copy(paddedValues.begin(), paddedValues.begin() + arrayLength, values);

    // Step 6: If descending order is required, reverse the array
    if (sortOrder == 0)
    {
        std::reverse(values, values + arrayLength);
    }
}
The problem seems to be the high number of cache misses (I verified it with AMD uProf; I am using a Ryzen 9 7940HS to perform the tests). In particular, the number of misses grows as I use more threads, and that seems to nullify the potential benefit of the additional threads. How could I solve this problem?
Sorting relatively large arrays in parallel is generally memory-bound. This is especially true for a bitonic sort on large arrays, because it iterates many times over the same memory chunks, which cannot stay in cache once the merged sub-sequences become too big.
This problem is not specific to bitonic sort but affects most comparison-based algorithms. For example, quicksort (which is cache-friendly) behaves better in this respect, but is still far from great. This is what prevents most sorting algorithms from scaling (at least when the comparison operator is a cheap one). You can test this hypothesis by experimentally measuring the DRAM throughput of your program and comparing it to the (practical) bandwidth of your machine (the famous Stream Triad benchmark can measure that, like a lot of other software nowadays). IDK if there are hardware counters for that on your Zen 4 CPU though (AMD Zen 2 AFAIK cannot directly measure that while Intel Skylake can).
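If you do not want to set up a full benchmark, a rough do-it-yourself check is to time a Triad-like kernel and compare the resulting throughput to what your sort achieves. Here is a minimal sketch of that idea (the array size and the two-run warm-up are arbitrary choices; compile with OpenMP enabled, e.g. -fopenmp):

#include <chrono>
#include <cstdio>
#include <vector>

// Rough Stream-Triad-like kernel: a[i] = b[i] + scalar * c[i].
// Its measured throughput approximates the practical DRAM bandwidth,
// which can then be compared to the throughput of the sorting code.
int main()
{
    const long long n = 1LL << 26; // 64 Mi doubles per array (~1.5 GiB total); adjust to your RAM
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;

    for (int run = 0; run < 2; run++) // the first run warms up caches/pages; trust the second result
    {
        const auto start = std::chrono::steady_clock::now();
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
        const auto end = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(end - start).count();
        const double bytes = 3.0 * double(n) * sizeof(double); // 2 loads + 1 store per element
        std::printf("Triad throughput: %.1f GiB/s\n", bytes / seconds / (1024.0 * 1024.0 * 1024.0));
    }
}

If the sort already moves data at a rate close to this number, adding threads cannot help much.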
For large arrays, the best solution is simply to put items into buckets in 1 or 2 passes, so as to avoid most memory-bound steps. Then, you can sort the buckets independently in parallel. This combination of sorting algorithms results in one of the most efficient parallel sorting algorithms (for CPU) I have designed so far. This assumes items can be scattered into buckets without comparing them (which is fine for native types like integers and floating-point numbers with trivial comparison operators). You can find an efficient parallel C++ implementation here (meant to be used from a Python script). The sort of the blocks is done using an external (x86-specific) library, and AFAIK it internally uses an aggressively-optimized SIMD-aware bitonic sort.
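To make the idea concrete, here is a strongly simplified sketch (not the linked implementation) of the bucket-then-sort scheme for 32-bit keys; the bucket count, the single sequential scatter pass and the per-bucket std::sort are arbitrary choices of mine:

#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified "bucket first, then sort buckets in parallel" sketch.
// Keys are scattered by their top bits, so each bucket covers a disjoint,
// increasing key range and the final concatenation is already globally sorted.
void bucketThenSort(std::vector<uint32_t>& values, unsigned int bucketBits = 8)
{
    const std::size_t bucketCount = std::size_t(1) << bucketBits;
    const unsigned int shift = 32 - bucketBits;

    // Step 1: one (memory-bound) pass to scatter the items into buckets.
    std::vector<std::vector<uint32_t>> buckets(bucketCount);
    for (uint32_t v : values)
        buckets[v >> shift].push_back(v);

    // Step 2: sort each bucket independently; buckets fit in cache far more
    // often than the whole array, and the work parallelizes trivially.
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < (int)bucketCount; b++)
        std::sort(buckets[b].begin(), buckets[b].end());

    // Step 3: concatenate the buckets back (bucket order == key order).
    std::size_t pos = 0;
    for (const auto& bucket : buckets)
        for (uint32_t v : bucket)
            values[pos++] = v;
}

The point is that the expensive global pass over DRAM happens only once (or twice), instead of the O(log² n) global passes of a bitonic sort, while the per-bucket sorts mostly run from cache.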
Note that bitonic sort is good for sorting small arrays, especially because it is SIMD-friendly. It is great (for native types) on CPUs once optimized to benefit from SIMD instructions, and even better on GPUs (as long as the arrays to be sorted are not big). For large arrays, the complexity of bitonic sort is pretty bad compared to an introsort (or a radix sort). This is why people tend not to use it in this case (even on GPUs, as long as comparisons are not required).
By the way, you tagged the question openmp, but I see no OpenMP directive in the code. OpenMP can make your code simpler. For example, you can create threads with just #pragma omp parallel for. This also avoids recreating threads over and over, which is not efficient. You can do a global thread synchronization easily with #pragma omp barrier if needed. You can also write a task-based implementation (though probably not needed here) with #pragma omp task and #pragma omp taskwait.
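For instance, in your bitonicSort the whole thread-creation/join part of step 4 could collapse into a single directive. A sketch based on your code (assuming the file is compiled with OpenMP enabled):

// Same outer loops as in your bitonicSort; only step 4 changes.
for (unsigned int bitonicSequenceSize = 2; bitonicSequenceSize <= paddedLength; bitonicSequenceSize *= 2)
{
    for (unsigned int mergeStep = bitonicSequenceSize / 2; mergeStep > 0; mergeStep /= 2)
    {
        // The OpenMP runtime reuses its thread pool instead of creating and
        // joining std::thread objects for every (bitonicSequenceSize, mergeStep) pair.
        #pragma omp parallel for schedule(static) num_threads(numThreads)
        for (int threadId = 0; threadId < (int)numThreads; threadId++)
            compareAndSwap(paddedValues, threadId, chunkSize, mergeStep, bitonicSequenceSize);
        // The implicit barrier at the end of the parallel for replaces the join loop.
    }
}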
AFAIK high-end AMD Zen CPUs can have multiple NUMA nodes (IDK for your specific CPU). This means that the parallel sorting algorithm should consider NUMA effects. Here, this can be done by:
- initializing the array in parallel (e.g. with schedule(static) in OpenMP);
- processing it in parallel with the same distribution (e.g. with schedule(static) in OpenMP, with the same number of threads and thread binding).
Sequential initialisation tends to store data on only 1 NUMA node, preventing parallel algorithms from scaling (even more). Indeed, it tends to saturate one specific (integrated) memory controller and its associated DRAM while the other ones are left (mostly) unused (see the sketch below).
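Here is a sketch of that first-touch idea, using a hypothetical helper makePaddedValues that is not in your code: the padding/copy is done in parallel with the same schedule(static) distribution that the later compare-and-swap passes should use, so each memory page tends to be mapped on the NUMA node of the threads that will actually work on it.

#include <cstdint>
#include <limits>
#include <vector>

// NUMA-friendly (first-touch) initialization sketch: write the padded array in
// parallel with the same static distribution as the later parallel passes.
std::vector<uint32_t> makePaddedValues(const uint32_t values[], unsigned int arrayLength,
                                       unsigned int paddedLength, unsigned int numThreads)
{
    std::vector<uint32_t> paddedValues(paddedLength); // caveat: this zero-fill already touches pages sequentially

    #pragma omp parallel for schedule(static) num_threads(numThreads)
    for (long long i = 0; i < (long long)paddedLength; i++)
    {
        paddedValues[i] = (i < (long long)arrayLength)
                              ? values[i]
                              : std::numeric_limits<uint32_t>::max();
    }
    return paddedValues;
}

Note that the std::vector constructor already zero-fills (and therefore first-touches) the memory sequentially, so a real NUMA-aware version would need raw allocation or a no-initialization allocator; the sketch only shows the matching-distribution idea, and the compare-and-swap loops must then be distributed with the same static schedule and thread binding.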
You can check how many NUMA nodes you have with numactl --hardware on Linux.