I was working on sorting networks (for arrays smaller than 8 elements) and noticed that all the descriptions of the algorithm focus on its ability to run comparisons in parallel. Here is one such network for an array of size 6 (indices 0 through 5).
#define SWAP(x,y) do { if (data[y] < data[x]) { long tmp = data[x]; data[x] = data[y]; data[y] = tmp; } } while (0)
//Parallelizable
SWAP(1, 2);
SWAP(4, 5);
//Parallelizable
SWAP(0, 2);
SWAP(3, 5);
//Parallelizable
SWAP(0, 1);
SWAP(3, 4);
SWAP(2, 5);
//Parallelizable
SWAP(0, 3);
SWAP(1, 4);
//Parallelizable
SWAP(2, 4);
SWAP(1, 3);
//Parallelizable
SWAP(2, 3);
I was working with long int arrays (so each element is 8 bytes in size). Is there an easy way to parallelize these operations in C? Are there any hardware-specific instructions I can use to achieve this (SIMD, x86 assembly, etc.)?
As explained by this answer to a question about sorting small collections, you can make the swap itself faster by changing its definition to the following one:
#define SWAP(x, y) { \
int dx = data[x]; \
data[x] = dx < data[y] ? dx : data[y]; \
data[y] ^= dx ^ data[x]; \
}
According to the research paper Applying Sorting Networks to Synthesize Optimized Sorting Libraries, this version of SWAP
is branchless and compiles down to a mere 5 instructions on GCC or Clang at a decent optimization level. The paper also suggests that such a low instruction count lets the code benefit from instruction-level parallelism.
If xor
does not work for the types to be sorted, you can use an alternative version of SWAP
that uses two conditionals instead of one, which should be almost as fast as the xor
version. I actually use this trick in a sorting library of mine: sorting a small fixed-size collection of integers with sorting networks went from « not really better than insertion sort » to « several times faster than insertion sort » once I introduced it. On my computer, sorting a collection of 8 integers is ~5 times faster with sorting networks than with insertion sort.