There are many 4 element, 6-element(input) ... 16-input sorting networks but I need a 32 input version to have a 32x32 shear-sort algorithm(which I plan as an Opencl auxiliary function) to have a 1024x1024 shear-sort opencl algorithm. How can I derive my 32-input sorting network?
or just found by trial and error?
Input array: 1M elements ----> 1024x1024 2D matrix with inverted odd-rows (shear)
each row(1024) of matrix --------> 32 x 32 2D matrix (shear)
32 element row ---------> Sorting (network)
Each thread computes one row of 1024 elements. So only 1024 threads for 1M element array.
The non-divergent comparison I plan to use in the network is:
if(a>b) // where a and b are between 0 and 16M
swap(a,b)
becomes
a0=a; b0=b; // saving
c = a-b
d = !(sign bit of c) (0 for negative, 1 for positive)
tmp=b*d; //tmp=a if a>b otherwise 0
a=a*d //a=b if a>b otherwise 0
b=tmp*d; //b=tmp if a>b otherwise 0
// a0 is backup of a, b0 is backup of b
e = (sign bit of c) (1 for negative, 0 for positive)
tmp0=a0*e; //tmp0=a0 if a0<=b0 otherwise 0
a0=b0*e //a0=b0 if a0<=b0 otherwise 0
b0=tmp0*e; //b0=tmp0 if a0<=b0 otherwise 0
aOut=a+a0; // only a or a0 can be different than zero
bOut=b+b0; // only b or b0 can be different than zero
Im sure this is not fastest one but I need to make a quick easy sorting to try on my particle constraint solver which screams sorting for fixed spatial indexing (grid), I have 1M particles and trying a shear of shear of network sorting.
To validate the shear sorting, I implemented 32-input sorting serial bitonic sorter per thread basis to build 32x32 matrix each column and row sorted. So 32x32 = 1024 element sorting took 9 ms and this is too slow for 32 cores @ 700 MHz.
1024 element sorting takes 9 ms and at least 20 iterations will be needed after each 1024 sorting to sort 1M array. Even if it gets 90 ms, this is too slow for just keys. There will be many values bound to keys.(more than 100)
Tried bubblesort in place of bitonic and got 10ms so the problem must be in the shear sort implementation may be?
Currently, the sorting network known to use the least compare-exchange units to sort 32 elements works as follows: sort the first 16 elements with the most efficient sorting network of size 16, do the same for the following 16 elements, then use the merge step from Batcher's odd-even mergesort. Basically, it gives the following pairs of compare-exchange units:
Sort the first half of the array:
[0,1],[2,3],[4,5],[6,7],[8,9],[10,11],[12,13],[14,15],
[0,2],[4,6],[8,10],[12,14],[1,3],[5,7],[9,11],[13,15],
[0,4],[8,12],[1,5],[9,13],[2,6],[10,14],[3,7],[11,15],
[0,8],[1,9],[2,10],[3,11],[4,12],[5,13],[6,14],[7,15],
[5,10],[6,9],[3,12],[13,14],[7,11],[1,2],[4,8],
[1,4],[7,13],[2,8],[11,14],[5,6],[9,10],
[2,4],[11,13],[3,8],[7,12],
[6,8],[10,12],[3,5],[7,9],
[3,4],[5,6],[7,8],[9,10],[11,12],
[6,7],[8,9]
Sort the second half of the array:
[16,17],[18,19],[20,21],[22,23],[24,25],[26,27],[28,29],[30,31],
[16,18],[20,22],[24,26],[28,30],[17,19],[21,23],[25,27],[29,31],
[16,20],[24,28],[17,21],[25,29],[18,22],[26,30],[19,23],[27,31],
[16,24],[17,25],[18,26],[19,27],[20,28],[21,29],[22,30],[23,31],
[21,26],[22,25],[19,28],[29,30],[23,27],[17,18],[20,24],
[17,20],[23,29],[18,24],[27,30],[21,22],[25,26],
[18,20],[27,29],[19,24],[23,28],
[22,24],[26,28],[19,21],[23,25],
[19,20],[21,22],[23,24],[25,26],[27,28],
[22,23],[24,25]
Odd-even merge the two halves of the array:
[0, 16],
[8, 24],
[8, 16],
[4, 20],
[12, 28],
[12, 20],
[4, 8],
[12, 16],
[20, 24],
[2, 18],
[10, 26],
[10, 18],
[6, 22],
[14, 30],
[14, 22],
[6, 10],
[14, 18],
[22, 26],
[2, 4],
[6, 8],
[10, 12],
[14, 16],
[18, 20],
[22, 24],
[26, 28],
[1, 17],
[9, 25],
[9, 17],
[5, 21],
[13, 29],
[13, 21],
[5, 9],
[13, 17],
[21, 25],
[3, 19],
[11, 27],
[11, 19],
[7, 23],
[15, 31],
[15, 23],
[7, 11],
[15, 19],
[23, 27],
[3, 5],
[7, 9],
[11, 13],
[15, 17],
[19, 21],
[23, 25],
[27, 29],
[1, 2],
[3, 4],
[5, 6],
[7, 8],
[9, 10],
[11, 12],
[13, 14],
[15, 16],
[17, 18],
[19, 20],
[21, 22],
[23, 24],
[25, 26],
[27, 28],
[29, 30]
I generated the previous indices pairs with the oddeven_merge
algorithm as given on the Wikipedia page. I can't guarantee that it will be faster than what you already have, but it will at least lower the number of compare-exchange units from 191 (with Batcher's odd-even mergesort) to 185. I have read research papers on the matter and it seems that we currently don't know sorting networks with less than 185 comparators to sort 32 elements.