I implemented an algorithm which used sorting. I tried Thrust::sort_by_key that took around 0.4s to sort an array with 10^7 elements.
I thought bitonic sorting network should be faster than Thrust::sort_by_key. However, bitonic sorting took about 2.5s to sort the same array mentioned above. I used the bitonic sorting network provided by SDK. I just modified the original bitonic sort a little bit.
Could you tell me why? or give me some advice?
Thanks,
Yik
Aug, 15, 2011
The short answer is that the bitonic sorting example provided by the CUDA SDK is primarily meant to be pedagogical. It simply isn't as optimized as Thrust's sorting implementation, which is based on the very efficient radix sort provided by Merrill.
The asymptotic performance of the two algorithms is also different. A bitonic sorting network has O(n lg^2 n)
complexity, while the radix sort employed by Thrust has something more like O(#bits n)
complexity, where #bits
denotes the number of bits required to represent the input data. In other words, the radix sort requires asymptotically fewer global memory reads and writes than does the bitonic sort.
You might find that for smaller workloads, bitonic sort performs better.