[SOLVED] Floyd–Rivest vs. Introselect algorithm performance

Floyd–Rivest vs. Introselect algorithm performance

Google couldn't help me so here it goes: which of the two selection algorithms, FloydRivest algorithm and Introselect, has better performance.

I'm assuming It's FloydRivest algorithm, but want to be 100% sure.

Also, if there exist even better algorithm for this purpose, I'd be glad to hear about them.

Solution

TLDR; I think Floyd-Rivest is better

I recently did some research on selection algorithms for a project I am working on. Here is a basic description of each algorithm:

Introselect: Performs a bipartitioning of the data, with a single pivot. Initially, a simple pivot (e.g. middle, median-of-3, etc) is chosen. The simple pivot is usually O(n^2) in the worst case, but O(n) on average. If the recursion level goes above a certain threshold, we fallback to a median-of-medians pivot. This is more costly to compute, but guarantees O(n) worst case.
Floyd-Rivest: Performs a quintary partition of the data, with two pivots. (Technical note: The partitioning starts out as quintary, and finally resolves to three partitions once the algorithm completes). The two pivots are chosen so that the kth element lies between them, with high probability (this involves randomly sampling the data, and selecting two elements, through recursion, above and below what would be the nth-element). When the size of the partition becomes small enough, we select the pivots using a simpler method (e.g. median-of-3, etc)

As you can see, both are quite similar. Introselect starts with simple pivots, falling back to a complicated one; the Floyd-Rivest algorithm does just the opposite. The main difference is that introselect uses median-of-medians, whereas Floyd-Rivest uses a recursive sampling technique. So, I think a better comparison is median-of-medians vs Floyd-Rivest.

Which is better? From my research, it appears the hidden constants for Floyd-Rivest are smaller than median-of-medians. If I remember correctly, median-of-medians requires something like 5n comparisons (worst case), whereas Floyd-Rivest only needs 3.5n. Floyd-Rivest also uses a quintary scheme, which is better when the data can have lots of duplicates. Both introselect and Floyd-Rivest reduce to the same algorithm for small inputs, so you should get similar performance there (so long as you implement them the same). In my tests, Floyd-Rivest was 20%+ faster than all the other selection algorithms I tried. Though, I must admit, I did not test against a proper implementation of introselect that falls back to median-of-medians (I just tested the pseudo-introselect of libstdc++). In the original Floyd-Rivest paper, they themselves (who were co-authors of the median-of-medians approach) said median-of-medians "is hardly practical", and that the Floyd-Rivest algorithm was "probably the best practical choice".

So, it seems to me, Floyd-Rivest's pivoting technique is better than median-of-medians. You could implement introselect using Floyd-Rivest's pivoting, but then you might as well just do a pure Floyd-Rivest algorithm. I would recommend Floyd-Rivest as the best selection method.

Be warned! The original Floyd-Rivest paper gives an example implementation of their algorithm (this is the implementation listed on Wikipedia, at the time of writing this). However, this is a simplified version. From my tests, the simplified version is actually pretty slow! If you want a fast implementation, I think you'll need to implement the full algorithm. I recommend reading the paper "On Floyd and Rivest's SELECT algorithm" by Krzysztof C. Kiwiel. He gives a pretty good description of how to implement a fast Floyd-Rivest selection.