I implemented Median of medians selection algorithm based on algs4 quickselect using the Wikipedia article, but my code doesn't work well:
1) it is said that median of medians finds kth largest element. However, my code finds kth smallest element.
2) my implementation runs 1-20 times slower than quickselect, but the median of medians algorithm should be asymptotically faster.
I've checked everything several times, but I cannot find the issue.
public class MedianOfMedians {
public static Comparable medianOfMedians(Comparable[] nums, int k) {
return nums[select(nums, 0, nums.length - 1, k)];
}
private static int select(Comparable[] nums, int lo, int hi, int k) {
while (lo < hi) {
int pivotIndex = pivot(nums, lo, hi);
int j = partition(nums, lo, hi, pivotIndex);
if (j < k) {
lo = j + 1;
} else if (j > k) {
hi = j - 1;
} else {
return j;
}
}
return lo;
}
private static int pivot(Comparable[] list, int left, int right) {
// for 5 or less elements just get median
if (right - left < 5) {
return partition5(list, left, right);
}
// otherwise move the medians of five-element subgroups to the first n/5 positions
for (int i = left; i <= right; i += 5) {
// get the median of the i'th five-element subgroup
int subRight = i + 4;
if (subRight > right) {
subRight = right;
}
int median5 = partition5(list, i, subRight);
exch(list, median5, (int) (left + Math.floor((i - left) / 5d)));
}
// compute the median of the n/5 medians-of-five
return select(list,
left,
(int) (left + Math.ceil((right - left) / 5d) - 1),
(int) (left + (right - left) / 10d));
}
private static int partition5(Comparable[] list, int lo, int hi) {
for (int i = lo; i <= hi; i++) {
for (int j = i; j > lo; j--) {
if (less(list[j - 1], list[j])) {
exch(list, j, j - 1);
}
}
}
return (hi + lo) / 2;
}
private static int partition(Comparable[] a, int lo, int hi, int pivotIndex) {
exch(a, lo, pivotIndex);
int i = lo;
int j = hi + 1;
Comparable v = a[lo];
while (true) {
while (less(a[++i], v) && i != hi) { }
while (less(v, a[--j]) && j != lo) { }
if (j <= i) break;
exch(a, i, j);
}
exch(a, j, lo);
return j;
}
private static void exch(Comparable[] nums, int i, int j) { }
private static boolean less(Comparable v, Comparable w) { }
}
JUnit test:
public class MedianOfMediansTest {
private final static int TESTS_COUNT = 100;
@org.junit.Test
public void test() {
// generate TESTS_COUNT arrays of 10000 entries from 0..Integer.MAX_VALUE
Integer[][] tests = generateTestComparables(TESTS_COUNT, 10000, 10000, 0, Integer.MAX_VALUE);
for (int i = 0; i < tests.length; i++) {
Integer[] array1 = Arrays.copyOf(tests[i], tests[i].length);
Integer[] array2 = Arrays.copyOf(tests[i], tests[i].length);
Integer[] array3 = Arrays.copyOf(tests[i], tests[i].length);
long time = System.nanoTime();
final int a = (Integer) MedianOfMedians.medianOfMedians(array1, 0);
long nanos_a = System.nanoTime() - time;
time = System.nanoTime();
final int b = (Integer) Quick.select(array2, 0);
long nanos_b = System.nanoTime() - time;
time = System.nanoTime();
Arrays.sort(array3);
final int c = array3[0];
long nanos_c = System.nanoTime() - time;
System.out.println("MedianOfMedians: " + a + " (" + nanos_a + ") " +
"QuickSelect: " + b + " (" + nanos_b + ") " +
"Arrays.sort: " + c + " (" + nanos_c + ")");
System.out.println(((double) nanos_a) / ((double) nanos_b));
Assert.assertEquals(c, a);
Assert.assertEquals(b, a);
}
}
public static Integer[][] generateTestComparables(int numberOfTests,
int arraySizeMin, int arraySizeMax,
int valueMin, int valueMax) {
Random rand = new Random(System.currentTimeMillis());
Integer[][] ans = new Integer[numberOfTests][];
for (int i = 0; i < ans.length; i++) {
ans[i] = new Integer[randInt(rand, arraySizeMin, arraySizeMax)];
for (int j = 0; j < ans[i].length; j++) {
ans[i][j] = randInt(rand, valueMin, valueMax);
}
}
return ans;
}
public static int randInt(Random rand, int min, int max) {
return (int) (min + (rand.nextDouble() * ((long) max - (long) min)));
}
}
1) it is said that median of medians finds kth largest element. However, my code finds kth smallest element.
This is not strictly true. Any selection algorithm can find either smallest or largest element because that's essentially the same task. It depends on how you compare elements and how you partition them (and you can always do something like length - 1 - result
later). Your code indeed seems to find the kth smallest element, which is by the way the most typical and intuitive way of implementing a selection algorithm.
2) my implementation runs 1-20 times slower than quickselect, but the median of medians algorithm should be asymptotically faster.
Not just asymptotically faster. Asymptotically faster in the worst case. In the average case, both are linear, but MoM has higher constant factors. Since you generate your tests randomly, you are very unlikely to hit the worst case. If you used randomized quickselect, then for any input it's unlikely to hit the worst case, otherwise the probability will depend on the pivot selection algorithm used.
With that in mind, and the fact that median of medians has high constant factors, you should not expect it to perform better than quickselect! It might outperform sorting, though, but even then—those logarithmic factors in sorting aren't that large for small inputs (lg 10000 is about 13-14).
Take my MoM solution for a LeetCode problem, for example. Arrays.sort
sometimes outperforms MoM for arrays with 500 million elements. In the best case it runs about twice faster, though.
Therefore, MoM is mostly of theoretical interest. I could imagine a practical use case when you need 100% guarantee of not exceeding some time limit. Say, some real-time system on an aircraft, or spacecraft, or nuclear reactor. The time limit is not very tight, but exceeding it even by one nanosecond is catastrophic. But it's an extremely contrived example, and I doubt that it's actually the way it works.
Even if you can find a practical use case for MoM, you can probably use something like Introselect instead. It essentially starts with quickselect, and then switches to MoM if things don't look good. But testing it would be a nightmare—how would you come up with a test that actually forces the algorithm to switch (and therefore test the MoM part), especially if it's randomized?
Your code looks fine overall, but I'd make some helper methods package-private or even moved them to another class to test separately because such things are very hard to get right. And you may not notice the effect if the result is right. I'm not sure that your groups-of-five code is 100% correct, for example. Sometimes you use right - left
where I'd expect to see element count, which should be right - left + 1
.
Also, I would replace those ceil/floor calls with pure integer arithmetic equivalents. That is, Math.floor((i - left) / 5d))
=> (i - left) / 5
, Math.ceil((right - left) / 5d)
=> (right - left + 4) / 5
(this is the part where I don't like the right - left
thing, by the way, but I'm not sure if it's wrong).