c++ · sorting · hash · bigdata · correctness

Sorting algorithm correctness verification


I am trying to verify the correctness of a sorting algorithm S running over a large array A of at least 4 GB. Assuming S sorts in non-decreasing order, checking only A[i - 1] <= A[i] for 1 <= i < n is not sufficient: the output of S, even though sorted, may contain one or more keys that do not belong to the original A.
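For concreteness, the order check I am referring to is nothing more than the adjacent-pairs scan (using A and n as above):

    // Necessary but not sufficient: this says nothing about whether the
    // sorted keys form the same multiset as the original contents of A.
    bool non_decreasing = true;
    for (std::size_t i = 1; i < n; ++i) {
      if (A[i] < A[i - 1]) { non_decreasing = false; break; }
    }
    // equivalently: std::is_sorted(A, A + n) from <algorithm>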

I can think of at least two trivial ways to test the correctness:

  1. Make a copy of A to A_copy before A is sorted, use std::sort on A_copy, and check A[i] == A_copy[i] for 0 <= i < n after the sort.
  2. Maintain a std::unordered_map to store the frequency of each key in A before the sort, and verify the frequencies after the sort in addition to the non-decreasing order check (sketched after this list).
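
For illustration, approach 2 would look roughly like this (a sketch only; Key stands for the element type and S(A, n) for the sort under test):

    std::unordered_map<Key, std::size_t> freq;          // needs <unordered_map>
    for (std::size_t i = 0; i < n; ++i) ++freq[A[i]];   // count keys before the sort
    S(A, n);                                            // run the sort under test
    bool ok = true;
    for (std::size_t i = 0; i < n && ok; ++i) {
      if (i > 0 && A[i] < A[i - 1]) ok = false;         // order violated
      auto it = freq.find(A[i]);
      if (it == freq.end() || it->second == 0) ok = false;  // key not from original A
      else --it->second;                                // consume one occurrence
    }
    // ok == true  =>  A is non-decreasing and a permutation of its original contents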

There are obvious issues with both approaches. Approach 1 duplicates the entire array, so it costs O(n) additional memory, and running std::sort over 4+ GB of data just to verify is extremely slow. The map in approach 2 should be faster, but it also requires O(n) extra memory if the keys are mostly unique.

My question: is there a better way to perform this correctness check that is both fast and uses only O(1) extra memory?

Thanks.


Solution

  • You can treat your algorithm's data as a message transferred over an unreliable channel, and apply error detection/correction methods. The main difference is that your data leaves its original order, while most error-correcting codes are position-sensitive, though not all of them are.

    One simple solution is to store the XOR of hash(a) for all a in A. This reliably detects a single foreign element, but errors that occur in matching pairs can cancel out: if, for example, the same spurious element was added twice (displacing two copies of some other key), the XOR is unchanged and the check fails to notice.

    std::size_t verification = 0;
    for (const auto& a : A) {
      verification ^= hash(a);   // fold in each key's hash before the sort
    }
    mySort(A);
    for (const auto& a : A) {
      verification ^= hash(a);   // XOR the same hashes back out after the sort
    }

    if (verification != 0) {
      // invalid: the multiset of hashes changed
    } else {
      // (probably) valid: every hash cancelled out
    }
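
    Here hash stands for whatever hash function suits your key type; for standard-library types, std::hash<Key>{}(a) is one readily available option.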
    

    The literature contains many more options for detecting or even correcting errors on communication channels, which you can reuse here. These give you a tunable trade-off between the amount of extra memory you spend and the number of errors you are able to catch.
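
    As one possible strengthening (my own illustration, not a standard construction), keeping a wrapping sum of the hashes alongside the XOR costs one extra word of state but catches some error patterns that cancel in the XOR alone:

    std::uint64_t x = 0, s = 0;        // needs <cstdint>
    for (const auto& a : A) {
      const std::uint64_t h = hash(a);
      x ^= h;                          // order-independent XOR accumulator
      s += h;                          // order-independent sum; unsigned wrap is defined
    }
    mySort(A);
    for (const auto& a : A) {
      const std::uint64_t h = hash(a);
      x ^= h;
      s -= h;
    }
    // Example the XOR alone misses: two copies of key b replaced by two copies
    // of key a leaves x unchanged, but shifts s by 2 * (hash(a) - hash(b)),
    // which is almost always nonzero.
    const bool multiset_probably_unchanged = (x == 0 && s == 0);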