[SOLVED] Finding closest pair of points in the plane with non-distinct x-coordinates in O(nlogn)

Finding closest pair of points in the plane with non-distinct x-coordinates in O(nlogn)

Most of the implementations of the algorithm to find the closest pair of points in the plane that I've seen online have one of two deficiencies: either they fail to meet an O(nlogn) runtime, or they fail to accommodate the case where some points share an x-coordinate. Is a hash map (or equivalent) required to solve this problem optimally?

Roughly, the algorithm in question is (per CLRS Ch. 33.4):

For an array of points P, create additional arrays X and Y such that X contains all points in P, sorted by x-coordinate and Y contains all points in P, sorted by y-coordinate.
Divide the points in half - drop a vertical line so that you split X into two arrays, X_L and X_R, and divide Y similarly, so that Y_L contains all points left of the line and Y_R contains all points right of the line, both sorted by y-coordinate.
Make recursive calls for each half, passing X_L and Y_L to one and X_R and Y_R to the other, and finding the minimum distance, d in each of those halves.
Lastly, determine if there's a pair with one point on the left and one point on the right of the dividing line with distance smaller than d; through a geometric argument, we find that we can adopt the strategy of just searching through the next 7 points for every point within distance d of the dividing line, meaning the recombination of the divided subproblems is only an O(n) step (even if it looks n² at first glance).

This has some tricky edge cases. One way people deal with this is sorting the strip of points of distance d from the dividing line at every recombination step (e.g. here), but this is known to result in an O(nlog²n) solution.

Another way people deal with edge cases is by assuming each point has a distinct x-coordinate (e.g. here): note the snippet in closestUtil which adds to Pyl (or Y_L as we call it) if the x-coordinate of a point in Y is <= the line, or to Pyr (Y_R) otherwise. Note that if all points lie on the same vertical line, this would result us writing past the end of the array in C++, as we write all n points to Y_L.

So the tricky bit when points can have the same x-coordinate is dividing the points in Y into Y_L and Y_R depending on whether a point p in Y is in X_L or X_R. The pseudocode in CLRS for this is (edited slightly for brevity):

for i = 1 to Y.length
    if Y[i] in X_L
        Y_L.length = Y_L.length + 1;
        Y_L[Y_L.length] = Y[i]
    else Y_R.length = Y_R.length + 1;
        Y_R[Y_R.length] = Y[i]

However, absent of pseudocode, if we're working with plain arrays, we don't have a magic function that can determine whether Y[i] is in X_L in O(1) time. If we're assured that all x-coordinates are distinct, sure - we know that anything with an x-coordinate less than the dividing line is in X_L, so with one comparison we know what array to partition any point p in Y into. But in the case where x-coordinates are not necessarily distinct (e.g. in the case where they all lie on the same vertical line), do we require a hash map to determine whether a point in Y is in X_L or X_R and successfully break down Y into Y_L and Y_R in O(n) time? Or is there another strategy?

Solution

Yes, there are at least two approaches that work here.

The first, as Bing Wang suggests, is to apply a rotation. If the angle is sufficiently small, this amounts to breaking ties by y coordinate after comparing by x, no other math needed.

The second is to adjust the algorithm on G4G to use a linear-time partitioning algorithm to divide the instance, and a linear-time sorted merge to conquer it. Presumably this was not done because the author valued the simplicity of sorting relative to the previously mentioned algorithms in most programming languages.