I've familiarized myself with quickselect and median-of-medians for fast selection of the k-th element in an unsorted array. If you try hard enough, you can guarantee the worst case time complexity to lie in O(n).
My problem is a little different. I'd like to select the k-th number from an unsorted array which contains a very large amount of unpredictable duplicates. What I wonder, is whether there's an approach which is both memory and time efficient with respect to the amount of unique values u
as opposed to the total size of the input n
. The catch is that sometimes u << n
and sometimes u ~ n
. (In practice, u
is almost constant, while n
fluctuates heavily.)
Bad approach 1 (excuse my python pseudocode, the problem is not related to python specifically):
input = ...
k = ...
m = hashmap()
for value in input:
if value exists in m:
m[value] = m[value] + 1
else:
m[value] = 1
cumulative_sum = 0
for unique_value in ordered(m):
cumulative_sum += m[unique_value]
if cumulative_sum > k:
return unique_value
This is currently my baseline. What I don't like about it is that ordering or keeping m
ordered using comparison takes O(u*logu)
time.
Bad approach 2:
input = ...
k = ...
M = some_value
assert type(input) == integral
assert min(input) == 0
assert max(input) == M
a = array(size=M+1, default_value=0)
for value in input:
m[value] = m[value] + 1
cumulative_sum = 0
for i in range(M+1):
cumulative_sum += m[i]
if cumulative_sum > k:
return i
This is obviously bad, because it takes O(M)
time and O(M)
space as well.
Is there any good way to update quickselect (or do something else entirely) to solve the problem in O(u)
time and O(u)
space?
As @kcsquared noted, if the input array is given as-is, there is no way to break the Omega(n)
time limit. Does anything change if the input is in format [(v1, c1), (v2, c2), ..., (vn, cn)]
, where (v, c)
corresponds to one unique value; v
being the value and c
being the number of its occurences in the original input?
For memory, yes.
Create a hash mapping values to count. This hash will have size O(u)
. And then you can do a quickselect giving each value a weight equal to the count.
But for time, you have to read the whole array which is O(n)
. Unless you are happy with an approximate answer. In which case you can take a random selection from the array, figure out a hash of approximate counts, and quickselect that. Depending on the purpose, that may be close enough.