I'm wondering what the best approach is for finding the closest match to a given value in a collection of items. The most important part is the lookup time relative to the input size; the data can be shuffled and rearranged as much as needed beforehand, as long as lookups are faster as a result.
Here's the initial script:
import random

MAX = 1_000_000
MIN = 0
target: float = 333_333.0
collection: set[float] = {random.uniform(MIN, MAX) for _ in range(100_000)}
# --- Code here to find closest as fast as possible, now and for future lookups ---
assert closest in collection and all(abs(target - v) >= delta for v in collection)  # delta == abs(target - closest)
Iterate through all values and update closest accordingly. Very simple, but O(n) per lookup, which is very slow for big collections.
closest: float = next(iter(collection))  # get any element as the starting candidate
delta: float = abs(closest - target)
for v in collection:
    tmp_delta = abs(v - target)
    if tmp_delta < delta:
        closest = v
        delta = tmp_delta
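As a side note, the same linear scan can be written as a one-liner with the built-in min and a key function; the cost per lookup is still O(n):

closest = min(collection, key=lambda v: abs(v - target))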
Sort the data and then find the closest value via binary search. Sorting takes O(n log n), but every future lookup only takes O(log n)!
import bisect

sorted_collection: list[float] = sorted(collection)

def find_closest(sorted_collection: list[float], target: float) -> float:
    # bisect_right gives the insertion point; the closest value is one of its two neighbours
    idx = bisect.bisect_right(sorted_collection, target)
    if idx == 0:
        return sorted_collection[0]
    if idx == len(sorted_collection):
        return sorted_collection[-1]
    before, after = sorted_collection[idx - 1], sorted_collection[idx]
    return before if target - before <= after - target else after
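For example, checking it against the assertion from the initial script (find_closest is just the helper sketched above):

closest = find_closest(sorted_collection, target)
delta = abs(target - closest)
assert closest in collection and all(abs(target - v) >= delta for v in collection)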
Using dict with some custom hashing? I thought about implementing a __hash__ method for a custom class that could then be used to find values in (amortized) O(1) lookup time, but I couldn't get anything quite working without making some prior assumptions about the data involved, and I wonder if it is even possible.
And that is where I got to, in a nutshell. I am wondering if there is a faster way than the simple sorting + binary search approach, and if my idea with dict + a custom hash function is somehow possible.
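For illustration, a minimal sketch of one way such a bucketed dict could look; the bucket width and the expanding search over neighbouring buckets are assumptions about the data, which is exactly the problem I ran into:

from collections import defaultdict

BUCKET_WIDTH = 1_000.0  # assumed; has to be tuned to the data's range and density

buckets: dict[int, list[float]] = defaultdict(list)
for v in collection:
    buckets[int(v // BUCKET_WIDTH)].append(v)

def closest_via_buckets(target: float) -> float:
    # Expand outwards from the target's bucket until some bucket contains values,
    # then also check one extra ring, since the true closest may sit just across a boundary.
    key = int(target // BUCKET_WIDTH)
    radius = 0
    while True:
        keys = range(key - radius, key + radius + 1)
        candidates = [v for k in keys for v in buckets.get(k, [])]
        if candidates:
            candidates += [v for k in (key - radius - 1, key + radius + 1) for v in buckets.get(k, [])]
            return min(candidates, key=lambda v: abs(v - target))
        radius += 1

If the data is roughly uniform, the expected cost per lookup is O(1); in the worst case (all values crowded into one bucket, or a target far away from everything) it degrades back towards a full scan.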
With floats, binary search already scales as well as anything known. However, it requires looking at memory locations that are some distance apart from each other. A B-tree also scales as O(log n), but by keeping each set of decisions close together in memory it gains roughly an order-of-magnitude constant-factor improvement from cache locality.
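The standard library has no B-tree, but as an illustrative stand-in, the third-party sortedcontainers package stores values in flat chunks with similar cache-locality benefits; the same closest-value lookup then looks like this (a sketch, not a drop-in for the code above):

from sortedcontainers import SortedList  # third-party: pip install sortedcontainers

sl = SortedList(collection)  # built once; stays sorted under later inserts/removals

def closest_in_sortedlist(target: float) -> float:
    idx = sl.bisect_right(target)  # O(log n) per lookup
    if idx == 0:
        return sl[0]
    if idx == len(sl):
        return sl[-1]
    before, after = sl[idx - 1], sl[idx]
    return before if target - before <= after - target else after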
The hash idea doesn't work because a good hash function will scatter nearby values in a pseudorandom way, and therefore you have no better way to find those nearby values than to scan the whole hash table.
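You can see this scattering with CPython's own float hash: non-integral values that differ only far past the decimal point produce completely unrelated hashes, so nothing about the hash tells you which entries are numerically close.

for v in (333333.00000001, 333333.00000002, 333333.00000003):
    print(v, hash(v))  # nearby floats, wildly different (pseudorandom-looking) hashes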