I have an interesting problem but cannot find the relevant literature, so need algorithem suggestions or just the correct way to phrase the problem so I can find the literature.
I'm given a vector of floats of varying length coming from a 2D array. I know the orientation (i.e. which axis I'm looking in). I need to find the start index of the vector. Essentially, I'm cross-correlating a short vector with a long vector.
I have implemented a brute force approach but naturally it grows as O(n^2)
where n
is the dimension of the 2D array.
Q1) What is an efficient algorithem to tackel this problem?
Q2) How is this type of problem called (so I can find relevant papers)?
There is measurement error so it will never be an exact match.
Here the brute force approach, where I look for the minimum of the norm of two vectors:
import time
import numpy as np
def get_distance(a: np.ndarray, b: np.ndarray) -> float:
""" Calculate the distance between two vectors. """
return np.linalg.norm(a - b)
def find_min_distance_subvector(array: np.ndarray, x: np.ndarray) -> tuple[int, float]:
leng = len(x)
min_index = 0
min_distance = np.inf
# Assuming we know the orientation of the vector
for ii in range(0, len(array) - leng):
# Extract a sub-vector of the same length as x, starting from index ii
sub_vec = array[ii:ii + leng]
dist = get_distance(sub_vec, x)
if dist < min_distance:
min_distance = dist
min_index = ii
return min_index, min_distance
def main():
leng = 100
size = 2000
# Create the search map
arr = np.random.random((size, size))
# Pick a random sub-vector from the map
index = np.random.randint(0, size - leng)
x = arr[index:index + leng]
start_time = time.time()
min_index, min_metric = find_min_distance_subvector(arr, x)
end_time = time.time()
print(f"Minimum distance: {min_metric} at index {min_index}, correct index: {index}, "
f"time taken: {end_time - start_time:.4f} seconds")
if __name__ == '__main__':
main()
Thank you for your help
For your case, to improve from O(n^2), consider following algorithms:
Instead of recalculating the distance for every sub-vector from scratch, you can maintain a running distance calculation as you slide through the array. This can significantly reduce the number of calculations needed.
If the problem allows, you can use dynamic programming to store previously computed distances and reuse them, which can help in reducing redundant calculations.
If your vector and the 2D array have periodic properties, you might consider using FFT to find correlations efficiently. This can reduce the complexity to O(n log n).
If your problem can be framed as a substring search (where the vector is treated as a pattern), the KMP algorithm can be used to find occurrences of the vector in the 2D array efficiently.
For high-dimensional data, LSH can be used to approximate nearest neighbors efficiently, which might help in finding the start index of the vector.