I am trying to build a distance matrix with array of different lenghts. The distance metric is hausdorff distance which is suitable for this kind of operations. Nonetheless I cant find a way to build a distance matrix using the scipy.cdist
function.
I looked here for scipy cdist docs and here for hausdorff distance pip install traj-dist
and here similar question.
Now I can get the distance between two arrays with either scipy
or traj_dist
libraries.
import numpy as np
from scipy.spatial import distance
import traj_dist.distance as tdist
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics import pairwise_distances
# np.float64 needed for tdist import
arr1 = np.array([10,22,1,22,32,88],
dtype=np.float64).reshape(3,2)
arr2 = np.array([1,22,32,88,55,11,99,1233],
dtype=np.float64).reshape(4,2)
# measuring just for 1 array at the time works fine
tdist.hausdorff(array_of_arrays[0],array_of_arrays[1])
directed_hausdorff(array_of_arrays[0],array_of_arrays[1])
I can calculate a distance matrix with a nested for loop, but that is extremely slow when n_observation
is big.
n_observations = array_of_arrays.shape[0]
distance_matrix = np.zeros((n_observations, n_observations))
for i in range(n_observations):
for j in range(i + 1, n_observations):
dist = tdist.hausdorff(np.float64(array_of_arrays[i]),
np.float64(array_of_arrays[j]),
type_d='spherical')
distance_matrix[i, j] = dist
distance_matrix[j, i] = dist
But I cant get it to work on a bigger scale using scipy.cdist
.
array_of_arrays = np.array([arr1, arr2])
distance.cdist(array_of_arrays, array_of_arrays,
lambda x, y: tdist.hausdorff(x,y))
distance.cdist(array_of_arrays, array_of_arrays,
lambda x, y: directed_hausdorff(x,y))
The sklearn.metrics.pairwise_distance
does not work either
pairwise_distances(array_of_arrays, metric=tdist.hausdorff)
The question is: How to reshape the array_of_arrays
to use scipy.cdist
on it?
Bonus sub-question:If scipy.cdist
is not suited for such a task, what should I use to avoid the nested for loops and calculate a distance_matrix?
From what I see the problem is that that directed Hausdorff distance is computed between two arrays with dimensions of (N,D) and (M,D) and it is not defined for the one-dimensional arrays (i.e. vectors). cdist takes only 2D arrays as inputs which means that each row of these tensors is treated as a separate observation. To solve this problem using SciPy's cdist you would need to be able to pass 3D tensors to cdist which it doesn't allow by design (it checks number of tensor's dimensions).
One possible workaround (which is not easy) is to take the source code of the directed Hausdorff distance written in Cython (https://github.com/scipy/scipy/blob/v1.5.2/scipy/spatial/_hausdorff.pyx) and try to vectorize it.