numpyscikit-learnscipydistance-matrix

How to build efficiently a distance matrix by reshaping numpy array arrays for scipy.cdist or using sklearn?


I am trying to build a distance matrix with array of different lenghts. The distance metric is hausdorff distance which is suitable for this kind of operations. Nonetheless I cant find a way to build a distance matrix using the scipy.cdist function.

I looked here for scipy cdist docs and here for hausdorff distance pip install traj-dist and here similar question.

Now I can get the distance between two arrays with either scipy or traj_dist libraries.

import numpy as np
from scipy.spatial import distance
import traj_dist.distance as tdist
from scipy.spatial.distance import directed_hausdorff
from sklearn.metrics import pairwise_distances
# np.float64 needed for tdist import
arr1 = np.array([10,22,1,22,32,88],
                dtype=np.float64).reshape(3,2)
arr2 = np.array([1,22,32,88,55,11,99,1233],
                dtype=np.float64).reshape(4,2)
# measuring just for 1 array at the time works fine
tdist.hausdorff(array_of_arrays[0],array_of_arrays[1])
directed_hausdorff(array_of_arrays[0],array_of_arrays[1])

I can calculate a distance matrix with a nested for loop, but that is extremely slow when n_observation is big.

n_observations = array_of_arrays.shape[0]
distance_matrix = np.zeros((n_observations, n_observations))

for i in range(n_observations):
    for j in range(i + 1, n_observations):
        dist = tdist.hausdorff(np.float64(array_of_arrays[i]),
                               np.float64(array_of_arrays[j]),
                               type_d='spherical')
        distance_matrix[i, j] = dist
        distance_matrix[j, i] = dist

But I cant get it to work on a bigger scale using scipy.cdist.

array_of_arrays = np.array([arr1, arr2])

distance.cdist(array_of_arrays, array_of_arrays,
               lambda x, y: tdist.hausdorff(x,y))
distance.cdist(array_of_arrays, array_of_arrays,
               lambda x, y: directed_hausdorff(x,y))

The sklearn.metrics.pairwise_distance does not work either


pairwise_distances(array_of_arrays, metric=tdist.hausdorff)

The question is: How to reshape the array_of_arrays to use scipy.cdist on it?

Bonus sub-question:If scipy.cdist is not suited for such a task, what should I use to avoid the nested for loops and calculate a distance_matrix?


Solution

  • From what I see the problem is that that directed Hausdorff distance is computed between two arrays with dimensions of (N,D) and (M,D) and it is not defined for the one-dimensional arrays (i.e. vectors). cdist takes only 2D arrays as inputs which means that each row of these tensors is treated as a separate observation. To solve this problem using SciPy's cdist you would need to be able to pass 3D tensors to cdist which it doesn't allow by design (it checks number of tensor's dimensions).

    One possible workaround (which is not easy) is to take the source code of the directed Hausdorff distance written in Cython (https://github.com/scipy/scipy/blob/v1.5.2/scipy/spatial/_hausdorff.pyx) and try to vectorize it.