pythonnumpyscipyscipy-spatial

In python is there something similar to scipy.spatial.distance.cdist but for displacements (fast)?


I've been working on a research code for a few weeks and have been trying to speed it up by using cdist rather than a multi-level for loop to calculate the distances between every point in a matrix.

What I want:

    from scipy.spatial.distance import cdist
    import numpy as np

    a=np.array([[1],[2],[3]])
    cdist(a,a, lambda u,v: u-v)
[[ 0. -1. -2.]
 [ 1.  0. -1.]
 [ 2.  1.  0.]]

However, my issue is that in the context of my research a is pretty big and using a custom lambda function in cdist is significantly slower (~2 orders of magnitude) than cdist(a,a) - but this only gives positive values. i.e. In reality, I have to calculate this 15,000 times where a has 1,000 elements, so those 2oom matter a lot.

Note cdist(a,a) does not give the desired output as it is all positive values.

[[0. 1. 2.]
 [1. 0. 1.]
 [2. 1. 0.]]

I'm hoping you guys might have suggestions for how I could do something to create the desired signed output from cdist but more quickly than using a lambda function.

Thank you!


Solution

  • Depending on your distance metric and the the kind of data you have, you have different options:

    For your specific case, where the data is 1D and |u-v| == ( (u-v)^2 )^(1/2) you could just use your knowledge that the upper and the lower triangle of the distance matrix are equal in absolute terms and only differ with respect to the sign, so you can avoid a custom distance function:

    d = cdist(a, a)
    
    triu_bool = np.triu(np.ones((n_samples, n_samples), dtype=bool))
    triu_bool[range(n_samples), range(n_samples)] = False
    d[triu_bool] *= -1
    # [[ 0. -1. -2.]
    #  [ 1.  0. -1.]
    #  [ 2.  1.  0.]]
    

    The more general, and in my eyes better, approach is to simply use numpys broadcasting (see also this question/answer). Here an example for u-v:

    # Generate data
    n_dim = 3
    n_samples = int(1.5e4)
    arr = np.concatenate([np.arange(n_samples)[:, np.newaxis]] * n_dim, axis=-1)
    # array([[    0,     0,     0],
    #        [    1,     1,     1],
    #        [    2,     2,     2],
    #        ...,
    #        [14997, 14997, 14997],
    #        [14998, 14998, 14998],
    #        [14999, 14999, 14999]])
    
    # u - v
    d = arr[:, np.newaxis, :] - arr[np.newaxis, :, :]
    # (n_samples, n_samples, n_dim)
    

    For symmetric distance measures half of the calculations are unnecessary. But in my experience it is still faster than trying to apply the calculation only to the upper triangle or something similar.