python, list, scikit-learn, euclidean-distance

How to exclude double values in sklearn.metrics.pairwise.euclidean_distances results


I am measuring the euclidean distances between multiple points, with their coordinates stored in an array.

from sklearn.metrics.pairwise import euclidean_distances
points = [[1,2], [1,3], [4,5], [2,6]]

distances = euclidean_distances(points)
distances
array([[0.        , 1.        , 4.24264069, 4.12310563],
       [1.        , 0.        , 3.60555128, 3.16227766],
       [4.24264069, 3.60555128, 0.        , 2.23606798],
       [4.12310563, 3.16227766, 2.23606798, 0.        ]])

In the array that is returned, every value occurs twice. Is there a way to efficiently return values that only occur once? This would be my preferred outcome:

[1.0, 4.242640687119285, 4.123105625617661, 3.605551275463989, 3.1622776601683795, 2.23606797749979]

I looked at the documentation for the euclidean_distances function, but there does not seem to be an argument to exclude duplicate values.

I can exclude the double values the following way:

dist_list = []
for i in range(len(distances)):
    unique_dist = distances[i][i+1:]
    dist_list.extend(unique_dist)

but I am wondering if there is a more efficient way. I do not want to use unique(), as there might be double distances in my data.


Solution

  • NumPy has a very useful function, np.triu_indices_from, for extracting the indices of the upper triangular part of a matrix (np.tril_indices_from does the same for the lower part). I set k=1 to exclude the diagonal here; if you want to include it, use k=0.

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances
    points = [[1,2], [1,3], [4,5], [2,6]]
    
    distances = euclidean_distances(points)
    distances[np.triu_indices_from(distances, k=1)]
    
    array([1.        , 4.24264069, 4.12310563, 3.60555128, 3.16227766,
           2.23606798])
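
    As a quick check that the upper-triangle approach gives the list asked for in the question, here is a minimal, dependency-light sketch. The helper name pairwise_upper is hypothetical, and the distance matrix is built with plain NumPy broadcasting as a stand-in for euclidean_distances, so only NumPy is assumed to be installed.

```python
import numpy as np

def pairwise_upper(points, k=1):
    """Hypothetical helper: flat list of upper-triangular pairwise distances."""
    pts = np.asarray(points, dtype=float)
    # Full symmetric distance matrix via broadcasting
    # (assumption: used here in place of sklearn's euclidean_distances).
    diff = pts[:, None, :] - pts[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    # k=1 skips the diagonal; k=0 would keep the zero self-distances.
    return distances[np.triu_indices_from(distances, k=k)].tolist()

points = [[1, 2], [1, 3], [4, 5], [2, 6]]
unique_dist = pairwise_upper(points)
print(unique_dist)
```

    The values come out in row-major order, (0,1), (0,2), (0,3), (1,2), (1,3), (2,3), which is the same order the loop in the question produces. Genuinely repeated distances in the data are kept, since nothing is deduplicated by value. If SciPy is available, scipy.spatial.distance.pdist(points) should return the same condensed vector directly without building the full matrix.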