I am measuring the Euclidean distances between multiple points, with their coordinates stored in an array.
from sklearn.metrics.pairwise import euclidean_distances
points = [[1,2], [1,3], [4,5], [2,6]]
distances = euclidean_distances(points)
distances
array([[0.        , 1.        , 4.24264069, 4.12310563],
       [1.        , 0.        , 3.60555128, 3.16227766],
       [4.24264069, 3.60555128, 0.        , 2.23606798],
       [4.12310563, 3.16227766, 2.23606798, 0.        ]])
In the array that is returned, every value occurs twice. Is there a way to efficiently return values that only occur once? This would be my preferred outcome:
[1.0, 4.242640687119285, 4.123105625617661, 3.605551275463989, 3.1622776601683795, 2.23606797749979]
I looked at the documentation for the euclidean_distances function, but there does not seem to be an argument to exclude duplicate values.
I can exclude the duplicate values the following way:
dist_list = []
for i in range(len(distances)):
    unique_dist = distances[i][i+1:]
    dist_list.extend(unique_dist)
but I am wondering if there is a more efficient way. I do not want to use unique(), as there might be duplicate distances in my data.
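For reference, the slicing idea in the loop above can also be written as a single list comprehension (same logic, same result; the point coordinates are the ones from the question):

```python
from sklearn.metrics.pairwise import euclidean_distances

points = [[1, 2], [1, 3], [4, 5], [2, 6]]
distances = euclidean_distances(points)

# Row i contributes its entries to the right of the diagonal,
# so each pairwise distance appears exactly once.
dist_list = [d for i, row in enumerate(distances) for d in row[i + 1:]]
print(dist_list)
```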
NumPy has a very useful function, np.triu_indices_from, to extract the indices of the upper (or lower) triangular part of a matrix. I set k=1 here to exclude the diagonal; if you want to include it, use k=0.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
points = [[1,2], [1,3], [4,5], [2,6]]
distances = euclidean_distances(points)
distances[np.triu_indices_from(distances, k=1)]
array([1.        , 4.24264069, 4.12310563, 3.60555128, 3.16227766,
       2.23606798])
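As an aside, if SciPy is available, scipy.spatial.distance.pdist computes this condensed (upper-triangular, diagonal-excluded) vector directly, without building the full square matrix first. A minimal sketch checking that it matches the triu_indices_from result:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import euclidean_distances

points = [[1, 2], [1, 3], [4, 5], [2, 6]]

# pdist returns each pair (i, j) with i < j exactly once,
# in the same row-major order as the upper triangle.
condensed = pdist(points)  # metric defaults to 'euclidean'

# Same values as slicing the upper triangle of the full matrix.
full = euclidean_distances(points)
upper = full[np.triu_indices_from(full, k=1)]
print(np.allclose(condensed, upper))
```

For large point sets this also halves the memory needed, since the symmetric half and the zero diagonal are never materialized.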