The KD-Tree and Ball-Tree algorithm are giving me unexpected results in a new pyspark notebook running on databricks.
I have created a new databricks notebook (pyspark) and copied the following code in:
import numpy as np
from sklearn.neighbors import BallTree
matrix = np.array([
[0, 1, 4, 2, 1],
[1, 0, 2, 1, 1],
[4, 2, 0, 3, 1],
[2, 1, 3, 0, 1],
[1, 1, 1, 1, 0]
])
tree = BallTree(matrix)
dist, ind = tree.query(matrix, k=5)
print(ind)
And the ind output is:
[[0 1 3 4 2]
[1 4 3 0 2]
[2 4 1 3 0]
[3 1 4 0 2]
[4 1 3 0 2]]
My expectation would be that the first row should read [[0 1 4 3 2] or it should read [[0 4 1 3 2], rather than [[0 1 3 4 2]. Have I missed something / misunderstood? The same also occurs when I use a KDTree instead of a ball tree.
You should take into account that BallTree
and KDTree
took as their inputs the coordianates of points in N-dim space (not the matrix of pairwise distances). The following calculation proves that the result returned by BallTree
is correct:
from scipy.spatial import distance_matrix
distance_matrix(X, X)**2
# output
array([[ 0., 7., 34., 9., 12.],
[ 7., 0., 21., 4., 3.],
[34., 21., 0., 23., 16.],
[ 9., 4., 23., 0., 7.],
[12., 3., 16., 7., 0.]])
(distance_matrix(X, X)**2).argsort(axis=1)
# output
array([[0, 1, 3, 4, 2],
[1, 4, 3, 0, 2],
[2, 4, 1, 3, 0],
[3, 1, 4, 0, 2],
[4, 1, 3, 0, 2]])