[SOLVED] KD-Tree and Ball-Tree algorithm appearing wrong (to my untrained eye)

KD-Tree and Ball-Tree algorithm appearing wrong (to my untrained eye)

The KD-Tree and Ball-Tree algorithm are giving me unexpected results in a new pyspark notebook running on databricks.

I have created a new databricks notebook (pyspark) and copied the following code in:

import numpy as np
from sklearn.neighbors import BallTree
matrix = np.array([
          [0, 1, 4, 2, 1],
          [1, 0, 2, 1, 1],
          [4, 2, 0, 3, 1],
          [2, 1, 3, 0, 1],
          [1, 1, 1, 1, 0]
      ])
tree = BallTree(matrix)
dist, ind = tree.query(matrix, k=5)
print(ind)

And the ind output is:

[[0 1 3 4 2]
 [1 4 3 0 2]
 [2 4 1 3 0]
 [3 1 4 0 2]
 [4 1 3 0 2]]

My expectation would be that the first row should read [[0 1 4 3 2] or it should read [[0 4 1 3 2], rather than [[0 1 3 4 2]. Have I missed something / misunderstood? The same also occurs when I use a KDTree instead of a ball tree.

Solution

You should take into account that BallTree and KDTree took as their inputs the coordianates of points in N-dim space (not the matrix of pairwise distances). The following calculation proves that the result returned by BallTree is correct:

from scipy.spatial import distance_matrix

distance_matrix(X, X)**2

# output
array([[ 0.,  7., 34.,  9., 12.],
       [ 7.,  0., 21.,  4.,  3.],
       [34., 21.,  0., 23., 16.],
       [ 9.,  4., 23.,  0.,  7.],
       [12.,  3., 16.,  7.,  0.]])

(distance_matrix(X, X)**2).argsort(axis=1)

# output
array([[0, 1, 3, 4, 2],
       [1, 4, 3, 0, 2],
       [2, 4, 1, 3, 0],
       [3, 1, 4, 0, 2],
       [4, 1, 3, 0, 2]])