pythonpysparkkdtree

KD-Tree and Ball-Tree algorithm appearing wrong (to my untrained eye)


The KD-Tree and Ball-Tree algorithm are giving me unexpected results in a new pyspark notebook running on databricks.

I have created a new databricks notebook (pyspark) and copied the following code in:

import numpy as np
from sklearn.neighbors import BallTree
matrix = np.array([
          [0, 1, 4, 2, 1],
          [1, 0, 2, 1, 1],
          [4, 2, 0, 3, 1],
          [2, 1, 3, 0, 1],
          [1, 1, 1, 1, 0]
      ])
tree = BallTree(matrix)
dist, ind = tree.query(matrix, k=5)
print(ind)

And the ind output is:

[[0 1 3 4 2]
 [1 4 3 0 2]
 [2 4 1 3 0]
 [3 1 4 0 2]
 [4 1 3 0 2]]

My expectation would be that the first row should read [[0 1 4 3 2] or it should read [[0 4 1 3 2], rather than [[0 1 3 4 2]. Have I missed something / misunderstood? The same also occurs when I use a KDTree instead of a ball tree.


Solution

  • You should take into account that BallTree and KDTree took as their inputs the coordianates of points in N-dim space (not the matrix of pairwise distances). The following calculation proves that the result returned by BallTree is correct:

    from scipy.spatial import distance_matrix
    
    distance_matrix(X, X)**2
    
    # output
    array([[ 0.,  7., 34.,  9., 12.],
           [ 7.,  0., 21.,  4.,  3.],
           [34., 21.,  0., 23., 16.],
           [ 9.,  4., 23.,  0.,  7.],
           [12.,  3., 16.,  7.,  0.]])
    
    (distance_matrix(X, X)**2).argsort(axis=1)
    
    # output
    array([[0, 1, 3, 4, 2],
           [1, 4, 3, 0, 2],
           [2, 4, 1, 3, 0],
           [3, 1, 4, 0, 2],
           [4, 1, 3, 0, 2]])