machine-learningscipytreenearest-neighborkdtree

Understanding `leafsize` in scipy.spatial.KDTree


Problem statement:

I have 150k points in a 3D space with their coordinates stored in a matrix with dimension [150k, 3] in mm.

I want to find all the neighbors of a given point p that are within a radius r. And I want to do that in the most accurate way.

How should I choose my leafsize parameter ?

from scipy.spatial import KDTree
import numpy as np

pts = np.random.rand(150000,3)

T1 = KDTree(pts, leafsize=20)
T2 = KDTree(pts, leafsize=1)

neighbors1= T1.query_ball_point((0.3,0.2,0.1), r=2.0)
neighbors2= T2.query_ball_point((0.3,0.2,0.1), r=2.0)

np.allclose(sorted(neighbors1), sorted(neighbors2))
True

Solution

  • The function query_ball_point will return the correct set of points for any version of the search tree. The leafsize parameter does not impact the results of the query, only the performance of the results.

    Imagine two trees shown below for the same data (but different leafsize parameters) and a query searching for all points inside the red circle.

    Example Search Trees

    In both cases, the code will only return the two points that lie inside the red circle. This is done by checking all points in all boxes of the tree that intersect the circle. This leads to a different amount of work (i.e., different performance) in each case. For the left tree (corresponding to a larger leafsize), the algorithm has to check if 13 points are inside the circle (6 in the upper intersecting box and 7 in the lower intersecting box). In the right tree (which has a smaller leaf size), only three points get processed (one in the upper intersecting box and two in the lower intersecting box).

    Following this logic, you may think it just makes sense to always use a small leaf size: this will minimize the number of actual comparisons at the end of the algorithm (do decide if the points actually lie in the query region). But it isn't that simple: the smaller leaf size will generate a deeper tree adding cost to the construction time and to the tree traversal time. Getting the right balance of tree-traversal performance with the leaf-level comparisons really depends on the type of data going into the tree and the specific leaf-level comparisons you are doing. Which is why scipy provides the leafsize parameter as an argument so you can tune things to perform best on a particular algorithm.