machine-learning, knn, processing-efficiency, supervised-learning

K-Nearest Neighbor - how many reference points/features?


I want to use KNN to train a model (I will use other ML models as well), but I'm just wondering...

I have around 6 features, with a total of, let's say, 60,000 (sixty thousand) reference points (so around 10,000 reference points per feature).

I know that this is not ideal from a computational point of view (for an algorithm like KNN), so should I use, for example, KD-trees, or is plain KNN okay for this number of features/reference points? If I have to calculate the distance between my test point and all the reference points (with, say, Euclidean distance in a multi-dimensional model), I can imagine that it will take quite some time.

I know that other (supervised) ML algorithms may be more efficient, but KNN is only one of the algorithms I will use.


Solution

  • The time complexity of a (naive) KNN query is O(kdn), where k is the number of neighbors, d is the dimensionality (6 in your case), and n is the number of points (60,000 in your case). A brute-force sketch appears below, after this list.

    Meanwhile, building a KD-tree from n points is O(dn log n), with subsequent nearest-neighbor lookups taking O(k log n) time. This is definitely much better: you sacrifice a little time upfront to build the KD-tree, but each KNN lookup later is much faster (see the scikit-learn sketch below).

    This is all under the assumption that your points are distributed in a "nice" way (see https://en.wikipedia.org/wiki/K-d_tree#Degradation_in_performance_when_the_query_point_is_far_from_points_in_the_k-d_tree for details). If they aren't distributed in a "nice" way, the KD-tree's guarantees degrade, and KNN in general might not be the way to go.
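
To make the brute-force cost concrete, here is a minimal NumPy sketch of a single naive KNN query: one O(dn) pass to compute every distance, then selection of the k nearest. The data is synthetic, shaped like the question's (60,000 points, 6 features), and `knn_brute_force` is just a hypothetical helper name.

```python
import numpy as np

def knn_brute_force(X, query, k):
    # Naive KNN query: compute the distance from `query` to all n points
    # (the O(d*n) part), then pick out the k smallest.
    dists = np.linalg.norm(X - query, axis=1)  # Euclidean distance to every point
    idx = np.argpartition(dists, k)[:k]        # k nearest, in arbitrary order
    return idx[np.argsort(dists[idx])]         # k nearest, sorted by distance

rng = np.random.default_rng(0)
X = rng.standard_normal((60_000, 6))           # synthetic stand-in for the real data
print(knn_brute_force(X, rng.standard_normal(6), k=5))
```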
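
For the KD-tree route, a sketch using scikit-learn (an assumption; the question names no library). `KNeighborsClassifier` with `algorithm="kd_tree"` pays the O(dn log n) build cost once, at fit time, so later predictions avoid the full 60,000-distance scan. The labels here are invented purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((60_000, 6))  # synthetic features, shaped like the question's
y = rng.integers(0, 2, 60_000)        # made-up binary labels, for illustration only

# The KD-tree is built once here; each predict() is then a cheap tree lookup
# instead of a scan over all 60,000 points.
clf = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
clf.fit(X, y)
print(clf.predict(rng.standard_normal((1, 6))))
```

Swapping `algorithm="kd_tree"` for `"brute"` is an easy way to time the difference on your own data.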