pythonarraysnumpymatrixnearest-neighbor

How to reproduce `kneighbors_graph(include_self=True)` using `KNeighborsTransformer` in sklearn?


My ultimate goal is replace some methods that use kneighbors_graph with transformers from the sklearn-ann package. All the methods in sklearn-ann are implemented as sklearn-compatible transformer objects. However, the function I'm trying to replace uses kneighbors_graph(mode="connectivity", include_self=True) and I'm having a hard time converting the distance output with include_self=False to this type of connectivity matrix. Not all the transformer objects allow for connectivity mode while including self but all provide access to distance calculations without self.

I'm able to reproduce the kneighbors_graph(mode="connectivity", include_self=True) from kneighbors_graph(mode="distance", include_self=True) (referring to as nn_with_self). However, I'm unable to reproduce it from kneighbors_graph(mode="distance", include_self=False) (referring to as nn_without_self) which is the same output as KNeighborsTransformer(mode="distance").fit_transform.

I see that the nn_without_self is a super set of nn_with_self but I don't know how the backend algorithm selects which fields are kept.

How can I recreate nn_with_self from the nn_without_self matrix below?

Further, how can I operate on sparse matrices the whole time without converting to dense matrices?

I tried looking at the backend code but it's like an inception of class inheritance and I find myself pouring through several files at the same time losing track on the GitHub.

from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph, KNeighborsTransformer

X, _ = make_classification(n_samples=10, n_features=4, n_classes=2, n_clusters_per_class=1, random_state=0)
n_neighbors=3

# Nearest neighbors
nn_with_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=True,n_jobs=-1).todense()
nn_without_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=False,n_jobs=-1).todense()
nn_from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=n_neighbors, metric="euclidean", n_jobs=-1).fit_transform(X)

np.all(nn_from_transformer == nn_without_self)
# True

np.all(nn_with_self == nn_without_self)
# False

# Is `nn_with_self` symmetric?
np.allclose(nn_with_self,nn_with_self.T)
# False

# Is `nn_without_self` symmetric?
np.allclose(nn_without_self,nn_without_self.T)
# False

Here are the actual arrays:

nn_with_self
# matrix([[0.        , 0.70550439, 0.        , 0.20463097, 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.        , 0.        , 0.        , 0.51947869, 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.44145655],
#         [0.        , 0.        , 0.        , 0.        , 0.50025504,
#          0.        , 0.        , 0.        , 0.49481662, 0.        ],
#         [0.20463097, 0.51947869, 0.        , 0.        , 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.        , 0.        , 0.50025504, 0.        , 0.        ,
#          0.        , 0.        , 0.        , 0.34132965, 0.        ],
#         [0.        , 0.88867318, 0.        , 0.        , 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.44956691],
#         [0.        , 0.        , 1.10390699, 0.        , 1.52953542,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.        , 0.        , 0.        , 0.        , 0.        ,
#          3.62670755, 0.        , 0.        , 0.        , 3.83571739],
#         [0.        , 0.        , 0.49481662, 0.        , 0.34132965,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.        , 0.44145655, 0.        , 0.        , 0.        ,
#          0.44956691, 0.        , 0.        , 0.        , 0.        ]])

nn_without_self
# matrix([[0.        , 0.70550439, 0.        , 0.20463097, 1.02852831,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.70550439, 0.        , 0.        , 0.51947869, 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.44145655],
#         [0.        , 0.        , 0.        , 0.        , 0.50025504,
#          0.        , 1.10390699, 0.        , 0.49481662, 0.        ],
#         [0.20463097, 0.51947869, 0.        , 0.        , 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.95611187],
#         [1.02852831, 0.        , 0.50025504, 0.        , 0.        ,
#          0.        , 0.        , 0.        , 0.34132965, 0.        ],
#         [0.        , 0.88867318, 0.        , 1.40547465, 0.        ,
#          0.        , 0.        , 0.        , 0.        , 0.44956691],
#         [0.        , 0.        , 1.10390699, 0.        , 1.52953542,
#          0.        , 0.        , 0.        , 1.59848513, 0.        ],
#         [0.        , 4.1280709 , 0.        , 0.        , 0.        ,
#          3.62670755, 0.        , 0.        , 0.        , 3.83571739],
#         [1.36553076, 0.        , 0.49481662, 0.        , 0.34132965,
#          0.        , 0.        , 0.        , 0.        , 0.        ],
#         [0.        , 0.44145655, 0.        , 0.95611187, 0.        ,
#          0.44956691, 0.        , 0.        , 0.        , 0.        ]])

Solution

  • Just do n_neighbors - 1 for include_self=True.

    from sklearn.datasets import make_classification
    from sklearn.neighbors import kneighbors_graph, KNeighborsTransformer
    
    X, _ = make_classification(n_samples=10, n_features=4, n_classes=2, n_clusters_per_class=1, random_state=0)
    n_neighbors=3
    
    # Nearest neighbors
    nn_with_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=True,n_jobs=-1).todense()
    nn_from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=n_neighbors - 1, metric="euclidean", n_jobs=-1).fit_transform(X).todense()
    
    np.allclose(nn_from_transformer, nn_with_self)
    # True