pythonscipypdist

scipy pdist getting only two closest neighbors


I've been computing pairwise distances with scipy, and I am trying to get distances to two of the closest neighbors. My current working solution is:

dists = squareform(pdist(xs.todense()))
dists = np.sort(dists, axis=1)[:, 1:3]

However, the squareform method is spatially very expensive and somewhat redundant in my case. I only need the two closest distances, not all of them. Is there a simple workaround?

Thanks!


Solution

  • The relation between linear index and the (i, j) of the upper triangle distance matrix is not directly, or easily, invertible (see note 2 in squareform doc).

    However, by looping over all indices the inverse relation can be obtained:

    import numpy as np
    import matplotlib.pyplot as plt
    
    from scipy.spatial.distance import pdist
    
    def inverse_condensed_indices(idx, n):
        k = 0
        for i in range(n):
            for j in range(i+1, n):
                if k == idx:
                    return (i, j)
                k +=1
        else:
            return None
    
    # test
    points = np.random.rand(8, 2)
    distances = pdist(points)
    sorted_idx = np.argsort(distances)
    n = points.shape[0]
    ij = [inverse_condensed_indices(idx, n)
          for idx in sorted_idx[:2]]
    
    # graph
    plt.figure(figsize=(5, 5))
    for i, j in ij:
        x = [points[i, 0], points[j, 0]]
        y = [points[i, 1], points[j, 1]]
        plt.plot(x, y, '-', color='red');
    
    plt.plot(points[:, 0], points[:, 1], '.', color='black');
    plt.xlim(0, 1); plt.ylim(0, 1);
    

    It seems to be a little faster than using squareform:

    %timeit squareform(range(28))
    # 9.23 µs ± 63 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    %timeit inverse_condensed_indices(27, 8)
    # 2.38 µs ± 25 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)