pythonscipysparse-matrixdistance-matrix

Produce sparse pairwise distance matrix python avoiding memory error


I have some geospatial data and I am trying to produce pairwise haversine distances between points. Unfortunately, there are around 50k data points, which will produce a 50000x50000 distance matrix. Since this distance matrix is symmetric, I was thinking that this matrix could be sparse. I would like to be able to produce a sparse matrix straight from the distance computation, so I don't get memory errors. Here is my code so far:

def convert_to_arrays(df1, df2):
    d1 = np.array(df1[['x','y']].values.tolist())
    d2 = np.array(df2[['x','y']].values.tolist())
    return d1,d2

def haversine(data1, data2):
    data1 = np.deg2rad(data1)                     
    data2 = np.deg2rad(data2)                     

    lat1 = data1[:,0]                     
    lng1 = data1[:,1]         

    lat2 = data2[:,0]                     
    lng2 = data2[:,1]         

    diff_lat = lat1[:,None] - lat2
    diff_lng = lng1[:,None] - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

dist = haversine(*convert_to_arrays(df, df))

Where df is a pandas data frame with columns 'x' and 'y'. When I run this code with around 50k points, I get a memory error:

MemoryError: Unable to allocate 19.3 GiB for an array with shape (50893, 50893) and data type float64

Is there a way to ensure that the calculation is performed with a (lower or upper triangular) sparse matrix as the intended output?


Solution

  • The way I found to accomplish this was pretty brute force; if you have a more elegant solution please post it. Essentially, I take the points from the first data frame in chunks and compute pairwise distances with all the points in the second data frame, yielding an Nx50893 array. Then I convert this array to a sparse lower triangular matrix. Finally, I compute the next chunk, convert to lower triangular and stack these arrays.

    def haversine(data1, data2):
        data1 = np.deg2rad(data1)                     
        data2 = np.deg2rad(data2)                     
    
        lat1 = data1[:,0]                     
        lng1 = data1[:,1]         
    
        lat2 = data2[:,0]                     
        lng2 = data2[:,1]         
    
        #chunk it out 
        N = 2000 #chunk size
        start = 0
        end = N
        #First iteration
        diff_lat = lat1[:N,None] - lat2
        diff_lng = lng1[start:end,None] - lng2()
        d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:end,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
        D = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
        start = end
        end += N
        L = len(lat1) - end
        #A bunch more iterations
        while L>0:
            diff_lat = lat1[start:end,None] - lat2
            diff_lng = lng1[start:end,None] - lng2
            d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:end,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
            d = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
            D = scipy.sparse.vstack([D,d])
            start = end
            end += N
            L = len(lat1) - end
        #Last iteration    
        diff_lat = lat1[start:,None] - lat2
        diff_lng = lng1[start:,None] - lng2
        d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
        d = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
        D = scipy.sparse.vstack([D,d])
            
        return D
    
    dist = haversine(*convert_to_arrays(df, df)