I have some geospatial data and I am trying to produce pairwise haversine distances between points. Unfortunately, there are around 50k data points, which will produce a 50000x50000 distance matrix. Since this distance matrix is symmetric, I was thinking that this matrix could be sparse. I would like to be able to produce a sparse matrix straight from the distance computation, so I don't get memory errors. Here is my code so far:
def convert_to_arrays(df1, df2):
d1 = np.array(df1[['x','y']].values.tolist())
d2 = np.array(df2[['x','y']].values.tolist())
return d1,d2
def haversine(data1, data2):
data1 = np.deg2rad(data1)
data2 = np.deg2rad(data2)
lat1 = data1[:,0]
lng1 = data1[:,1]
lat2 = data2[:,0]
lng2 = data2[:,1]
diff_lat = lat1[:,None] - lat2
diff_lng = lng1[:,None] - lng2
d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
dist = haversine(*convert_to_arrays(df, df))
Where df
is a pandas data frame with columns 'x' and 'y'. When I run this code with around 50k points, I get a memory error:
MemoryError: Unable to allocate 19.3 GiB for an array with shape (50893, 50893) and data type float64
Is there a way to ensure that the calculation is performed with a (lower or upper triangular) sparse matrix as the intended output?
The way I found to accomplish this was pretty brute force; if you have a more elegant solution please post it. Essentially, I take the points from the first data frame in chunks and compute pairwise distances with all the points in the second data frame, yielding an Nx50893 array. Then I convert this array to a sparse lower triangular matrix. Finally, I compute the next chunk, convert to lower triangular and stack these arrays.
def haversine(data1, data2):
data1 = np.deg2rad(data1)
data2 = np.deg2rad(data2)
lat1 = data1[:,0]
lng1 = data1[:,1]
lat2 = data2[:,0]
lng2 = data2[:,1]
#chunk it out
N = 2000 #chunk size
start = 0
end = N
#First iteration
diff_lat = lat1[:N,None] - lat2
diff_lng = lng1[start:end,None] - lng2()
d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:end,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
D = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
start = end
end += N
L = len(lat1) - end
#A bunch more iterations
while L>0:
diff_lat = lat1[start:end,None] - lat2
diff_lng = lng1[start:end,None] - lng2
d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:end,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
d = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
D = scipy.sparse.vstack([D,d])
start = end
end += N
L = len(lat1) - end
#Last iteration
diff_lat = lat1[start:,None] - lat2
diff_lng = lng1[start:,None] - lng2
d = np.sin(diff_lat/2)**2 + np.cos(lat1[start:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
d = scipy.sparse.tril((2 * 6378137 * np.arcsin(np.sqrt(d))).astype(np.float32))
D = scipy.sparse.vstack([D,d])
return D
dist = haversine(*convert_to_arrays(df, df)