python spatial

Calculating bilateral distances across a large dataframe (>1 mil rows) of longitude and latitude


Hi, I've attempted this with pandas by creating a distance matrix of all 1 million locations (so a symmetric matrix of 1 million by 1 million). Needless to say, it does not work, as it is too big for pandas :(

I'm trying to find an alternative to this. I have tried Dask and Vaex but am struggling. Can I get some help, please? What would you use if you had to create a distance matrix to calculate pairwise distances between more than 1 million locations?

Ideally a distance matrix of pairwise distances


Solution

  • When I was working with loads of lat/lon data in the past, we'd use triangular mesh decomposition to decide whether a point was within range of another point before we'd do any further math on it. Since these kinds of hierarchical decompositions are "simply" checking integers for containment, it was cheap to decide whether two points were close enough to justify an expensive geodesic calculation.

    Since my solution is not open source, I'd recommend that you check out the S2 Geometry library. It can convert your lat/lon pairs into 64-bit integers that hold the property that "Two points that are close in number are close in space, though two points close in space are not necessarily close in number". By creating Cells you can bucket your million points, so that instead of one O(N^2) calculation over everything you only compare each point of interest with the points close to it, which cuts N significantly. These buckets also lend themselves to embarrassingly parallel sub-problems.

    That's how I did it.
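
To make the bucketing idea a bit more concrete, here is a minimal sketch in plain Python. It is not the closed-source code mentioned above, and it does not use S2: a simple lat/lon grid stands in for S2 cells, purely for illustration. The names `haversine_km`, `cell_of`, `pairs_within` and the parameter `cell_deg` are made up for this example. It assumes you only care about pairs closer than some cutoff `max_km`, that `cell_deg` is chosen large enough that any two points within `max_km` fall in the same or adjacent grid cells, and that your data stays away from the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def cell_of(lat, lon, cell_deg):
    """Integer grid cell containing the point (hypothetical stand-in for an S2 cell)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def pairs_within(points, max_km, cell_deg):
    """Yield (i, j, km) with i < j for pairs no farther apart than max_km.

    Only points in the same or adjacent cells are compared, so the caller
    must pick cell_deg so that one cell is at least max_km wide at the
    latitudes present in the data (assumption of this sketch).
    """
    buckets = defaultdict(list)
    for idx, (lat, lon) in enumerate(points):
        buckets[cell_of(lat, lon, cell_deg)].append(idx)

    for (cy, cx), members in buckets.items():
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                neighbours = buckets.get((cy + dy, cx + dx))
                if not neighbours:
                    continue
                for i in members:
                    for j in neighbours:
                        if i < j:  # report each unordered pair once
                            d = haversine_km(*points[i], *points[j])
                            if d <= max_km:
                                yield i, j, d


# Small usage example with (lat, lon) tuples in degrees.
points = [
    (48.8566, 2.3522),   # Paris
    (48.8049, 2.1204),   # Versailles
    (51.5074, -0.1278),  # London
    (48.9362, 2.3574),   # Saint-Denis
    (49.0100, 2.3500),   # a point just across a 1-degree cell boundary
]
close = list(pairs_within(points, max_km=50.0, cell_deg=1.0))
for i, j, km in close:
    print(i, j, round(km, 1))
```

The result is a sparse list of close pairs rather than a dense 1 million by 1 million matrix, which is the point of the filtering step. Each bucket and its neighbours can be processed independently, so the outer loop is the part you would hand to Dask or a process pool.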