Tags: r, loops, stringdist

Speeding up a nested loop in R for distance comparison


I have 2 data frames: STORE_LIST_A (50,000 rows) and STORE_LIST_B (30,000 rows). Both contain the same 3 columns: STORE_ID, LATITUDE, LONGITUDE

STORE_ID LATITUDE LONGITUDE
ABPS693P 8.0951349 77.5472305
ABTS730P 8.1024627 77.5581893
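
For illustration, the sample rows above as R code (both lists have the same structure):

    STORE_LIST_A <- data.frame(
      STORE_ID  = c("ABPS693P", "ABTS730P"),
      LATITUDE  = c(8.0951349, 8.1024627),
      LONGITUDE = c(77.5472305, 77.5581893)
    )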

What I want to do is calculate the distance from each store in STORE_LIST_A to every store in STORE_LIST_B, and keep only those pairs that are within 1000 metres of each other.

I have tried to do this using a nested loop where each store in STORE_LIST_A is compared to each store in STORE_LIST_B, and each pair within 1000 metres of each other is appended to an empty data frame using rbind.
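
A minimal reconstruction of that loop for illustration (the original code isn't shown here, and distHaversine from geosphere is an assumption about the distance function used):

    library(geosphere)

    # naive approach: compare every store in A against every store in B
    result <- data.frame()
    for (i in seq_len(nrow(STORE_LIST_A))) {
      for (j in seq_len(nrow(STORE_LIST_B))) {
        d <- distHaversine(
          c(STORE_LIST_A$LONGITUDE[i], STORE_LIST_A$LATITUDE[i]),
          c(STORE_LIST_B$LONGITUDE[j], STORE_LIST_B$LATITUDE[j])
        )
        if (d < 1000) {
          # rbind() inside a loop re-copies the growing result on every hit,
          # which adds to the cost of the 1.5 billion distance calls
          result <- rbind(result, data.frame(
            STORE_ID_A = STORE_LIST_A$STORE_ID[i],
            STORE_ID_B = STORE_LIST_B$STORE_ID[j],
            DISTANCE   = d
          ))
        }
      }
    }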

But the problem with this approach is that the loop takes days to run: with 50,000 rows in STORE_LIST_A and 30,000 rows in STORE_LIST_B, that is 50,000 × 30,000 = 1.5 billion distance calculations. Can someone please suggest a computationally efficient way to deal with this problem?


Solution

  • Ok, here's an attempt :) I ran into memory limits a few times, but that probably depends on how close your stores are. I found a nice data set with all companies in San Francisco, so I did run the code fully on 30k × 50k combinations, BUT with the criterion set to roughly 250 m apart.

    libraries used

    library(data.table)
    library(geosphere)
    

    just to prepare a similar data set

    # downloaded all stores in SF
    # https://towardsdatascience.com/plotting-spatial-data-in-r-a38a405a07f1
    dt <- fread("sf_business_dataset.csv")
    
    # prepare and convert the data comparable to your format
    dt <- dt[, .SD, .SDcols = c("Location Id", "Business Location")]
    dt[, c("LATITUDE", "LONGITUDE") := lapply(tstrsplit(gsub(".*\\((.+)\\)", "\\1", `Business Location`), ","), as.numeric)]
    dt <- dt[!is.na(LATITUDE) & !is.na(LONGITUDE), .(STORE_ID = `Location Id`, LATITUDE, LONGITUDE)]
    # shuffle the rows (the source file's ordering is unknown) so A and B are random samples
    dt[, rIndex := sample(1:.N, .N, replace = FALSE)]
    setorder(dt, rIndex)
    dt[, rIndex := NULL]
    

    Let's play around: with n of 10,000 we recreate your sizes (30k rows in A, 50k rows in B); make n smaller to just test the code. After this we have the sample data to play with.

    # simulate your sample sizes
    n <- 10000
    STORE_LIST_A <- dt[1:(3*n)]
    STORE_LIST_B <- dt[(3*n+1):((3+5)*n)]
    

    solution

    This looks a bit redundant, but a difference of roughly 0.008 decimal degrees equals about 1 km, and we are not interested in any distance calculation beyond that. Note that on the SF data I made the radius smaller, as we would otherwise get too many combinations! I decided to use foverlaps, as post-filtering a Cartesian join resulted in 2.5 billion combinations, good luck :)

    r <- 0.002 # ~250 m; set to roughly 0.008 for 1 km
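
    Where does that 0.008 come from? One degree of latitude spans roughly 111.3 km everywhere, while a degree of longitude spans 111.3 km times cos(latitude), so it shrinks away from the equator. A quick sketch of the arithmetic (my own helper values, not part of the original approach); a strictly conservative pre-filter would widen the longitude window by that cosine factor:

    # ~111,320 m per degree of latitude
    radius_m <- 1000
    r_lat <- radius_m / 111320                          # ~0.009 degrees
    # a degree of longitude is shorter away from the equator, e.g. at 37.8 N:
    r_lon <- radius_m / (111320 * cos(37.8 * pi / 180)) # ~0.011 degrees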
    

    Note that in A we set the ranges, while in B we "clone" the start/end into zero-width intervals. We basically split the problem and handle the latitudes and longitudes separately: if either one alone puts two stores more than 1 km apart, the full distance between A and B will be too!

    STORE_LIST_A[, LATITUDE_start := LATITUDE - r][, LATITUDE_end := LATITUDE + r]
    STORE_LIST_A[, LONGITUDE_start := LONGITUDE - r][, LONGITUDE_end := LONGITUDE + r]
    
    STORE_LIST_B[, LATITUDE_start := LATITUDE][, LATITUDE_end := LATITUDE]
    STORE_LIST_B[, LONGITUDE_start := LONGITUDE][, LONGITUDE_end := LONGITUDE]
    
    setkey(STORE_LIST_A, LATITUDE_start, LATITUDE_end)
    setkey(STORE_LIST_B, LATITUDE_start, LATITUDE_end)
    
    close_lat <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 62M matches
    
    setkey(STORE_LIST_A, LONGITUDE_start, LONGITUDE_end)
    setkey(STORE_LIST_B, LONGITUDE_start, LONGITUDE_end)
    
    close_lon <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 53M matches
    

    Now we have to inner join the two results, as the distance is only small enough if both the latitude gap and the longitude gap are small enough. On this data, the join below resulted in "only" 5.3M matches for which you have to calculate the distance.

    close_enough <- close_lat[close_lon, on = .(STORE_ID, i.STORE_ID), .(STORE_ID, LONGITUDE, LATITUDE, i.STORE_ID, i.LONGITUDE, i.LATITUDE), nomatch = 0]
    

    Do some cleaning up and filter on the actually calculated distances; note that this is still pretty slow! But we only have to calculate the distance for the 5.6M candidate pairs here, and not for 1,500M :)

    # grouping by the store pair forces a separate distHaversine() call per row
    close_enough[, distance := distHaversine(c(LONGITUDE, LATITUDE), c(i.LONGITUDE, i.LATITUDE)), .(STORE_ID, i.STORE_ID)][distance < 1000]
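
    Side note: distHaversine also accepts two-column (longitude, latitude) matrices, so the same step can be done in one vectorized call instead of one call per pair. A sketch on the same close_enough table:

    # one vectorized call over (lon, lat) matrices, no per-pair grouping needed
    close_enough[, distance := distHaversine(cbind(LONGITUDE, LATITUDE),
                                             cbind(i.LONGITUDE, i.LATITUDE))]
    close_enough[distance < 1000]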
    

    sample output (from a smaller test run)

               STORE_ID LONGITUDE LATITUDE     i.STORE_ID i.LONGITUDE i.LATITUDE distance
      1: 0375071-01-001   -122.49   37.761 0465561-01-001     -122.49     37.761   51.190
      2: 0925437-02-001   -122.49   37.781 1030867-06-151     -122.49     37.780  142.786
      3: 0434463-01-001   -122.49   37.780 1030867-06-151     -122.49     37.780  136.233
      4: 0460546-01-001   -122.48   37.782 0357002-01-001     -122.48     37.783  174.721
      5: 0433580-02-001   -122.48   37.764 0412914-01-001     -122.48     37.763   91.625
     ---                                                                                 
    500: 1020837-02-151   -122.39   37.762 1005253-08-141     -122.39     37.760  173.533
    501: 0436520-02-001   -122.39   37.760 0371484-02-001     -122.39     37.760    0.000
    502: 0474822-01-001   -122.39   37.760 0371484-02-001     -122.39     37.760   33.514
    503: 1020837-02-151   -122.39   37.762 0371484-02-001     -122.39     37.760  173.533
    504: 1022102-03-151   -122.39   37.740 0470578-01-001     -122.39     37.738  238.217