I have two dataframes: `STORE_LIST_A` (50,000 rows) and `STORE_LIST_B` (30,000 rows). Both contain the same three columns: `STORE_ID`, `LATITUDE` and `LONGITUDE`.
| STORE_ID | LATITUDE | LONGITUDE |
|---|---|---|
| ABPS693P | 8.0951349 | 77.5472305 |
| ABTS730P | 8.1024627 | 77.5581893 |
What I want to do is calculate the distance from each store in `STORE_LIST_A` to every store in `STORE_LIST_B`, and keep only those pairs that are within 1000 metres of each other.

I have tried doing this with a nested loop in which each store in `STORE_LIST_A` is compared to each store in `STORE_LIST_B`, and every pair within 1000 metres of each other is appended to an empty dataframe using `rbind`. The problem with this approach is that the loop takes days to run: with 50,000 rows in `STORE_LIST_A` and 30,000 rows in `STORE_LIST_B`, that is 50,000 × 30,000 = 1.5 billion distance calculations. Can someone please suggest a computationally efficient way to deal with this problem?
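For reference, the nested-loop approach described above looks roughly like this (my own sketch with made-up toy data, not the asker's actual code; `distHaversine` is from the geosphere package):

```r
library(geosphere)

# toy stand-ins for the real 50k/30k tables
STORE_LIST_A <- data.frame(STORE_ID  = c("A1", "A2"),
                           LATITUDE  = c(8.0951349, 8.1024627),
                           LONGITUDE = c(77.5472305, 77.5581893))
STORE_LIST_B <- data.frame(STORE_ID  = "B1",
                           LATITUDE  = 8.0952,
                           LONGITUDE = 77.5473)

pairs <- data.frame()
for (i in seq_len(nrow(STORE_LIST_A))) {
  for (j in seq_len(nrow(STORE_LIST_B))) {
    d <- distHaversine(c(STORE_LIST_A$LONGITUDE[i], STORE_LIST_A$LATITUDE[i]),
                       c(STORE_LIST_B$LONGITUDE[j], STORE_LIST_B$LATITUDE[j]))
    if (d < 1000) {
      # rbind inside a loop copies the whole frame on every append --
      # this, plus 1.5 billion iterations, is what takes days
      pairs <- rbind(pairs, data.frame(A = STORE_LIST_A$STORE_ID[i],
                                       B = STORE_LIST_B$STORE_ID[j],
                                       distance = d))
    }
  }
}
```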
OK, here's an attempt :) I ran into memory limits a few times, but that probably depends on how close together your stores are. I found a nice data set with all companies in San Francisco, so I ran the code fully on 30k × 50k combinations, BUT with my cutoff set to roughly 250 m apart.
libraries used
library(data.table)
library(geosphere)
just to prepare a similar data set
# downloaded all stores in SF
# https://towardsdatascience.com/plotting-spatial-data-in-r-a38a405a07f1
dt <- fread("sf_business_dataset.csv")
# prepare and convert the data comparable to your format
dt <- dt[, .SD, .SDcols = c("Location Id", "Business Location")]
dt[, c("LATITUDE", "LONGITUDE") := lapply(tstrsplit(gsub(".*\\((.+)\\)", "\\1", `Business Location`), ","), as.numeric)]
dt <- dt[!is.na(LATITUDE) & !is.na(LONGITUDE), .(STORE_ID = `Location Id`, LATITUDE, LONGITUDE)]
# randomize, not sure how it was ordered, so we get a randomized A and B list.
dt[, rIndex := sample(.N)] # random permutation of 1..N
setorder(dt, rIndex)
dt[, rIndex := NULL]
Let's play around: with n = 10000 we recreate your sample sizes; make n smaller to just test the code. After this we have the sample data to play with.
# simulate your sample sizes
n <- 10000
STORE_LIST_A <- dt[1:(3*n)]
STORE_LIST_B <- dt[(3*n+1):((3+5)*n)]
solution
This looks a bit redundant, but a difference of roughly 0.008 degrees corresponds to about 1 km, and we are not interested in any distance calculation beyond that. Note that on the SF data I made the radius smaller, as we would otherwise have too many combinations! I decided to use `foverlaps`, as post-filtering a Cartesian join would have resulted in 2.5 billion combinations; good luck with that :)
r <- 0.002 # set to 0.008 roughly for 1km
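As a quick sanity check on that rule of thumb (this snippet is my illustration, not part of the original pipeline): 0.008 degrees of latitude is just under 900 m, and 0.008 degrees of longitude is slightly less again here, since east-west distance shrinks with cos(latitude):

```r
library(geosphere)

# distHaversine takes (longitude, latitude) pairs
lat_step <- distHaversine(c(77.5472305, 8.0951349),
                          c(77.5472305, 8.0951349 + 0.008))
lon_step <- distHaversine(c(77.5472305, 8.0951349),
                          c(77.5472305 + 0.008, 8.0951349))
c(lat_step, lon_step)  # roughly 891 m and 882 m
```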
Note that in A we set the ranges, while in B we "clone" the start/end columns into zero-width intervals. We basically split the problem and handle the latitudes and longitudes separately: if either one of them alone puts two stores more than 1 km apart, the full distance between A and B will be too!
STORE_LIST_A[, LATITUDE_start := LATITUDE - r][, LATITUDE_end := LATITUDE + r]
STORE_LIST_A[, LONGITUDE_start := LONGITUDE - r][, LONGITUDE_end := LONGITUDE + r]
STORE_LIST_B[, LATITUDE_start := LATITUDE][, LATITUDE_end := LATITUDE]
STORE_LIST_B[, LONGITUDE_start := LONGITUDE][, LONGITUDE_end := LONGITUDE]
setkey(STORE_LIST_A, LATITUDE_start, LATITUDE_end)
setkey(STORE_LIST_B, LATITUDE_start, LATITUDE_end)
close_lat <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 62M matches
setkey(STORE_LIST_A, LONGITUDE_start, LONGITUDE_end)
setkey(STORE_LIST_B, LONGITUDE_start, LONGITUDE_end)
close_lon <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 53M matches
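If `foverlaps` is unfamiliar, this is the pattern it is being used for above, shown on a made-up one-dimensional toy: A carries the intervals, B carries zero-width "points", and each point is matched to every interval it falls within.

```r
library(data.table)

A <- data.table(id = c("a1", "a2"), start = c(0, 10), end = c(1, 11))
B <- data.table(id = c("b1", "b2"), start = c(0.5, 5), end = c(0.5, 5))

# foverlaps needs the interval columns as the last two key columns
setkey(A, start, end)
setkey(B, start, end)

# b1 (0.5) falls within a1's [0, 1]; b2 (5) matches nothing
foverlaps(B, A, type = "within", nomatch = 0)
```

Clashing column names from the first argument come back with an `i.` prefix, which is where the `i.STORE_ID` columns in the output below come from.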
Now we have to inner join the two, because the distance is only small enough if both the latitude gap and the longitude gap are small enough. On this data that resulted in "only" 5.3M matches for which you have to calculate the distance.
close_enough <- close_lat[close_lon, on = .(STORE_ID, i.STORE_ID), .(STORE_ID, LONGITUDE, LATITUDE, i.STORE_ID, i.LONGITUDE, i.LATITUDE), nomatch = 0]
Do some cleaning up and filter on the actually calculated distances; note that this is still pretty slow! But we only have to calculate the distance for the 5.3M candidate pairs here and not for 1500M :)
close_enough[, distance := distHaversine(c(LONGITUDE, LATITUDE), c(i.LONGITUDE, i.LATITUDE)), .(STORE_ID, i.STORE_ID)][distance < 1000]
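As a side note, `distHaversine` also accepts two matrices of (longitude, latitude) pairs, so the same distances can be computed in one vectorised call instead of one call per group, which is noticeably faster on millions of rows. A sketch on a small made-up candidate table (`cand` stands in for `close_enough`):

```r
library(geosphere)
library(data.table)

# toy candidate table with the same column layout as close_enough
cand <- data.table(
  STORE_ID    = c("A1", "A2"),
  LONGITUDE   = c(77.5472305, 77.5581893),
  LATITUDE    = c(8.0951349, 8.1024627),
  i.STORE_ID  = c("B1", "B2"),
  i.LONGITUDE = c(77.5473, 77.5700),
  i.LATITUDE  = c(8.0952, 8.1200)
)

# one vectorised distHaversine call over all candidate pairs at once,
# instead of grouping by (STORE_ID, i.STORE_ID)
cand[, distance := distHaversine(cbind(LONGITUDE, LATITUDE),
                                 cbind(i.LONGITUDE, i.LATITUDE))]
cand[distance < 1000]  # keeps only the A1/B1 pair here
```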
sample output (from a smaller test run)
STORE_ID LONGITUDE LATITUDE i.STORE_ID i.LONGITUDE i.LATITUDE distance
1: 0375071-01-001 -122.49 37.761 0465561-01-001 -122.49 37.761 51.190
2: 0925437-02-001 -122.49 37.781 1030867-06-151 -122.49 37.780 142.786
3: 0434463-01-001 -122.49 37.780 1030867-06-151 -122.49 37.780 136.233
4: 0460546-01-001 -122.48 37.782 0357002-01-001 -122.48 37.783 174.721
5: 0433580-02-001 -122.48 37.764 0412914-01-001 -122.48 37.763 91.625
---
500: 1020837-02-151 -122.39 37.762 1005253-08-141 -122.39 37.760 173.533
501: 0436520-02-001 -122.39 37.760 0371484-02-001 -122.39 37.760 0.000
502: 0474822-01-001 -122.39 37.760 0371484-02-001 -122.39 37.760 33.514
503: 1020837-02-151 -122.39 37.762 0371484-02-001 -122.39 37.760 173.533
504: 1022102-03-151 -122.39 37.740 0470578-01-001 -122.39 37.738 238.217