I have a data frame with about 240 million rows, each containing a pair of coordinate points (longitude, latitude) for pseudonym and pseudonym2. It looks like the following:
geo <- data.frame(
  pseudonym = c("aa", "bb", "cc"),
  long_pseudonym = c(9.37, 7.37, 9.86),
  lat_pseudonym = c(52.22, 51.58, 53.48),
  pseudonym2 = c("ee", "ff", "gg"),
  long_pseudonym2 = c(9.69, 8.37, 9.24),
  lat_pseudonym2 = c(51.22, 53.58, 50.48),
  dist_km = NA_real_
)
geo
pseudonym long_pseudonym lat_pseudonym pseudonym2 long_pseudonym2 lat_pseudonym2 dist_km
1 aa 9.37 52.22 ee 9.69 51.22 NA
2 bb 7.37 51.58 ff 8.37 53.58 NA
3 cc 9.86 53.48 gg 9.24 50.48 NA
My first idea was to calculate the column dist_km within dplyr::mutate() using geosphere::distGeo():
library(dplyr)
library(geosphere)

geo <- geo %>%
  rowwise() %>%  # evaluate the mutate() expression once per row
  mutate(dist_km = distGeo(
    p1 = c(long_pseudonym, lat_pseudonym),
    p2 = c(long_pseudonym2, lat_pseudonym2)) / 1000  # distGeo() returns metres
  ) %>%
  ungroup()
geo
pseudonym long_pseudonym lat_pseudonym pseudonym2 long_pseudonym2 lat_pseudonym2 dist_km
1 aa 9.37 52.22 ee 9.69 51.22 113.
2 bb 7.37 51.58 ff 8.37 53.58 233.
3 cc 9.86 53.48 gg 9.24 50.48 337.
So far, so good. For small tables this approach works, but for a big table (240 million rows) it is very time consuming.
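The bottleneck is rowwise(): it makes mutate() evaluate the expression once per row, so distGeo() is called 240 million times on tiny length-2 vectors. Roughly speaking (an illustrative sketch, not code I actually ran), the grouped evaluation boils down to a loop like this:

library(geosphere)

# roughly what rowwise() + mutate() does: one distGeo() call per row
for (i in seq_len(nrow(geo))) {
  geo$dist_km[i] <- distGeo(
    p1 = c(geo$long_pseudonym[i], geo$lat_pseudonym[i]),
    p2 = c(geo$long_pseudonym2[i], geo$lat_pseudonym2[i])
  ) / 1000
}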
Does anyone have an idea how to do this faster? Thanks.
I changed the code to:

geo <- geo %>%
  mutate(dist_km = distGeo(
    # cbind() builds two-column (lon, lat) matrices, so distGeo()
    # computes all distances in a single vectorized call
    p1 = cbind(long_pseudonym, lat_pseudonym),
    p2 = cbind(long_pseudonym2, lat_pseudonym2)) / 1000
  )
and it works very well: for 240 million rows it took less than 8 minutes. The trick is to pass two-column matrices built with cbind() to distGeo(), so the whole computation happens in one vectorized call instead of one call per row.
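For completeness, the same vectorized call also works without dplyr; a minimal base-R sketch of the same idea:

library(geosphere)

# distGeo() accepts two-column (lon, lat) matrices and returns
# one distance in metres per row, hence the division by 1000
geo$dist_km <- distGeo(
  p1 = cbind(geo$long_pseudonym, geo$lat_pseudonym),
  p2 = cbind(geo$long_pseudonym2, geo$lat_pseudonym2)
) / 1000

Either way, the speedup comes from replacing 240 million R-level function calls with a single call over whole columns.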