performancedplyrgeospatialgeosphere

Calculating distance with geosphere::distGeo() within dyplr::mutate()


There is a data frame of about 240e6 rows with combinations of coordinate points between pseudonym and pseudonym2 (long, lat). It looks like the following:

geo <- data.frame(
  pseudonym = c("aa", "bb", "cc"),
  long_pseudonym = c(9.37, 7.37, 9.86),
  lat_pseudonym = c(52.22, 51.58, 53.48),
  pseudonym2 = c("ee", "ff", "gg"),
  long_pseudonym2 = c(9.69, 8.37, 9.24),
  lat_pseudonym2 = c(51.22, 53.58, 50.48),
  dist_km = NA_real_
)


geo

   pseudonym  long_pseudonym  lat_pseudonym  pseudonym2  long_pseudonym2  lat_pseudonym2  dist_km
1  aa         9.37            52.22          ee          9.69             51.22           NA
2  bb         7.37            51.58          ff          8.37             53.58           NA
3  cc         9.86            53.48          gg          9.24             50.48           NA

My first idea was to caluclate the column dist_km within a dplyr::mutate() with geosphere::distGeo()

geo <- geo %>%
  rowwise() %>%
  mutate(dist_km = distGEO(
    p1 = c(long_pseudonym, lat_pseudonym),
    p2 = c(long_pseudonym2, lat_pseudonym2)) / 1000
  ) %>%
  ungroup()


geo

   pseudonym  long_pseudonym  lat_pseudonym  pseudonym2  long_pseudonym2  lat_pseudonym2  dist_km
1  aa         9.37            52.22          ee          9.69             51.22           113.
2  bb         7.37            51.58          ff          8.37             53.58           233.
3  cc         9.86            53.48          gg          9.24             50.48           337.

So far, so good.

For small tables this approach is working but for big tables (240 million rows) it is very time consuming.

Does anyone has an idea to do it better? Thanks.


Solution

  • I changed the code to

    geo <- geo %>%
      mutate(dist_km = distGEO(
        p1 = cbind(long_pseudonym, lat_pseudonym),
        p2 = cbind(long_pseudonym2, lat_pseudonym2)) / 1000
      ) 
    

    and it works very well. For 240 million rows it took less than 8 minutes. The trick is the cbind command before distGeo.