Tags: r, dplyr, data-transform

Speeding up or alternatives to group_by() and lag() in dplyr


I notice this operation is very time-consuming for a seemingly simple calculation. It probably accounts for more than 60% of the total runtime of the current R script.

The actual data contains about 500,000 rows with about 100,000 unique ids (the column used for the group by).

The column used for calculating the lag difference is a date-time parsed with lubridate::dmy_hms() (i.e. POSIXct), so the lag difference is a time difference in seconds.

I think these factors may play a role in why it takes so long to compute, but I'm also curious whether I can rewrite the following code differently, or perhaps just use base R if it's much faster (maybe at the expense of some data-type gymnastics to keep other dplyr-like operations intact):

library(dplyr)

dief <- tibble(id = rep(letters, 4),
               time = 1:length(id) + sample(1:30, length(id), replace = TRUE))
dief %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(time_difference = time - lag(time))
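For reference, the base-R route alluded to above can be sketched with order() plus ave() and diff(). This is a minimal sketch of the same grouped lag difference, not a claim that it is faster at 100,000 groups:

```r
# Base-R equivalent of the grouped lag difference above:
# sort by id, then compute within-group first differences.
set.seed(1)
dief <- data.frame(id = rep(letters, 4),
                   time = 1:104 + sample(1:30, 104, replace = TRUE))

dief <- dief[order(dief$id), ]
# c(NA, diff(x)) reproduces time - lag(time) within each id
dief$time_difference <- ave(dief$time, dief$id,
                            FUN = function(x) c(NA, diff(x)))

head(dief)
```

The result is a plain data.frame, so any downstream dplyr-style operations still work unchanged.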

Solution

  • Two options are dtplyr and tidytable; personally I use the latter. Both are backed by data.table but use dplyr-style syntax.

    library(dplyr)     # for tibble()
    library(tidytable) # dotted verbs backed by data.table

    dief <- tibble(id = rep(letters, 4),
                   time = 1:length(id) + sample(1:30, length(id), replace = TRUE))
    dief %>%
      arrange.(id) %>%
      mutate.(time_difference = time - lags.(time),  # lags. is tidytable's lag()
              .by = id)
    

    Here are some speed comparisons.
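The comparison results themselves aren't reproduced here, but a harness for running one might look like the following. The bench, dtplyr, and data.table usages are my own assumptions for illustration, not part of the original answer; scale n_ids toward 100,000 to mimic the real data:

```r
# Hypothetical benchmark harness: dplyr vs dtplyr vs native data.table.
library(bench)
library(dplyr)
library(dtplyr)
library(data.table)

set.seed(1)
n_ids <- 1000                # illustrative size; raise toward 100,000
n     <- 5 * n_ids
dief  <- tibble(id   = sample(sprintf("id%05d", 1:n_ids), n, replace = TRUE),
                time = cumsum(sample(1:30, n, replace = TRUE)))
dt <- as.data.table(dief)

res <- bench::mark(
  dplyr = dief %>%
    arrange(id) %>%
    group_by(id) %>%
    mutate(time_difference = time - lag(time)) %>%
    ungroup(),
  dtplyr = dief %>%
    lazy_dt() %>%                # dtplyr translates the pipeline to data.table
    arrange(id) %>%
    group_by(id) %>%
    mutate(time_difference = time - lag(time)) %>%
    as_tibble(),
  data.table = dt[order(id)][, time_difference := time - shift(time), by = id][],
  check = FALSE                  # outputs differ in class, not in values
)
res
```

bench::mark reports median times and memory allocations per expression, which makes the grouping overhead of each backend directly comparable.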