I notice this operation is very time consuming for a seemingly simple calculation. It probably accounts for more than 60% of the total time needed to run the current R script.
The actual data contains about 500,000 rows with about 100,000 unique ids (the column used for grouping). The column used for calculating the lag difference is parsed with lubridate::dmy_hms, so the lag difference is a time difference in seconds.
I think these factors may play a role in why it takes so long to compute, but I'm also curious whether I can rewrite the following code differently, or perhaps just use base R if it's much faster (maybe at the expense of some data type gymnastics to keep other dplyr-like operations intact).
library(dplyr)

# small reproducible example: 26 ids repeated 4 times
dief <- tibble(id = rep(letters, 4), time = 1:length(id) + sample(1:30, length(id), replace = TRUE))

dief %>%
  arrange(id) %>%
  group_by(id) %>%
  mutate(time_difference = time - lag(time))
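For reference, a closer stand-in for the real data, with a timestamp column parsed by lubridate::dmy_hms, might look something like the sketch below (the values are made up). Subtracting a lagged POSIXct column gives a difftime, and as.numeric(..., units = "secs") turns it into plain seconds.

library(dplyr)
library(lubridate)

# made-up timestamps standing in for the real dmy_hms column
dief_ts <- tibble(
  id   = rep(letters, 4),
  time = dmy_hms("01-01-2023 00:00:00") + seconds(cumsum(sample(1:30, 104, replace = TRUE)))
)

dief_ts %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(time_difference = as.numeric(time - lag(time), units = "secs"))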
Two options are dtplyr and tidytable. Personally I use the latter. Both leverage data.table under the hood but use dplyr-style syntax.
library(dplyr)
library(tidytable)

dief <- tibble(id = rep(letters, 4), time = 1:length(id) + sample(1:30, length(id), replace = TRUE))

dief %>%
  arrange.(id) %>%
  mutate.(time_difference = time - lags.(time),
          .by = id)
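For completeness, a rough dtplyr equivalent could look like the following sketch: lazy_dt() builds up the data.table translation and as_tibble() (or collect()) triggers the actual computation.

library(dplyr)
library(dtplyr)

dief %>%
  lazy_dt() %>%                 # build the data.table translation lazily
  arrange(id) %>%
  group_by(id) %>%
  mutate(time_difference = time - lag(time)) %>%
  as_tibble()                   # execute and return a tibble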
Here are some speed comparisons.
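One way to run such a comparison yourself, on data of roughly the size described in the question, is bench::mark(); the data below is synthetic and the timings will depend on your machine.

library(bench)
library(dplyr)
library(tidytable)

# synthetic data of roughly the stated size: 500,000 rows, ~100,000 ids
n  <- 500000
df <- tibble(id = sample(1:100000, n, replace = TRUE),
             time = cumsum(sample(1:30, n, replace = TRUE)))

bench::mark(
  dplyr = df %>%
    dplyr::arrange(id) %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(time_difference = time - dplyr::lag(time)) %>%
    dplyr::ungroup(),
  tidytable = df %>%
    arrange.(id) %>%
    mutate.(time_difference = time - lags.(time), .by = id),
  check = FALSE  # outputs differ in class (grouped tibble vs tidytable)
)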