rdatetimedplyr

Time difference in every n/two rows


I have this dataset

library(dplyr)

# creating a dataframe
data_frame <- data.frame(id = c(1,1,2,2,3,3),
                         col2 = c("start", "finish", "start", "finish","start", "finish"),
                         col3 =  c(as.POSIXct("2021-05-08 08:32:07"),
                                   as.POSIXct("2021-05-08 08:32:45"),
                                   as.POSIXct("2020-11-28 23:32:09"),
                                   as.POSIXct("2020-11-28 23:32:25"),
                                   as.POSIXct("2021-05-08 08:32:07"),
                                   as.POSIXct("2021-05-08 08:32:12")))

# computing difference from start to finish
df <- data_frame %>%
  arrange(id) %>%
  mutate(timetaken = col3 - lag(col3))

df

I want

  1. To calculate the time difference from start to finish.
  2. Average time difference

To calculate the time difference I can calculate the lag as show above. However, I want to have output in the following way and calculate the average time of that variable.

> df
  id   col2                col3      timetaken
1  1  start 2021-05-08 08:32:07        NA secs
2  1 finish 2021-05-08 08:32:45        38 secs
3  2  start 2020-11-28 23:32:09        NA secs
4  2 finish 2020-11-28 23:32:25        16 secs
5  3  start 2021-05-08 08:32:07        NA secs
6  3 finish 2021-05-08 08:32:12         5 secs

Any idea how to achieve this?


Solution

  • If you want to use dplyr::lag():

    library(dplyr)
    
    df <- data_frame |>
      mutate(timetaken = col3 - lag(col3), .by = id)
    
    df
    #   id   col2                col3 timetaken
    # 1  1  start 2021-05-08 08:32:07   NA secs
    # 2  1 finish 2021-05-08 08:32:45   38 secs
    # 3  2  start 2020-11-28 23:32:09   NA secs
    # 4  2 finish 2020-11-28 23:32:25   16 secs
    # 5  3  start 2021-05-08 08:32:07   NA secs
    # 6  3 finish 2021-05-08 08:32:12    5 secs
    

    and a dplyr() approach to get the mean (although, as in the previous answer, I find using base R cleaner in this instance):

    summarise(df, mean(timetaken, na.rm = TRUE))
    #   mean(timetaken, na.rm = TRUE)
    # 1                 19.66667 secs