reventstimeidentify

Identify events within a time window in R


I need to identify a series (maximum 3 events) of events that occurred within 60 seconds.

Here there is the IN data

IN<-read.table(header = FALSE, text = "
2018-06-01_04:29:47
2018-06-01_05:44:41
2018-06-01_05:44:43
2018-06-01_05:44:45
2018-06-01_05:57:54
2018-06-01_05:57:56
2018-06-01_05:57:58
2018-06-01_08:10:35
2018-06-01_08:41:20
2018-06-01_08:41:22
2018-06-01_08:41:24
2018-06-01_08:52:01
2018-06-01_09:02:13
2018-06-01_09:22:45", quote="\n",col.names="time")

IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")

and here there is the desired output

OUT<-read.table(header = FALSE, text = "
2018-06-01_04:29:47        1
2018-06-01_05:44:41        1
2018-06-01_05:44:43        2
2018-06-01_05:44:45        3
2018-06-01_05:57:54        1
2018-06-01_05:57:56        2
2018-06-01_05:57:58        3
2018-06-01_08:10:35        1
2018-06-01_08:41:20        1
2018-06-01_08:41:22        2
2018-06-01_08:41:24        3
2018-06-01_08:52:01        1
2018-06-01_09:02:13        1
2018-06-01_09:22:45        1
",quote="\n",col.names=c("time","response"))

I have searched for similar questions, but unsuccessfully. I guess that function diff is the first step for solving this problem,

response<-as.numeric(diff(IN$time)>60)

but than I have no idea how to proceed to get the desired output.

Any helps will be appreciated.


Solution

  • Here's a solution using dplyr, magrittr, and lubridate packages.

    IN<-read.table(header = FALSE, text = "2018-06-01_04:29:47
                   2018-06-01_05:44:41
                   2018-06-01_05:44:43
                   2018-06-01_05:44:45
                   2018-06-01_05:57:54
                   2018-06-01_05:57:56
                   2018-06-01_05:57:58
                   2018-06-01_08:10:35
                   2018-06-01_08:41:20
                   2018-06-01_08:41:22
                   2018-06-01_08:41:24
                   2018-06-01_08:52:01
                   2018-06-01_09:02:13
                   2018-06-01_09:22:45", quote="\n",col.names="time")
    
    IN$time<-as.POSIXct(IN$time, "%Y-%m-%d_%H:%M:%S",tz="")
    

    I've removed the blank first line of the input data frame, as it caused problems. The following function filters the data frame to those elements within 60 seconds before the given ref_time and counts the number of rows using nrow.

    event_count <- function(ref_time){
      IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow
    }
    

    Here, I apply the function in a row-wise fashion, record the counts, and sort according to time. (Probably unnecessary...) The results are piped back in to the input data frame using the compound assignment pipe from magrittr.

    IN %<>% 
      rowwise() %>% 
      mutate(counts = event_count(time)) %>% 
      arrange(time)
    

    Finally, the results.

    # A tibble: 14 x 2
    #    time                counts
    #    <dttm>               <int>
    # 1  2018-06-01 04:29:47      1
    # 2  2018-06-01 05:44:41      1
    # 3  2018-06-01 05:44:43      2
    # 4  2018-06-01 05:44:45      3
    # 5  2018-06-01 05:57:54      1
    # 6  2018-06-01 05:57:56      2
    # 7  2018-06-01 05:57:58      3
    # 8  2018-06-01 08:10:35      1
    # 9  2018-06-01 08:41:20      1
    # 10 2018-06-01 08:41:22      2
    # 11 2018-06-01 08:41:24      3
    # 12 2018-06-01 08:52:01      1
    # 13 2018-06-01 09:02:13      1
    # 14 2018-06-01 09:22:45      1
    

    I think what @PoGibas is alluding to is for some reason there are two entries with the time 2018-06-01 05:57:54 in the input data frame. I'm not sure where the second comes from...


    EDIT: It's the new line in the read table that messes it up.

    EDIT²: This returns a maximum of 3...

    event_count <- function(ref_time){
      min(IN %>% filter(time %within% interval(ref_time - 60, ref_time)) %>% nrow, 3)
    }