rdataframedplyrdata.tablenon-equi-join

Merge 2 dataframes using conditions on "hour" and "min" of df1 in datetimes of df2


I have a dataframe df.sample like this

id <- c("A","A","A","A","A","A","A","A","A","A","A")
date <- c("2018-11-12","2018-11-12","2018-11-12","2018-11-12","2018-11-12",
          "2018-11-12","2018-11-12","2018-11-14","2018-11-14","2018-11-14",
          "2018-11-12")
hour <- c(8,8,9,9,13,13,16,6,7,19,7)
min <- c(47,59,6,18,22,36,12,32,12,21,47)
value <- c(70,70,86,86,86,74,81,77,79,83,91)
df.sample <- data.frame(id,date,hour,min,value,stringsAsFactors = F) 
df.sample$date <- as.Date(df.sample$date,format="%Y-%m-%d")

I have another data frame df.state like this

id <- c("A","A","A")
starttime <- c("2018-11-12 08:59:00","2018-11-14 06:24:17","2018-11-15 09:17:00")
endtime <- c("2018-11-12 15:57:00","2018-11-14 17:22:16","2018-11-15 12:17:32")
state <- c("Pass","Pass","Pass")

df.state <- data.frame(id,starttime,endtime,state,stringsAsFactors = F) 
df.state$starttime <- as.POSIXct(df.state$starttime,format="%Y-%m-%d %H:%M:%S")
df.state$endtime <- as.POSIXct(df.state$endtime,format="%Y-%m-%d %H:%M:%S")

I am trying to merge these 2 data frames based on a condition

if the hour and min in df.sample is within the starttime and endtime of df.state, then merge state = Pass in the df.sample.

For example, the row 2 in df.sample has hour = 8, min = 59 and since it is within the starttime = 2018-11-12 08:59:00 in df.state, the value Pass is added

Here is my desired output

   id       date hour min value state
    A 2018-11-12    8  47    70      
    A 2018-11-12    8  59    70  Pass
    A 2018-11-12    9   6    86  Pass
    A 2018-11-12    9  18    86  Pass
    A 2018-11-12   13  22    86  Pass
    A 2018-11-12   13  36    74  Pass
    A 2018-11-12   16  12    81      
    A 2018-11-14    6  32    77  Pass
    A 2018-11-14    7  12    79  Pass
    A 2018-11-14   19  21    83      
    A 2018-11-12    7  47    91      

I am able to merge these 2 dataframes like this but not able to look up hour and min of df.sample in the starttime and endtime of df.state

library(tidyverse)
df.sample <- df.sample %>%
  left_join(df.state)

Can someone point me in the right direction


Solution

  • Using non-equi join from data.table package is much faster and easier if you happen to have big data frames: Benchmark | Video

    library(data.table)
    
    ## convert both data.frames to data.tables by reference
    setDT(df.sample)
    setDT(df.state) 
    
    ## create a `time` column in df.sample 
    df.sample[, time := as.POSIXct(paste0(date, " ", hour, ":", min, ":00"))]
    ## change column order
    setcolorder(df.sample, c("id", "time"))
    
    # join by id and time within start & end time limits
    # "x." is used so we can refer to the column in other data.table explicitly
    df.state[df.sample, .(id, time, date, hour, min, value, state = x.state), 
             on = .(id, starttime <= time, endtime >= time)]
    #>     id                time       date hour min value state
    #>  1:  A 2018-11-12 08:47:00 2018-11-12    8  47    70  <NA>
    #>  2:  A 2018-11-12 08:59:00 2018-11-12    8  59    70  Pass
    #>  3:  A 2018-11-12 09:06:00 2018-11-12    9   6    86  Pass
    #>  4:  A 2018-11-12 09:18:00 2018-11-12    9  18    86  Pass
    #>  5:  A 2018-11-12 13:22:00 2018-11-12   13  22    86  Pass
    #>  6:  A 2018-11-12 13:36:00 2018-11-12   13  36    74  Pass
    #>  7:  A 2018-11-12 16:12:00 2018-11-12   16  12    81  <NA>
    #>  8:  A 2018-11-14 06:32:00 2018-11-14    6  32    77  Pass
    #>  9:  A 2018-11-14 07:12:00 2018-11-14    7  12    79  Pass
    #> 10:  A 2018-11-14 19:21:00 2018-11-14   19  21    83  <NA>
    #> 11:  A 2018-11-12 07:47:00 2018-11-12    7  47    91  <NA>
    
    ### remove NA
    df.state[df.sample, .(id, time, date, hour, min, value, state = x.state), 
             on = .(id, starttime <= time, endtime >= time), nomatch = 0L]
    #>    id                time       date hour min value state
    #> 1:  A 2018-11-12 08:59:00 2018-11-12    8  59    70  Pass
    #> 2:  A 2018-11-12 09:06:00 2018-11-12    9   6    86  Pass
    #> 3:  A 2018-11-12 09:18:00 2018-11-12    9  18    86  Pass
    #> 4:  A 2018-11-12 13:22:00 2018-11-12   13  22    86  Pass
    #> 5:  A 2018-11-12 13:36:00 2018-11-12   13  36    74  Pass
    #> 6:  A 2018-11-14 06:32:00 2018-11-14    6  32    77  Pass
    #> 7:  A 2018-11-14 07:12:00 2018-11-14    7  12    79  Pass
    

    Created on 2019-05-23 by the reprex package (v0.3.0)