rmergedatatableleft-joinfuzzyjoin

How to merge two datasets based on different time interval conditions?


I have two different datasets and would like to join Dataset2 to Dataset1. In Dataset1, there are different CBs in each day, and there are IV 1 to 3 in each CB. In Dataset2, there are different time intervals, for example every 1 hour, 15 min, or 5 minute. I am looking to add conditions in left_join or fuzzy_left_join or any other way to join these two datasets as below,

Sorry I have tried so hard to simplify this but please let me know if I need to explain it better. Thanks

Dataset1:

    Exp            DateTime IV CB Value
 1:   1 2020-01-01 10:00:00  1  1    22
 2:   1 2020-01-01 10:10:00  2  1    21
 3:   1 2020-01-01 10:20:00  3  1    33
 4:   1 2020-01-01 17:00:00  1  8    21
 5:   1 2020-01-01 17:20:00  3  8    78
 6:   1 2020-01-01 17:30:00  1  9    NA
 7:   1 2020-01-02 09:00:00  1  1    54
 8:   1 2020-01-02 09:10:00  2  1    43
 9:   1 2020-01-02 09:20:00  3  1    33
10:   1 2020-01-02 10:01:00  1  3    21
11:   1 2020-01-02 10:11:00  2  3     9
12:   1 2020-01-02 10:21:00  3  3    87
13:   2 2020-04-10 00:00:00  1  1    67
14:   2 2020-04-10 00:10:00  2  1    75
15:   2 2020-04-10 00:20:00  3  1    32
16:   2 2020-04-10 18:00:00  1 11    12
17:   2 2020-04-10 18:10:00  2 11   345
18:   2 2020-04-10 18:20:00  3 11    32
19:   2 2020-04-11 09:00:00  1  2    12
20:   2 2020-04-11 09:10:00  2  2    98
21:   2 2020-04-11 09:20:00  3  2    90
22:   2 2020-04-11 02:01:00  1  7    76
23:   2 2020-04-11 02:11:00  2  7    33
24:   2 2020-04-11 02:21:00  3  7    11  

Dataset2:

             Timestamp Value_y
 1: 2020-01-01 10:01:00      33
 2: 2020-01-01 11:01:00      12
 3: 2020-01-01 12:01:00      56
 4: 2020-01-01 13:01:00      32
 5: 2020-01-01 14:01:00      16
 6: 2020-01-01 15:01:00      57
 7: 2020-01-01 16:01:00      53
 8: 2020-01-01 17:01:00      21
 9: 2020-01-01 18:01:00      NA
10: 2020-01-01 19:01:00      55
11: 2020-01-01 20:00:59      53
12: 2020-01-01 21:00:59      24
13: 2020-01-02 09:00:59      33
14: 2020-01-02 10:00:59      22
15: 2020-01-02 11:00:59      22
16: 2020-01-02 12:00:59      35
17: 2020-01-02 13:00:59      35
18: 2020-01-02 14:00:59      67
19: 2020-01-02 15:00:59      23
20: 2020-01-02 16:00:59      46
21: 2020-01-02 17:00:59      36
22: 2020-01-02 18:00:59      32
23: 2020-04-10 00:00:00       4
24: 2020-04-10 00:15:00       3
25: 2020-04-10 00:29:59       4
26: 2020-04-10 00:44:59       5
27: 2020-04-10 00:59:59       6
28: 2020-04-10 01:14:59     777
29: 2020-04-10 01:29:59     554
30: 2020-04-10 17:49:59      22
31: 2020-04-10 17:54:59      11
32: 2020-04-10 17:59:59      NA
33: 2020-04-10 18:04:59      87
34: 2020-04-10 18:09:59      65
35: 2020-04-10 18:14:59     334
36: 2020-04-10 18:19:59     554
37: 2020-04-10 18:24:59      33
38: 2020-04-10 18:29:59      NA
39: 2020-04-10 18:34:59      NA
40: 2020-04-10 18:39:59      NA
41: 2020-04-10 18:44:59      NA
42: 2020-04-10 18:49:59      NA
43: 2020-04-10 18:54:59      NA
44: 2020-04-11 01:40:17      33
45: 2020-04-11 01:55:17      NA
46: 2020-04-11 02:10:17      NA
47: 2020-04-11 02:25:17      NA
48: 2020-04-11 02:40:17      NA
49: 2020-04-11 02:55:17      33
50: 2020-04-11 03:10:17      21
51: 2020-04-11 03:25:17      34
52: 2020-04-11 03:40:17      67
53: 2020-04-11 03:55:17      54
54: 2020-04-11 04:10:17     668
55: 2020-04-11 04:25:17      44
56: 2020-04-11 08:40:14      22
57: 2020-04-11 08:55:14     345
58: 2020-04-11 09:10:14      NA
59: 2020-04-11 09:25:15     555
60: 2020-04-11 09:40:15      NA
61: 2020-04-11 09:55:15      NA

Note: Just wanted to add that each day devided into several CBs. In each CB, I have three IVs. But some of the IVs could be removed due to outlier removal. For example, CB 9 in 2020-01-01 has only one IV.

Edit: The output of dput(Dataset1) and dput(Dataset2):

Dataset1:

structure(list(Exp = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), DateTime = structure(c(1577872800, 
1577873400, 1577874000, 1577898000, 1577899200, 1577899800, 1577955600, 
1577956200, 1577956800, 1577959260, 1577959860, 1577960460, 1586476800, 
1586477400, 1586478000, 1586541600, 1586542200, 1586542800, 1586595600, 
1586596200, 1586596800, 1586570460, 1586571060, 1586571660), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), IV = c(1, 2, 3, 1, 3, 1, 1, 2, 3, 
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), CB = c(1, 1, 1, 
8, 8, 9, 1, 1, 1, 3, 3, 3, 1, 1, 1, 11, 11, 11, 2, 2, 2, 7, 7, 
7), Value = c(22, 21, 33, 21, 78, NA, 54, 43, 33, 21, 9, 87, 
67, 75, 32, 12, 345, 32, 12, 98, 90, 76, 33, 11)), row.names = c(NA, 
-24L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001ceefb31fd0>)

Dataset2:

structure(list(Timestamp = structure(c(1577872860, 1577876460, 
1577880060, 1577883660, 1577887260, 1577890860, 1577894460, 1577898060, 
1577901660, 1577905260, 1577908859.995, 1577912459.995, 1577955659.995, 
1577959259.995, 1577962859.995, 1577966459.995, 1577970059.995, 
1577973659.995, 1577977259.995, 1577980859.995, 1577984459.995, 
1577988059.995, 1586476800, 1586477700, 1586478599.99, 1586479499.985, 
1586480399.98, 1586481299.975, 1586482199.97, 1586540999.905, 
1586541299.9, 1586541599.895, 1586541899.89, 1586542199.885, 
1586542499.88, 1586542799.875, 1586543099.87, 1586543399.865, 
1586543699.86, 1586543999.855, 1586544299.85, 1586544599.845, 
1586544899.84, 1586569217, 1586570117.005, 1586571017.01, 1586571917.015, 
1586572817.02, 1586573717.025, 1586574617.03, 1586575517.035, 
1586576417.04, 1586577317.045, 1586578217.05, 1586579117.055, 
1586594414.275, 1586595314.615, 1586596214.955, 1586597115.295, 
1586598015.635, 1586598915.975), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Value_y = c(33, 12, 56, 32, 16, 57, 53, 21, 
NA, 55, 53, 24, 33, 22, 22, 35, 35, 67, 23, 46, 36, 32, 4, 3, 
4, 5, 6, 777, 554, 22, 11, NA, 87, 65, 334, 554, 33, NA, NA, 
NA, NA, NA, NA, 33, NA, NA, NA, NA, 33, 21, 34, 67, 54, 668, 
44, 22, 345, NA, 555, NA, NA)), row.names = c(NA, -61L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000001ceefb31fd0>)

Solution

  • Note:

    Code:

    Dataset_merged <- Dataset1 %>%
      rowwise() %>%
      group_by(CB) %>%   
      mutate(
        avg_values = list(
          Dataset2 %>%
            filter(Timestamp >= min(DateTime) & Timestamp < (max(DateTime))) %>%
            summarise(across(everything(), ~ mean(.x, na.rm = TRUE))))) %>%
      unnest_wider(avg_values) %>%  
      ungroup() %>%
      data.table()