I have two different datasets and would like to join Dataset2
to Dataset1
.
In Dataset1
, there are different CB
s in each day, and there are IV
1 to 3 in each CB
.
In Dataset2
, there are different time intervals, for example every 1 hour, 15 min, or 5 minute.
I am looking to add conditions in left_join
or fuzzy_left_join
or any other way to join these two datasets as below,
Dataset2
is 1 hour Then add Value_y
to each row of related CB
in Dataset1
as is.Dataset2
is 15min or 5min, then based on the duration of each CB
add the average of Value_y
. So, if the duration of CB1
in 2020-01-01
is 20 minutes, then add average of Value_y
from those 5-minute Value_y
which are fallen into the CB1
time range.Sorry I have tried so hard to simplify this but please let me know if I need to explain it better. Thanks
Dataset1:
Exp DateTime IV CB Value
1: 1 2020-01-01 10:00:00 1 1 22
2: 1 2020-01-01 10:10:00 2 1 21
3: 1 2020-01-01 10:20:00 3 1 33
4: 1 2020-01-01 17:00:00 1 8 21
5: 1 2020-01-01 17:20:00 3 8 78
6: 1 2020-01-01 17:30:00 1 9 NA
7: 1 2020-01-02 09:00:00 1 1 54
8: 1 2020-01-02 09:10:00 2 1 43
9: 1 2020-01-02 09:20:00 3 1 33
10: 1 2020-01-02 10:01:00 1 3 21
11: 1 2020-01-02 10:11:00 2 3 9
12: 1 2020-01-02 10:21:00 3 3 87
13: 2 2020-04-10 00:00:00 1 1 67
14: 2 2020-04-10 00:10:00 2 1 75
15: 2 2020-04-10 00:20:00 3 1 32
16: 2 2020-04-10 18:00:00 1 11 12
17: 2 2020-04-10 18:10:00 2 11 345
18: 2 2020-04-10 18:20:00 3 11 32
19: 2 2020-04-11 09:00:00 1 2 12
20: 2 2020-04-11 09:10:00 2 2 98
21: 2 2020-04-11 09:20:00 3 2 90
22: 2 2020-04-11 02:01:00 1 7 76
23: 2 2020-04-11 02:11:00 2 7 33
24: 2 2020-04-11 02:21:00 3 7 11
Dataset2:
Timestamp Value_y
1: 2020-01-01 10:01:00 33
2: 2020-01-01 11:01:00 12
3: 2020-01-01 12:01:00 56
4: 2020-01-01 13:01:00 32
5: 2020-01-01 14:01:00 16
6: 2020-01-01 15:01:00 57
7: 2020-01-01 16:01:00 53
8: 2020-01-01 17:01:00 21
9: 2020-01-01 18:01:00 NA
10: 2020-01-01 19:01:00 55
11: 2020-01-01 20:00:59 53
12: 2020-01-01 21:00:59 24
13: 2020-01-02 09:00:59 33
14: 2020-01-02 10:00:59 22
15: 2020-01-02 11:00:59 22
16: 2020-01-02 12:00:59 35
17: 2020-01-02 13:00:59 35
18: 2020-01-02 14:00:59 67
19: 2020-01-02 15:00:59 23
20: 2020-01-02 16:00:59 46
21: 2020-01-02 17:00:59 36
22: 2020-01-02 18:00:59 32
23: 2020-04-10 00:00:00 4
24: 2020-04-10 00:15:00 3
25: 2020-04-10 00:29:59 4
26: 2020-04-10 00:44:59 5
27: 2020-04-10 00:59:59 6
28: 2020-04-10 01:14:59 777
29: 2020-04-10 01:29:59 554
30: 2020-04-10 17:49:59 22
31: 2020-04-10 17:54:59 11
32: 2020-04-10 17:59:59 NA
33: 2020-04-10 18:04:59 87
34: 2020-04-10 18:09:59 65
35: 2020-04-10 18:14:59 334
36: 2020-04-10 18:19:59 554
37: 2020-04-10 18:24:59 33
38: 2020-04-10 18:29:59 NA
39: 2020-04-10 18:34:59 NA
40: 2020-04-10 18:39:59 NA
41: 2020-04-10 18:44:59 NA
42: 2020-04-10 18:49:59 NA
43: 2020-04-10 18:54:59 NA
44: 2020-04-11 01:40:17 33
45: 2020-04-11 01:55:17 NA
46: 2020-04-11 02:10:17 NA
47: 2020-04-11 02:25:17 NA
48: 2020-04-11 02:40:17 NA
49: 2020-04-11 02:55:17 33
50: 2020-04-11 03:10:17 21
51: 2020-04-11 03:25:17 34
52: 2020-04-11 03:40:17 67
53: 2020-04-11 03:55:17 54
54: 2020-04-11 04:10:17 668
55: 2020-04-11 04:25:17 44
56: 2020-04-11 08:40:14 22
57: 2020-04-11 08:55:14 345
58: 2020-04-11 09:10:14 NA
59: 2020-04-11 09:25:15 555
60: 2020-04-11 09:40:15 NA
61: 2020-04-11 09:55:15 NA
Note:
Just wanted to add that each day devided into several CB
s. In each CB
, I have three IV
s. But some of the IV
s could be removed due to outlier removal. For example, CB
9
in 2020-01-01
has only one IV
.
Edit:
The output of dput(Dataset1)
and dput(Dataset2)
:
Dataset1:
structure(list(Exp = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), DateTime = structure(c(1577872800,
1577873400, 1577874000, 1577898000, 1577899200, 1577899800, 1577955600,
1577956200, 1577956800, 1577959260, 1577959860, 1577960460, 1586476800,
1586477400, 1586478000, 1586541600, 1586542200, 1586542800, 1586595600,
1586596200, 1586596800, 1586570460, 1586571060, 1586571660), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), IV = c(1, 2, 3, 1, 3, 1, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3), CB = c(1, 1, 1,
8, 8, 9, 1, 1, 1, 3, 3, 3, 1, 1, 1, 11, 11, 11, 2, 2, 2, 7, 7,
7), Value = c(22, 21, 33, 21, 78, NA, 54, 43, 33, 21, 9, 87,
67, 75, 32, 12, 345, 32, 12, 98, 90, 76, 33, 11)), row.names = c(NA,
-24L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001ceefb31fd0>)
Dataset2:
structure(list(Timestamp = structure(c(1577872860, 1577876460,
1577880060, 1577883660, 1577887260, 1577890860, 1577894460, 1577898060,
1577901660, 1577905260, 1577908859.995, 1577912459.995, 1577955659.995,
1577959259.995, 1577962859.995, 1577966459.995, 1577970059.995,
1577973659.995, 1577977259.995, 1577980859.995, 1577984459.995,
1577988059.995, 1586476800, 1586477700, 1586478599.99, 1586479499.985,
1586480399.98, 1586481299.975, 1586482199.97, 1586540999.905,
1586541299.9, 1586541599.895, 1586541899.89, 1586542199.885,
1586542499.88, 1586542799.875, 1586543099.87, 1586543399.865,
1586543699.86, 1586543999.855, 1586544299.85, 1586544599.845,
1586544899.84, 1586569217, 1586570117.005, 1586571017.01, 1586571917.015,
1586572817.02, 1586573717.025, 1586574617.03, 1586575517.035,
1586576417.04, 1586577317.045, 1586578217.05, 1586579117.055,
1586594414.275, 1586595314.615, 1586596214.955, 1586597115.295,
1586598015.635, 1586598915.975), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Value_y = c(33, 12, 56, 32, 16, 57, 53, 21,
NA, 55, 53, 24, 33, 22, 22, 35, 35, 67, 23, 46, 36, 32, 4, 3,
4, 5, 6, 777, 554, 22, 11, NA, 87, 65, 334, 554, 33, NA, NA,
NA, NA, NA, NA, 33, NA, NA, NA, NA, 33, 21, 34, 67, 54, 668,
44, 22, 345, NA, 555, NA, NA)), row.names = c(NA, -61L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000001ceefb31fd0>)
Note:
Dataset1
by CB
, then filtered the datasets by a condition; with this condition, I can find the rows of Dataset2
that are in the range of start (min()
) and end time (max()
) of each CB
in Dataset2
. Therefore, if the interval in Dataset2
is 1 hour, I will have a mean
of only one row, and in a case the interval is 15 minutes then I will have a mean
of 4 or 5 rows.Dataset1
by other columns as well as CB
, which can be added into group_by()
.Dataset2
which I used list(avg-values)
, and then did unnest
them, but this can be easily modified for only one column, like Value_y
of Dataset2
in the example provided for the question.Code:
Dataset_merged <- Dataset1 %>%
rowwise() %>%
group_by(CB) %>%
mutate(
avg_values = list(
Dataset2 %>%
filter(Timestamp >= min(DateTime) & Timestamp < (max(DateTime))) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE))))) %>%
unnest_wider(avg_values) %>%
ungroup() %>%
data.table()