rmergeleft-joinfuzzy-search

R: fuzyyjoin, or other, to merge accelerometer and magneterometer data


I have two datasets, one containing accelerometer data and another magnetometer data. I want to merge both datasets but the values are not measured at the same milliseconds:

ACC data:

head(ACCL_20220217_01_)
                 time  ms ms_to_on_acc x_acc y_acc z_acc
1 2022-02-17 01:13:20   0    551185497 12800  4480  1600
2 2022-02-17 01:13:20  20    551185517 12096  3200  2048
3 2022-02-17 01:13:20  41    551185538 11776  3008  1856
4 2022-02-17 01:13:20  61    551185558 15872  5952  1408
5 2022-02-17 01:13:20  81    551185578 17216  9856  1664
6 2022-02-17 01:13:20 102    551185599 20096 14464  2176

MAG data

head(MAG_20220217_01_)
                 time  ms ms_to_on_mag x_mag y_mag z_mag
1 2022-02-17 01:13:20   0    551175445  -305   735    24
2 2022-02-17 01:13:20  96    551175541  -308   765    63
3 2022-02-17 01:13:20 192    551175637  -301   752    24
4 2022-02-17 01:13:20 289    551175734  -303   762    27
5 2022-02-17 01:13:20 385    551175830  -300   746    49
6 2022-02-17 01:13:20 481    551175926  -317   738    90

What I want is to merge both datasets by the closest ms value, to have both measurements at the highest temporal matching as possible. So, my output dataset would look like this:

    time                ms_acc  x_acc   y_acc   z_acc   ms_mag  x_mag   y_mag   z_mag
1   2022-02-17 01:13    0       12800   4480    1600    0       -305    735     24
2   2022-02-17 01:13    20      12096   3200    2048    0       -305    735     24
3   2022-02-17 01:13    41      11776   3008    1856    0       -305    735     24
4   2022-02-17 01:13    61      15872   5952    1408    96      -308    765     63
5   2022-02-17 01:13    81      17216   9856    1664    96      -308    765     63
6   2022-02-17 01:13    102     20096   14464   2176    96      -308    765     63

And so on.

I'm aware of the fuzzyjoin functions, and I also followed some other examples I found in stackoverflow, but none seem to work properly:

1)

 fuzzy_left_join(
  MAG_20220217_01_, ACCL_20220217_01_,
  by = c(
    "time" = "time",
    "ms" = "ms"), match_fun = list(`==`, `>=`, `<=`))

Gives this error, which seems to be connected to the length of the datasets: Error: Length of match_fun not equal to columns specified in 'by'.

2)

temp <- difference_left_join(ACCL_20220217_01_, MAG_20220217_01_, by= c("time", "ms"), max_dist = 100)

This is quite close but still provides a different output from what I was expecting; as it duplicates the rows:

              time.x ms.x ms_to_on_acc x_acc y_acc z_acc              time.y ms.y ms_to_on_mag
1 2022-02-17 01:13:20    0    551185497 12800  4480  1600 2022-02-17 01:13:20    0    551175445
2 2022-02-17 01:13:20    0    551185497 12800  4480  1600 2022-02-17 01:13:20   96    551175541
3 2022-02-17 01:13:20   20    551185517 12096  3200  2048 2022-02-17 01:13:20    0    551175445
4 2022-02-17 01:13:20   20    551185517 12096  3200  2048 2022-02-17 01:13:20   96    551175541
5 2022-02-17 01:13:20   41    551185538 11776  3008  1856 2022-02-17 01:13:20    0    551175445
6 2022-02-17 01:13:20   41    551185538 11776  3008  1856 2022-02-17 01:13:20   96    551175541
  x_mag y_mag z_mag
1  -305   735    24
2  -308   765    63
3  -305   735    24
4  -308   765    63
5  -305   735    24 
6  -308   765    63

Any other solutions? Thanks in advance!

Here's the datasets from above:

dput(ACCL_20220217_01_)
structure(list(time = structure(c(1645056800, 1645056800.02, 
1645056800.041, 1645056800.061, 1645056800.081, 1645056800.102, 
1645056800.122, 1645056800.143, 1645056800.163, 1645056800.184, 
1645056800.204, 1645056800.225, 1645056800.245, 1645056800.265, 
1645056800.286, 1645056800.306, 1645056800.327, 1645056800.347, 
1645056800.368, 1645056800.388, 1645056800.408, 1645056800.429, 
1645056800.449, 1645056800.47, 1645056800.49, 1645056800.511, 
1645056800.531, 1645056800.552, 1645056800.572, 1645056800.592, 
1645056800.613, 1645056800.633, 1645056800.654, 1645056800.674, 
1645056800.695, 1645056800.715, 1645056800.736, 1645056800.756, 
1645056800.776, 1645056800.797, 1645056800.817, 1645056800.838, 
1645056800.858, 1645056800.879, 1645056800.899, 1645056800.92, 
1645056800.94, 1645056800.96, 1645056800.981, 1645056801.001, 
1645056801.022, 1645056801.042, 1645056801.063, 1645056801.083, 
1645056801.104, 1645056801.124, 1645056801.144, 1645056801.165, 
1645056801.185, 1645056801.206, 1645056801.226, 1645056801.247, 
1645056801.267, 1645056801.288, 1645056801.308, 1645056801.328, 
1645056801.349, 1645056801.369, 1645056801.39, 1645056801.41, 
1645056801.431, 1645056801.451, 1645056801.472, 1645056801.492, 
1645056801.512, 1645056801.533, 1645056801.553, 1645056801.574, 
1645056801.594, 1645056801.615, 1645056801.635, 1645056801.656, 
1645056801.676, 1645056801.696, 1645056801.717, 1645056801.737, 
1645056801.758, 1645056801.778, 1645056801.799, 1645056801.819, 
1645056801.84, 1645056801.86, 1645056801.88, 1645056801.901, 
1645056801.921, 1645056801.942, 1645056801.962, 1645056801.983, 
1645056802.003, 1645056802.024, 1645056802.044, 1645056802.064, 
1645056802.085, 1645056802.105, 1645056802.126, 1645056802.146, 
1645056802.167, 1645056802.187, 1645056802.207, 1645056802.228, 
1645056802.248, 1645056802.269, 1645056802.289, 1645056802.31, 
1645056802.33, 1645056802.351, 1645056802.371, 1645056802.391, 
1645056802.412, 1645056802.432, 1645056802.453, 1645056802.473, 
1645056802.494, 1645056802.514, 1645056802.534, 1645056802.555, 
1645056802.575, 1645056802.596, 1645056802.616, 1645056802.637, 
1645056802.657, 1645056802.677, 1645056802.698, 1645056802.718, 
1645056802.739, 1645056802.759, 1645056802.78, 1645056802.8, 
1645056802.821, 1645056802.841, 1645056802.861, 1645056802.882, 
1645056802.902, 1645056802.923, 1645056802.943, 1645056802.964, 
1645056802.984, 1645056803.004, 1645056803.025, 1645056803.045, 
1645056803.066, 1645056803.086, 1645056803.107, 1645056803.127, 
1645056803.148, 1645056803.168, 1645056803.188, 1645056803.209, 
1645056803.229, 1645056803.25, 1645056803.27, 1645056803.291, 
1645056803.311, 1645056803.332, 1645056803.352, 1645056803.372, 
1645056803.393, 1645056803.413, 1645056803.434, 1645056803.454, 
1645056803.475, 1645056803.495, 1645056803.516, 1645056803.536, 
1645056803.556, 1645056803.577, 1645056803.597, 1645056803.618, 
1645056803.638, 1645056803.659, 1645056803.679, 1645056803.7, 
1645056803.72, 1645056803.74, 1645056803.761, 1645056803.781, 
1645056803.802, 1645056803.822, 1645056803.843, 1645056803.863, 
1645056803.884, 1645056803.904, 1645056803.924, 1645056803.945, 
1645056803.965, 1645056803.986), class = c("POSIXct", "POSIXt"
), tzone = "Europe/Paris"), ms = c(0, 20, 41, 61, 81, 102, 122, 
143, 163, 184, 204, 225, 245, 265, 286, 306, 327, 347, 368, 388, 
408, 429, 449, 470, 490, 511, 531, 552, 572, 592, 613, 633, 654, 
674, 695, 715, 736, 756, 776, 797, 817, 838, 858, 879, 899, 920, 
940, 960, 981, 1001, 1022, 1042, 1063, 1083, 1104, 1124, 1144, 
1165, 1185, 1206, 1226, 1247, 1267, 1288, 1308, 1328, 1349, 1369, 
1390, 1410, 1431, 1451, 1472, 1492, 1512, 1533, 1553, 1574, 1594, 
1615, 1635, 1656, 1676, 1696, 1717, 1737, 1758, 1778, 1799, 1819, 
1840, 1860, 1880, 1901, 1921, 1942, 1962, 1983, 2003, 2024, 2044, 
2064, 2085, 2105, 2126, 2146, 2167, 2187, 2207, 2228, 2248, 2269, 
2289, 2310, 2330, 2351, 2371, 2391, 2412, 2432, 2453, 2473, 2494, 
2514, 2534, 2555, 2575, 2596, 2616, 2637, 2657, 2677, 2698, 2718, 
2739, 2759, 2780, 2800, 2821, 2841, 2861, 2882, 2902, 2923, 2943, 
2964, 2984, 3004, 3025, 3045, 3066, 3086, 3107, 3127, 3148, 3168, 
3188, 3209, 3229, 3250, 3270, 3291, 3311, 3332, 3352, 3372, 3393, 
3413, 3434, 3454, 3475, 3495, 3516, 3536, 3556, 3577, 3597, 3618, 
3638, 3659, 3679, 3700, 3720, 3740, 3761, 3781, 3802, 3822, 3843, 
3863, 3884, 3904, 3924, 3945, 3965, 3986), ms_to_on_acc = c(551185497, 
551185517, 551185538, 551185558, 551185578, 551185599, 551185619, 
551185640, 551185660, 551185681, 551185701, 551185722, 551185742, 
551185762, 551185783, 551185803, 551185824, 551185844, 551185865, 
551185885, 551185905, 551185926, 551185946, 551185967, 551185987, 
551186008, 551186028, 551186049, 551186069, 551186089, 551186110, 
551186130, 551186151, 551186171, 551186192, 551186212, 551186233, 
551186253, 551186273, 551186294, 551186314, 551186335, 551186355, 
551186376, 551186396, 551186417, 551186437, 551186457, 551186478, 
551186498, 551186519, 551186539, 551186560, 551186580, 551186601, 
551186621, 551186641, 551186662, 551186682, 551186703, 551186723, 
551186744, 551186764, 551186785, 551186805, 551186825, 551186846, 
551186866, 551186887, 551186907, 551186928, 551186948, 551186969, 
551186989, 551187009, 551187030, 551187050, 551187071, 551187091, 
551187112, 551187132, 551187153, 551187173, 551187193, 551187214, 
551187234, 551187255, 551187275, 551187296, 551187316, 551187337, 
551187357, 551187377, 551187398, 551187418, 551187439, 551187459, 
551187480, 551187500, 551187521, 551187541, 551187561, 551187582, 
551187602, 551187623, 551187643, 551187664, 551187684, 551187704, 
551187725, 551187745, 551187766, 551187786, 551187807, 551187827, 
551187848, 551187868, 551187888, 551187909, 551187929, 551187950, 
551187970, 551187991, 551188011, 551188031, 551188052, 551188072, 
551188093, 551188113, 551188134, 551188154, 551188174, 551188195, 
551188215, 551188236, 551188256, 551188277, 551188297, 551188318, 
551188338, 551188358, 551188379, 551188399, 551188420, 551188440, 
551188461, 551188481, 551188501, 551188522, 551188542, 551188563, 
551188583, 551188604, 551188624, 551188645, 551188665, 551188685, 
551188706, 551188726, 551188747, 551188767, 551188788, 551188808, 
551188829, 551188849, 551188869, 551188890, 551188910, 551188931, 
551188951, 551188972, 551188992, 551189013, 551189033, 551189053, 
551189074, 551189094, 551189115, 551189135, 551189156, 551189176, 
551189197, 551189217, 551189237, 551189258, 551189278, 551189299, 
551189319, 551189340, 551189360, 551189381, 551189401, 551189421, 
551189442, 551189462, 551189483), x_acc = c(12800, 12096, 11776, 
15872, 17216, 20096, 24000, 22528, 21824, 22144, 18560, 15872, 
14784, 13952, 13440, 13312, 12480, 11648, 12032, 12672, 12544, 
12992, 11968, 11648, 10688, 12544, 14592, 17088, 22016, 24896, 
25408, 25408, 20864, 17920, 15744, 14400, 13504, 13376, 13312, 
12928, 13248, 13696, 14400, 13760, 12096, 11136, 11968, 11008, 
12672, 15232, 16384, 20928, 24192, 23232, 22464, 20352, 17664, 
15808, 14784, 14016, 13120, 12416, 11392, 11136, 12224, 13184, 
12416, 12032, 11840, 12160, 11264, 14400, 16512, 19328, 24768, 
25600, 22592, 23872, 20864, 17664, 15232, 13632, 13184, 12608, 
12608, 12544, 12672, 13568, 14272, 14272, 13504, 11904, 12160, 
12480, 12160, 15040, 16960, 21120, 24576, 22592, 20224, 20544, 
18112, 16192, 14784, 13760, 13248, 13312, 12544, 11968, 12224, 
13120, 12800, 13184, 12416, 12288, 12160, 10560, 14400, 16384, 
19392, 23936, 23680, 24320, 23936, 19584, 17024, 14848, 13120, 
12160, 11904, 12224, 12288, 12864, 13952, 14592, 13760, 12608, 
11776, 12864, 12032, 14464, 17216, 18304, 21120, 23616, 22272, 
21440, 19520, 17024, 15552, 14592, 13440, 13184, 12864, 12032, 
12224, 12224, 13056, 13312, 13312, 12608, 12544, 11520, 15552, 
17088, 19648, 22912, 21056, 22656, 22144, 18368, 16064, 14400, 
13120, 12352, 12672, 12736, 12864, 13248, 14016, 14208, 13760, 
11584, 11264, 12544, 12992, 15616, 16512, 18240, 22272, 24448, 
23424, 22720, 19264, 16384), y_acc = c(4480, 3200, 3008, 5952, 
9856, 14464, 15552, 12288, 10176, 7872, 4672, 3008, 3072, 3712, 
4608, 5760, 6464, 6656, 6784, 6016, 4928, 3968, 2944, 3520, 4224, 
5376, 6144, 5568, 6656, 8448, 8064, 6336, 5184, 5888, 6912, 7744, 
6912, 5632, 4544, 4032, 3776, 4096, 5120, 5440, 5312, 4544, 3456, 
1984, 2944, 7488, 12480, 15936, 14336, 11968, 9984, 7552, 4544, 
3264, 3584, 4160, 5056, 5952, 6720, 7104, 6848, 5760, 4288, 3520, 
2944, 3584, 4352, 5376, 6144, 5952, 7424, 7936, 6400, 5312, 5120, 
5696, 6144, 6720, 6528, 5440, 4544, 3840, 3520, 3968, 5120, 5888, 
6016, 5760, 4736, 3328, 2496, 5504, 9664, 14784, 15936, 12928, 
10048, 7936, 5568, 3968, 3264, 3264, 3968, 4672, 5312, 6080, 
6848, 6848, 5824, 4928, 3968, 3648, 4224, 5056, 6336, 6912, 6464, 
7680, 8064, 7232, 6080, 5376, 5632, 6144, 6336, 5568, 4608, 4096, 
3776, 3904, 4352, 5696, 6080, 6080, 5632, 4416, 2496, 3200, 7424, 
12160, 16192, 14720, 12224, 9472, 7360, 4992, 3840, 3648, 4096, 
4672, 5248, 6080, 6912, 7296, 6528, 5120, 3904, 3392, 4160, 4928, 
6464, 7232, 6848, 7808, 7168, 6208, 5632, 5056, 5376, 5824, 5760, 
5312, 4672, 3968, 3648, 3584, 4416, 5312, 6016, 5824, 5248, 4288, 
3200, 5056, 8896, 13696, 14784, 12608, 9856, 7616, 5312, 3648
), z_acc = c(1600, 2048, 1856, 1408, 1664, 2176, 2240, 1536, 
1600, 1536, 1152, 1152, 1536, 1600, 1728, 1600, 960, -448, -1280, 
-1472, -1024, 0, 576, 1408, 1536, 1600, 1600, 2112, 2112, 1024, 
-576, -768, 576, 2176, 2944, 2560, 2304, 1728, 832, -448, -1472, 
-2240, -1856, -512, 448, 1280, 2368, 2176, 1088, 1536, 1600, 
1920, 1664, 1600, 1344, 704, 640, 768, 1408, 1600, 1600, 1280, 
64, -896, -1280, -1216, -512, 64, 448, 1216, 1024, 1344, 1216, 
1600, 1856, 448, -1152, -704, 768, 2240, 3136, 3136, 2880, 2176, 
1024, -256, -1536, -2880, -2944, -1664, 64, 1088, 1984, 2624, 
1792, 832, 1088, 1472, 1600, 1024, 1152, 1280, 1024, 768, 896, 
1408, 1536, 1536, 1216, 192, -1024, -1920, -1920, -1280, -448, 
320, 896, 1088, 1728, 1728, 2048, 1536, -640, -1664, -1088, 448, 
1920, 2688, 2496, 2496, 1984, 704, -448, -2240, -3584, -3712, 
-2176, -128, 1152, 2176, 2560, 2496, 2240, 2304, 2496, 2496, 
2112, 1664, 1216, 640, 896, 1152, 1344, 1408, 1344, 320, -1024, 
-2432, -2944, -2752, -1728, -832, -64, 384, 1344, 1600, 1408, 
704, -1792, -3200, -2304, -768, 704, 1600, 1664, 1344, 832, -320, 
-2048, -3840, -5248, -4480, -1984, -64, 1152, 2048, 2112, 1280, 
1152, 1152, 832, 704, 896, 896, 704, 768)), row.names = c(NA, 
196L), class = "data.frame")

and,

dput(MAG_20220217_01_)
structure(list(time = structure(c(1645056800, 1645056800.096, 
1645056800.192, 1645056800.289, 1645056800.385, 1645056800.481, 
1645056800.577, 1645056800.673, 1645056800.77, 1645056800.865, 
1645056800.961, 1645056801.058, 1645056801.154, 1645056801.25, 
1645056801.346, 1645056801.442, 1645056801.538, 1645056801.634, 
1645056801.731, 1645056801.826, 1645056801.923, 1645056802.019, 
1645056802.115, 1645056802.211, 1645056802.307, 1645056802.403, 
1645056802.499, 1645056802.595, 1645056802.692, 1645056802.787, 
1645056802.884, 1645056802.98, 1645056803.076, 1645056803.172, 
1645056803.268, 1645056803.364, 1645056803.46, 1645056803.556, 
1645056803.653, 1645056803.748, 1645056803.845, 1645056803.941
), class = c("POSIXct", "POSIXt"), tzone = "Europe/Paris"), ms = c(0, 
96, 192, 289, 385, 481, 577, 673, 770, 865, 961, 1058, 1154, 
1250, 1346, 1442, 1538, 1634, 1731, 1826, 1923, 2019, 2115, 2211, 
2307, 2403, 2499, 2595, 2692, 2787, 2884, 2980, 3076, 3172, 3268, 
3364, 3460, 3556, 3653, 3748, 3845, 3941), ms_to_on_mag = c(551175445, 
551175541, 551175637, 551175734, 551175830, 551175926, 551176022, 
551176118, 551176215, 551176310, 551176406, 551176503, 551176599, 
551176695, 551176791, 551176887, 551176983, 551177079, 551177176, 
551177271, 551177368, 551177464, 551177560, 551177656, 551177752, 
551177848, 551177944, 551178040, 551178137, 551178232, 551178329, 
551178425, 551178521, 551178617, 551178713, 551178809, 551178905, 
551179001, 551179098, 551179193, 551179290, 551179386), x_mag = c(-305, 
-308, -301, -303, -300, -317, -307, -307, -303, -288, -296, -287, 
-290, -274, -278, -289, -296, -294, -293, -278, -302, -297, -304, 
-281, -280, -308, -304, -296, -299, -298, -310, -304, -297, -298, 
-300, -296, -305, -304, -298, -309, -305, -305), y_mag = c(735, 
765, 752, 762, 746, 738, 735, 752, 720, 706, 725, 759, 767, 748, 
745, 740, 726, 721, 731, 707, 714, 723, 749, 767, 767, 753, 702, 
696, 718, 733, 747, 753, 747, 772, 776, 746, 727, 732, 728, 730, 
738, 776), z_mag = c(24, 63, 24, 27, 49, 90, 123, 104, -2, 16, 
12, 8, 7, -70, -30, -30, 11, 34, -34, -36, -2, 11, 31, -33, -20, 
7, 76, 66, -19, 8, 25, 59, 26, 11, 35, 65, 103, 87, 19, 53, 72, 
115)), row.names = c(NA, 42L), class = "data.frame")

Solution

  • How about a rolling-join instead of fuzzy one?

    library(dplyr)
    left_join(ACCL_20220217_01_, MAG_20220217_01_, join_by(closest(time >= time)), suffix = c("", "_mag")) |>
      left_join(MAG_20220217_01_, join_by(closest(time <= time)), suffix = c("", "_2")) |>
      rename_with(~ sub("_mag_mag", "_mag", sub("_2", "_mag_2", .x))) |>
      rename(ms_acc = ms) |>
      mutate(
        across(ends_with("_mag"),
               ~ if_else(abs(time_mag - time) < abs(time_mag_2 - time),
                         .x, cur_data()[[paste0(cur_column(), "_2")]]))
      ) |>
      select(-ends_with("_2")) |>
      head(n = 6)
    #                  time ms_acc ms_to_on_acc x_acc y_acc z_acc            time_mag ms_mag ms_to_on_mag x_mag y_mag z_mag
    # 1 2022-02-17 01:13:20      0    551185497 12800  4480  1600 2022-02-17 01:13:20      0    551175445  -305   735    24
    # 2 2022-02-17 01:13:20     20    551185517 12096  3200  2048 2022-02-17 01:13:20      0    551175445  -305   735    24
    # 3 2022-02-17 01:13:20     41    551185538 11776  3008  1856 2022-02-17 01:13:20      0    551175445  -305   735    24
    # 4 2022-02-17 01:13:20     61    551185558 15872  5952  1408 2022-02-17 01:13:20     96    551175541  -308   765    63
    # 5 2022-02-17 01:13:20     81    551185578 17216  9856  1664 2022-02-17 01:13:20     96    551175541  -308   765    63
    # 6 2022-02-17 01:13:20    102    551185599 20096 14464  2176 2022-02-17 01:13:20     96    551175541  -308   765    63
    

    I do a double-join because closest() is going to find the closest on one side only, meaning a true rolling join. Since you want the closest even if it comes before the ACCL observation, I join again with the reversed inequality <= and then do a simple if_else to determine which of the two (first-join or second-join) to use.

    The rename_with() is just some scotch-tape to make sure all the names exist between the two joins.