I have some data in a table and I would like to do a non-equi join (I think is the right terminology) and filter it by different ranges for different groups. In the below example I would like to filter group "a" so that it only returns values between 1 and 20 (inclusive) and group "b" so it only returns values between 80 and 100 (inclusive). My reading suggests that inrange
should be the item to use. I get how to use it in the generic case, but I am not sure how to get it to run with different ranges by group. (example code adapted from ?inrange
)
create sample data
set.seed(1234)
Y = data.table(a=sample(1:100,100), val=runif(100), group=c(rep("a",50),rep("b",50)))
range = data.table(group=c("a","b"),start = c(1,80), end = c(20,100))
Try to filter
Y[inrange(a, range$start, range$end),,by=group]
This obviously does not work and instead applies those ranges to the entire dataset and throws the error message Ignoring by= because j= is not supplied
. I think it is clear to me that this is not working because I haven't created the 'join' between the range table and Y, but I am not seeing how to make the two tables communicate grouping via inrange
.
Note: In reality values in a will be posixct datetimes, but for the sake of simplicity I am not using that here.
Perhaps:
Y[range, K := TRUE, on = .(group, a >= start, a <= end)][!is.na(K),]
# a val group K
# <int> <num> <char> <lgcl>
# 1: 9 0.60189755 a TRUE
# 2: 5 0.99874081 a TRUE
# 3: 16 0.55512663 a TRUE
# 4: 4 0.42944396 a TRUE
# 5: 14 0.43101637 a TRUE
# 6: 3 0.47880269 a TRUE
# 7: 2 0.02220682 a TRUE
# 8: 6 0.63891131 a TRUE
# 9: 8 0.83470266 a TRUE
# 10: 17 0.98304402 a TRUE
# 11: 98 0.76785547 b TRUE
# 12: 94 0.30766574 b TRUE
# 13: 88 0.25814665 b TRUE
# 14: 89 0.49954639 b TRUE
# 15: 83 0.50892062 b TRUE
# 16: 95 0.49443856 b TRUE
# 17: 97 0.56695890 b TRUE
# 18: 87 0.98970989 b TRUE
# 19: 82 0.53190509 b TRUE
# 20: 100 0.59662376 b TRUE
# a val group K
There are other ways to do this, but they involve renaming or loss of information. For instance,
left-join range
and Y
, we lose a
:
Y[range, on = .(group, a >= start, a <= end)]
# a val group a.1
# <int> <num> <char> <int>
# 1: 1 0.60189755 a 20
# 2: 1 0.99874081 a 20
# 3: 1 0.55512663 a 20
# ...
# 18: 80 0.98970989 b 100
# 19: 80 0.53190509 b 100
# 20: 80 0.59662376 b 100
# a val group a.1
The fix is to copy Y$a
into a new variable and join on it instead:
Y[,a1 := a][range, on = .(group, a1 >= start, a1 <= end)]
# a val group a1 a1.1
# <int> <num> <char> <int> <int>
# 1: 9 0.60189755 a 1 20
# 2: 5 0.99874081 a 1 20
# 3: 16 0.55512663 a 1 20
# ...
# 18: 87 0.98970989 b 80 100
# 19: 82 0.53190509 b 80 100
# 20: 100 0.59662376 b 80 100
# a val group a1 a1.1
left-join Y
and range
, we get a
duplicated into start
and end
but no clear indicator to filter on:
range[Y, on = .(group, start <= a, end >= a)]
# group start end val
# <char> <int> <int> <num>
# 1: a 28 28 0.85026492
# 2: a 80 80 0.23466126
# 3: a 22 22 0.98816745
# ...
# 98: b 82 82 0.53190509
# 99: b 100 100 0.59662376
# 100: b 30 30 0.26388647
# group start end val
A remedy would be to copy in another field that would give us the indicator of merge that we need to be able to filter. But even with that we have to rename to regain a
's data:
range[, K := TRUE][Y, on = .(group, start <= a, end >= a)][ !is.na(K), ]
# group start end K val
# <char> <int> <int> <lgcl> <num>
# 1: a 9 9 TRUE 0.60189755
# 2: a 5 5 TRUE 0.99874081
# 3: a 16 16 TRUE 0.55512663
# ...
# 18: b 87 87 TRUE 0.98970989
# 19: b 82 82 TRUE 0.53190509
# 20: b 100 100 TRUE 0.59662376
# group start end K val