Please have a look at the following code. It constructs a data.table and performs a kind of non-equi self join (except that I copied the table before joining, so I know which column comes from where).
library(data.table)
dt <- data.table(ind=1:5, x=letters[1:5])
dtcopy <- copy(dt)
dtcopy[,indminus3 := ind - 3]
setnames(dtcopy, names(dtcopy), paste0(names(dtcopy), "copy"))
dtcopy[,indcopy2 := indcopy]
dt[dtcopy, on = c("ind<=indcopy", "ind>indminus3copy")]
The idea is that for each index "i" I get all the entries whose index lies between "i-3" (exclusive) and "i" (inclusive).
This is what I get:
ind x ind.1 xcopy indcopy2
1: 1 a -2 a 1
2: 2 a -1 b 2
3: 2 b -1 b 2
4: 3 a 0 c 3
5: 3 b 0 c 3
6: 3 c 0 c 3
7: 4 b 1 d 4
8: 4 c 1 d 4
9: 4 d 1 d 4
10: 5 c 2 e 5
11: 5 d 2 e 5
12: 5 e 2 e 5
I have several questions about that:

1. Where does ind.1 come from? What does it mean? Why is it there?
2. What about indcopy2? Is that correct?
3. Why do ind and x not match? In the original dt there is no row with ind==2 and x=="a". It seems like the matches are between ind and xcopy; I find this confusing. Why is it the way it is?

These are my questions. Now let me show you what I want to do with it; maybe you can comment on that as well:
There is no frollmedian, and using frollapply is very slow. In addition, I need an adaptive window, which is not really well implemented anyway. So here is some code that computes the rolling median, using all entries no older than 10 seconds as the basis for aggregation:
library(data.table)
dt <- data.table(time = as.POSIXct("2020-06-01") + c(13, 20, 23, 26, 30, 38, 42, 50),
x = c(0, .1, 1, 10, 100, 1000, 10000, 100000))
dtcopy <- copy(dt)
dtcopy[,timeminus10 := time - 10]
setnames(dtcopy, names(dtcopy), paste0(names(dtcopy), "copy"))
dtcopy[,timecopy2 := timecopy]
dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy")][,median(x),by=timecopy2]$V1
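For reference, the naive baseline this replaces can be written with sapply over the rows (a sketch; it is O(n^2), which is why it gets so slow on larger data):

```r
library(data.table)

dt <- data.table(time = as.POSIXct("2020-06-01") + c(13, 20, 23, 26, 30, 38, 42, 50),
                 x = c(0, .1, 1, 10, 100, 1000, 10000, 100000))

# for each row, take the median over all x whose time lies in [t - 10, t]
dt[, xmed10 := sapply(time, function(t) median(x[time >= t - 10 & time <= t]))]
dt$xmed10
# 0, 0.05, 0.1, 1, 5.5, 550, 5500, 55000 -- matching the join-based result
```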
What do you think of this approach? It is definitely a lot faster than frollapply (more than 100 times faster, if I remember correctly, for data with about 10000 entries). What I personally do not understand is why dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy")][,median(x),by=time] also gives a correct result, and what I understand even less is that the seemingly same statement dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy"),median(x),by=time] gives a wrong result.
The last puzzling thing I want to share with you is that dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy")][,median(x),by=timecopy2] works, but dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy"),median(x),by=timecopy2] gives an error.
For the first set of questions: the non-equi condition ind<=indcopy results in a column named ind that takes the value of indcopy in each row that meets the full set of join conditions. Likewise, the non-equi condition ind>indminus3copy would also result in a column named ind, except that there is already a column named ind from the first condition, so it is renamed ind.1. ind.1 takes the value of indminus3copy from the row that meets the join conditions.

To summarize: in the final output, the value in ind comes from indcopy, and the value in ind.1 comes from indminus3copy. The original ind column from dt is not carried through to the final output.
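If you do want the original values of dt's join columns in the output, you can select them explicitly in j with the x. prefix (and i.-prefixed columns for dtcopy's side); a sketch:

```r
library(data.table)

dt <- data.table(ind = 1:5, x = letters[1:5])
dtcopy <- copy(dt)
dtcopy[, indminus3 := ind - 3]
setnames(dtcopy, names(dtcopy), paste0(names(dtcopy), "copy"))

# x.<col> refers to dt's original columns, i.<col> to dtcopy's,
# so the original ind values from dt reappear in the output
dt[dtcopy, on = .(ind <= indcopy, ind > indminus3copy),
   .(ind = x.ind, x = x.x, indcopy = i.indcopy)]
```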
It is easier to see what is happening by creating a copy of the columns used in the comparison (i.e., for A[B, on = .(...)], create a copy of the columns of A that appear in ...):
dt[,ind2 := ind]
dt[dtcopy, on = .(ind<=indcopy, ind>indminus3copy)]
#> ind x ind2 ind.1 xcopy indcopy2
#> <int> <char> <int> <int> <char> <int>
#> 1: 1 a 1 -2 a 1
#> 2: 2 a 1 -1 b 2
#> 3: 2 b 2 -1 b 2
#> 4: 3 a 1 0 c 3
#> 5: 3 b 2 0 c 3
#> 6: 3 c 3 0 c 3
#> 7: 4 b 2 1 d 4
#> 8: 4 c 3 1 d 4
#> 9: 4 d 4 1 d 4
#> 10: 5 c 3 2 e 5
#> 11: 5 d 4 2 e 5
#> 12: 5 e 5 2 e 5
dt[,ind3 := ind]
dt[dtcopy, on = .(ind<=indcopy, ind2>indminus3copy)]
#> ind x ind2 ind3 xcopy indcopy2
#> <int> <char> <int> <int> <char> <int>
#> 1: 1 a -2 1 a 1
#> 2: 2 a -1 1 b 2
#> 3: 2 b -1 2 b 2
#> 4: 3 a 0 1 c 3
#> 5: 3 b 0 2 c 3
#> 6: 3 c 0 3 c 3
#> 7: 4 b 1 2 d 4
#> 8: 4 c 1 3 d 4
#> 9: 4 d 1 4 d 4
#> 10: 5 c 2 3 e 5
#> 11: 5 d 2 4 e 5
#> 12: 5 e 2 5 e 5
The median calculation using a non-equi join can be simplified a bit:
dt <- data.table(time = as.POSIXct("2020-06-01") + c(13, 20, 23, 26, 30, 38, 42, 50),
x = c(0, .1, 1, 10, 100, 1000, 10000, 100000))
dt[,time10 := time - 10]
dt[dt, on = .(time <= time, time >= time10)][,.(xmed10 = median(x)), time]
#> time xmed10
#> <POSc> <num>
#> 1: 2020-06-01 00:00:13 0.0e+00
#> 2: 2020-06-01 00:00:20 5.0e-02
#> 3: 2020-06-01 00:00:23 1.0e-01
#> 4: 2020-06-01 00:00:26 1.0e+00
#> 5: 2020-06-01 00:00:30 5.5e+00
#> 6: 2020-06-01 00:00:38 5.5e+02
#> 7: 2020-06-01 00:00:42 5.5e+03
#> 8: 2020-06-01 00:00:50 5.5e+04
This is a good approach. However, for large datasets the join may create a very large intermediate table and run into memory problems. One way around that is by = .EACHI:
dt[dt, on = .(time >= time10, time <= time), .(xmed10 = median(x)),
   by = .EACHI][,time := NULL][]
#> time xmed10
#> <POSc> <num>
#> 1: 2020-06-01 00:00:13 0.0e+00
#> 2: 2020-06-01 00:00:20 5.0e-02
#> 3: 2020-06-01 00:00:23 1.0e-01
#> 4: 2020-06-01 00:00:26 1.0e+00
#> 5: 2020-06-01 00:00:30 5.5e+00
#> 6: 2020-06-01 00:00:38 5.5e+02
#> 7: 2020-06-01 00:00:42 5.5e+03
#> 8: 2020-06-01 00:00:50 5.5e+04
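As a quick sanity check on the small example (which already has the time10 column), the .EACHI form returns the same medians as the two-step version:

```r
library(data.table)

dt <- data.table(time = as.POSIXct("2020-06-01") + c(13, 20, 23, 26, 30, 38, 42, 50),
                 x = c(0, .1, 1, 10, 100, 1000, 10000, 100000))
dt[, time10 := time - 10]

# two-step: materialize the full join, then aggregate
res_join  <- dt[dt, on = .(time <= time, time >= time10)][, .(xmed10 = median(x)), time]
# .EACHI: aggregate each i row's matches without materializing the join
res_eachi <- dt[dt, on = .(time >= time10, time <= time), .(xmed10 = median(x)),
                by = .EACHI][, time := NULL]
all.equal(res_join$xmed10, res_eachi$xmed10)
#> [1] TRUE
```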
Demonstrating the advantage of .EACHI with a larger dataset (note the difference in memory usage):
N <- 1e5
lookback <- 1e4
dt <- data.table(time = as.POSIXct("2020-06-01") + sample(2*N, N), x = runif(N))
dt[,time0 := time - lookback]
bench::mark(
by = dt[dt, on = .(time <= time, time >= time0)][,.(xmed = median(x)), time],
.EACHI = dt[dt, on = .(time >= time0, time <= time), .(xmed = median(x)),
              by = .EACHI][,time := NULL]
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 by 37.8s 37.8s 0.0264 38.23GB 0.132
#> 2 .EACHI 17.9s 17.9s 0.0559 7.28GB 0.615
Note that .EACHI will only be faster when the intermediate table approaches the available memory.
The reason dt[dtcopy, on = c("time<=timecopy", "time>=timeminus10copy"),median(x),by=time] isn't working the way you want is that the grouping operation is done on dt before the join.
To make it more obvious, consider
dt[dtcopy, on = .(time <= timecopy, time >= timeminus10copy), .(x), by = time]
#> time x
#> <POSc> <num>
#> 1: 2020-06-01 00:00:13 0e+00
#> 2: 2020-06-01 00:00:13 0e+00
#> 3: 2020-06-01 00:00:13 0e+00
#> 4: 2020-06-01 00:00:20 1e-01
#> 5: 2020-06-01 00:00:20 1e-01
#> 6: 2020-06-01 00:00:20 1e-01
#> 7: 2020-06-01 00:00:20 1e-01
#> 8: 2020-06-01 00:00:23 1e+00
#> 9: 2020-06-01 00:00:23 1e+00
#> 10: 2020-06-01 00:00:23 1e+00
#> 11: 2020-06-01 00:00:26 1e+01
#> 12: 2020-06-01 00:00:26 1e+01
#> 13: 2020-06-01 00:00:30 1e+02
#> 14: 2020-06-01 00:00:30 1e+02
#> 15: 2020-06-01 00:00:38 1e+03
#> 16: 2020-06-01 00:00:38 1e+03
#> 17: 2020-06-01 00:00:42 1e+04
#> 18: 2020-06-01 00:00:42 1e+04
#> 19: 2020-06-01 00:00:50 1e+05
This is essentially the same as doing
dt[,time2 := time]
dt[1][dtcopy, on = .(time <= timecopy, time >= timeminus10copy), .(time = time2, x), nomatch = 0]
#> time x
#> <POSc> <num>
#> 1: 2020-06-01 00:00:13 0
#> 2: 2020-06-01 00:00:13 0
#> 3: 2020-06-01 00:00:13 0
dt[2][dtcopy, on = .(time <= timecopy, time >= timeminus10copy), .(time = time2, x), nomatch = 0]
#> time x
#> <POSc> <num>
#> 1: 2020-06-01 00:00:20 0.1
#> 2: 2020-06-01 00:00:20 0.1
#> 3: 2020-06-01 00:00:20 0.1
#> 4: 2020-06-01 00:00:20 0.1
...