r dplyr data.table data-analysis exploratory-data-analysis

Fill a data table by matching column values from onother table (lookup table), and interpolate on another variable

I have a very long table with data in R, with some key columns and a lookup table with the same key columns and some extra variables. I want to select from the lookup table, a subset based on the same keys and fill a new column in the data table by interpolating on a common variable.

I try to give a simplified example:

library(data.table)
data <- data.table(A = c("A","A","A","B","B","B","C","C","C"),
                   B = c(1,1,1,1,1,1,2,2,2),
                   C = rep(c(0.15, 0.22, 0.3),3))
data
#    A B    C
# 1: A 1 0.15
# 2: A 1 0.22
# 3: A 1 0.30
# 4: B 1 0.15
# 5: B 1 0.22
# 6: B 1 0.30
# 7: C 2 0.15
# 8: C 2 0.22
# 9: C 2 0.30

look <- data.table(A = c("A","A","A","B","B","B","C","C","C"),
                   B = c(1,1,1,1,1,1,2,2,2),
                   C = rep(c(0.1, 0.2, 0.3),3),
                   D = c(10, 20, 30, 11,22,33,12,24,36))
look
#    A B   C  D
# 1: A 1 0.1 10
# 2: A 1 0.2 20
# 3: A 1 0.3 30
# 4: B 1 0.1 11
# 5: B 1 0.2 22
# 6: B 1 0.3 33
# 7: C 2 0.1 12
# 8: i 2 0.2 24
# 9: C 2 0.3 36

I want for a matching A and B columns to use look$C and look$D and create a new data$D column by interpolating based on the data$C.

I have a for loop that for each matching A, B combination creates a subset matched from look runs f <- approxfun(look$C, look$D) and apply it data[ A == 'A' & B == 1, D := f(C) ].

so in one iteration will be like

f <- approxfun(c(0.1, 0.2, 0.3), c(10, 20, 30) )
f(c(0.15, 0.22, 0.3))
## and the data table will become
data
#    A B    C  B
# 1: A 1 0.15 15
# 2: A 1 0.22 22 
# 3: A 1 0.30 30
# 4: B 1 0.15
# 5: B 1 0.22
# 6: B 1 0.30
# 7: C 2 0.15
# 8: C 2 0.22
# 9: C 2 0.30

The problem is that the nrow(DATA) > 6000000 and nrow(look) > 20000, and it has to be done for 3 new columns, and the iteration have been running for over a day now. So I need a more efficient approach.

Probably there is a better way with data.table or dlpyr, but I haven't managed to formulate one.

And for some context:

In my field, it is common practice, to work with observations and equivalent modelled values, used for reference. We can run a model for each observation, or a range of values. Some of the common variables, between observations and model can be exact, and hopefully only one close enough, and can be interpolated from the modelled values.

In this case, I work with a 30 years dataset with resolution of 1 minute, the usable observations are currently > 6000000. The modelled values, for now, are just monthly climatological variations > 27000. In the future, I expect the modelled values to be much more, with a finer time step, so the matching and interpolation problem will be a major bottleneck in the analysis.

Solution:

@thothal method gave me the exact results as my first approach, with the iterative evaluation. The dplyr didn't, although, my data are complex, and maybe I miss something in the implementation. From ~2 hours, I got to just some seconds! Thanks to everybody for your suggestions, you gave me some very useful insights for this kind of problem.

Solution

Let me drop in a data.table solution:

## 1. Create an `approxfun` for each `A/B` combination

## The + 0 trick is necessary b/c otherwise data.table does not evaluate C, D
## while creation and thus keeps just the last chunk of it 

(fns <- look[, .(f = list(approxfun(C + 0, D + 0))), .(A, B)])
#    A B             f
# 1: A 1 <function[1]>
# 2: B 1 <function[1]>
# 3: C 2 <function[1]>

## 2. Join it to data and apply the function
data[fns, .(A, B, C, D = Map(\(f, x) f(x), f, C)), on = .(A, B)]
#    A B    C    D
# 1: A 1 0.15   15
# 2: A 1 0.22   22
# 3: A 1 0.30   30
# 4: B 1 0.15 16.5
# 5: B 1 0.22 24.2
# 6: B 1 0.30   33
# 7: C 2 0.15   18
# 8: C 2 0.22 26.4
# 9: C 2 0.30   36

## w/o Map you would do:
## data[fns, on = .(A, B)][, .(C, f[[1]](C)), .(A, B)]

Admittedly, the + 0 trick is ugly and if somebody knows a more canonical way to solve it, I am very happy to hear, b/c leaving it out won't work (most likely b/c of lazy evaluation):

data[look[, .(f = list(approxfun(C, D))), .(A, B)], 
     .(A, B, C, D = Map(\(f, x) f(x), f, C)), on = .(A, B)]
#    A B    C    D
# 1: A 1 0.15   18 <<- always approxfun(c(.1, .2, .3), c(12, 24, 26))
# 2: A 1 0.22 26.4
# 3: A 1 0.30   36
# 4: B 1 0.15   18
# 5: B 1 0.22 26.4
# 6: B 1 0.30   36
# 7: C 2 0.15   18
# 8: C 2 0.22 26.4
# 9: C 2 0.30   36