rdata.tabledplyr

Assign a value based on closest neighbour from other data frame


With generic data:

set.seed(456)

a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)

c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
  d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)

For each df1$b we want to find the nearest df2$d. Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d

This question has guided me towards data.table library. But I am not sure if ddplyr and group_by can also be used:

Here was my data.table attempt:

library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )

dt[ ldt , list( d ) , roll = "nearest" ]

Solution

  • Here's one way with data.table:

    require(data.table)
    setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
    

    You need to get df2$c corresponding to the nearest value in df2$d for every df1$b. So, we need to join as df2[df1] which results in nrow(df1) rows.That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].

    It returns the result you require. All we need to do is to add this back to df1 with the name XYZ. We do that using :=.


    The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):

    1. We need get some value(s) for each row of df1. That means, i = df1 in x[i] syntax.

      df2[df1]
      
    2. We need to join df2$d with df1$b. Using on= that'd be:

      df2[df1, on=c(d="b")]
      
    3. We need just the c column. Use j to select just that column.

      df2[df1, c, on=c(d="b")]
      
    4. We don't need equi-join but roll to nearest join.

      df2[df1, c, on=c(d="b"), roll="nearest"]
      

    Hope this helps.