rdata.tableinner-joinnon-equi-join

Conditional non-equi join in data.table


I have a dataset with some missing data (in reality <1%). We are going to use data from other sites to fill in the missing data. We have a data.table named dt1, which contains the data. We have another data.table named dt2 and this is an index of the sites that can be used to replace (rsite) the missing data from a site. In dt1 I would like to create a column in dt1 named UsedSite that records which site the data we end up using are from.

Create some data to use

dt1<-data.table(site=c("s1","s1","s2","s2"), x=c(NA, 1, NA,1))
dt2<-data.table(site=c("s1","s1","s2","s2"), rsite=c("s3","s3","s4","s4"))

Since only a small fraction of data are missing, start by setting the values of UsedSite to site

dt1[,UsedSite:=site]

The following is where I can't figure out how to do this. I want to just replace the UsedSite values with rsite where x is NA. Not at all surprisingly, the line below replaces everything - I don't want that. I don't understand how to instruct it to consider only lines where a condition (is.na(x)) is being met.

dt1[dt2,UsedSite:=rsite, on=.(site)]

My natural impulse is to try dt1[dt2&is.na(x),UsedSite:=rsite, on=.(site)], but that doesn't work. Thoughts? I would like to stick with the data.table paradigm in this.


Solution

  • We may use fcoalesce here to replace the NA values with the corresponding values of 'rsite' If it is a non-NA, it won't get replaced as fcoalesce will return the first non-NA for each row

    library(data.table)
    dt1[dt2, UsedSite := fcoalesce(as.character(x), rsite), on = .(site)]
    

    -output

    > dt1
       site  x UsedSite
    1:   s1 NA       s3
    2:   s1  1        1
    3:   s2 NA       s4
    4:   s2  1        1
    

    Update

    dt1[dt2, UsedSite := fifelse(is.na(x), rsite, site), on = .(site)]
    

    -output

    > dt1
       site  x UsedSite
    1:   s1 NA       s3
    2:   s1  1       s1
    3:   s2 NA       s4
    4:   s2  1       s2