I have a dataset with some missing data (in reality <1%). We are going to use data from other sites to fill in the missing data. We have a data.table
named dt1
, which contains the data. We have another data.table
named dt2
and this is an index of the sites that can be used to replace (rsite
) the missing data from a site. In dt1
I would like to create a column in dt1
named UsedSite
that records which site the data we end up using are from.
Create some data to use
dt1<-data.table(site=c("s1","s1","s2","s2"), x=c(NA, 1, NA,1))
dt2<-data.table(site=c("s1","s1","s2","s2"), rsite=c("s3","s3","s4","s4"))
Since only a small fraction of data are missing, start by setting the values of UsedSite
to site
dt1[,UsedSite:=site]
The following is where I can't figure out how to do this. I want to just replace the UsedSite
values with rsite
where x
is NA
. Not at all surprisingly, the line below replaces everything - I don't want that. I don't understand how to instruct it to consider only lines where a condition (is.na(x)
) is being met.
dt1[dt2,UsedSite:=rsite, on=.(site)]
My natural impulse is to try dt1[dt2&is.na(x),UsedSite:=rsite, on=.(site)]
, but that doesn't work. Thoughts? I would like to stick with the data.table
paradigm in this.
We may use fcoalesce
here to replace the NA values with the corresponding values of 'rsite' If it is a non-NA, it won't get replaced as fcoalesce
will return the first non-NA for each row
library(data.table)
dt1[dt2, UsedSite := fcoalesce(as.character(x), rsite), on = .(site)]
-output
> dt1
site x UsedSite
1: s1 NA s3
2: s1 1 1
3: s2 NA s4
4: s2 1 1
dt1[dt2, UsedSite := fifelse(is.na(x), rsite, site), on = .(site)]
-output
> dt1
site x UsedSite
1: s1 NA s3
2: s1 1 s1
3: s2 NA s4
4: s2 1 s2