I am trying to join two data tables with a data.table non-equi join. However, some of the dates in table 1, do not fall within a date range in table 2. What I want with these "nomatch", is to join them to the nearest date range. Lets says there are two ranges; 2002-01-01 to 2004-01-01 and another range; 2010-01-01 to 2012-01-01. The date 2009-01-01 will be out of range, however I want this date to be joined to the 2011-2012 range, as the year 2009 is closer to 2011, than the year 2004. I hope my question makes sense.
Example data
library(data.table)
# data.table with single dates
id <- c("xxx","xxx","xxx","yyy","yyy","yyy","zzz","zzz","zzz")
date <- as.Date(c("2019-02-01","2020-05-01","2021-02-01","2020-01-01","2020-07-01","2020-08-01","2021-03-01","2022-10-01","2023-11-16"))
single_date_dt <- data.table(id,date)
# data.table with range dates
id <- c("xxx","xxx","yyy","zzz","zzz")
weight_kg <- c(10,13,20,45,65)
start_date <- as.Date(c("2019-01-01","2021-01-01","2020-01-01","2021-01-01","2023-01-01"))
end_date <- as.Date(c("2019-12-31","2021-12-31","2021-01-01","2021-12-31","2023-12-31"))
range_date_dt <- data.table(id,weight_kg,start_date,end_date)
dates_joined <- range_date_dt[single_date_dt, on = .(id, start_date <= date, end_date >= date), nomatch = NA]
There are two NA's in the "dates_joined" data.table. The first NA should be "10" since 2020-05-01 is closer to 2019-12-31 than 2021-01-01, and the second NA should be 65, since 2022-10-01 is closer to 2023-01-01 than 2021-12-31.
The real dataset consist of several million observations with around 5000 "out of range" dates.
Thanks
I think we can safely assume the ranges won't overlap for the same id
, so we can perform the rolling join using a melt
ed range_date_dt
:
melt(range_date_dt, c("id", "weight_kg"), value.name = "date")[
,variable := NULL
][single_date_dt, on = .(id, date), roll = "nearest"]
#> id weight_kg date
#> 1: xxx 10 2019-02-01
#> 2: xxx 10 2020-05-01
#> 3: xxx 13 2021-02-01
#> 4: yyy 20 2020-01-01
#> 5: yyy 20 2020-07-01
#> 6: yyy 20 2020-08-01
#> 7: zzz 45 2021-03-01
#> 8: zzz 65 2022-10-01
#> 9: zzz 65 2023-11-16