I'm attempting to merge datasets based on drug
and dates.
The left (df1
) is a long claim dataset where people (id
) can appear multiple times with different claim
ids, drugs (drug
), and dates (claim_date
).
The right (df2
) is a long dataset of starting dates (price_date
) for effective drug
prices (price
). The price is effective until there is a new date.
I want to keep all records on the left (df1
). Some drugs like 40 do not have pricing data in df2
. Other drug have claims that occur before the first effective price. These should all result in NA when merged.
The real dataset has a few hundred thousand claims and maybe 40k pricing rows, so my tidyverse approaches were slow and not correct. I don't know data.table well, but I kept getting incorrect matches and unexpected NAs for the left hand side.
library(tidyverse)
# claims
df1 <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
claim = c("a", "b", "c", "d", "e", "f", "g"),
drug = c(10, 10, 20, 30, 31, 32, 40),
claim_date = ymd("2024-01-01", "2024-02-01",
"2024-01-01",
"2024-01-01", "2024-02-01", "2024-02-10",
"2024-03-13"))
# price
# no price data for drug 4
df2 <- tibble(drug = c(rep(10, 3), rep(20, 3), rep(30, 3),
rep(31, 3), rep(32, 3)),
price = c(1, 2, 3, #10
4, 5, 6, #20
7, 8, 9, #30
10, 11, 12, #31
13, 14, 15 #32
),
price_date = ymd("2024-01-01", "2024-02-01", "2024-03-01", #10
"2024-01-10", "2024-02-11", "2024-03-12", #20
"2024-01-20", "2024-02-21", "2024-03-22", #30
"2024-01-20", "2024-02-21", "2024-03-22", #31
"2024-01-20", "2024-02-21", "2024-03-22") #32
)
result <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
claim = c("a", "b", "c", "d", "e", "f", "g"),
drug = c(10, 10, 20, 30, 31, 32, 40),
claim_date = ymd("2024-01-01", "2024-02-01", #10
"2024-01-01", #20
"2024-01-01", #30
"2024-02-01", #31
"2024-02-10", #32
"2024-03-13" #40
),
price = c(1, 2, # 10
NA, # 20; claim before first price date
NA, # 30
10, # 31
13, # 32
NA # 40; no drug price info
))
df1
# A tibble: 7 × 4
id claim drug claim_date
<dbl> <chr> <dbl> <date>
1 1 a 10 2024-01-01
2 1 b 10 2024-02-01
3 2 c 20 2024-01-01
4 3 d 30 2024-01-01
5 3 e 31 2024-02-01
6 3 f 32 2024-02-10
7 4 g 40 2024-03-13
# A tibble: 15 × 3
drug price price_date
<dbl> <dbl> <date>
1 10 1 2024-01-01
2 10 2 2024-02-01
3 10 3 2024-03-01
4 20 4 2024-01-10
5 20 5 2024-02-11
6 20 6 2024-03-12
7 30 7 2024-01-20
8 30 8 2024-02-21
9 30 9 2024-03-22
10 31 10 2024-01-20
11 31 11 2024-02-21
12 31 12 2024-03-22
13 32 13 2024-01-20
14 32 14 2024-02-21
15 32 15 2024-03-22
> result
# A tibble: 7 × 5
id claim drug claim_date price
<dbl> <chr> <dbl> <date> <dbl>
1 1 a 10 2024-01-01 1
2 1 b 10 2024-02-01 2
3 2 c 20 2024-01-01 NA
4 3 d 30 2024-01-01 NA
5 3 e 31 2024-02-01 10
6 3 f 32 2024-02-10 13
7 4 g 40 2024-03-13 NA
If df2
is ordered, perhaps you can use a temporary variable i
:
library(dplyr)
df1 |>
left_join(mutate(df2, i=row_number()), by=join_by(drug, claim_date>=price_date)) |>
slice_max(i, n=1, by=claim) |>
select(-price_date, -i)
giving
# A tibble: 7 × 5
id claim drug claim_date price
<dbl> <chr> <dbl> <date> <dbl>
1 1 a 10 2024-01-01 1
2 1 b 10 2024-02-01 2
3 2 c 20 2024-01-01 NA
4 3 d 30 2024-01-01 NA
5 3 e 31 2024-02-01 10
6 3 f 32 2024-02-10 13
7 4 g 40 2024-03-13 NA