r

merge based on id and dates with missing data


I'm attempting to merge datasets based on drug and dates.

The left (df1) is a long claim dataset where people (id) can appear multiple times with different claim ids, drugs (drug), and dates (claim_date).

The right (df2) is a long dataset of starting dates (price_date) for effective drug prices (price). The price is effective until there is a new date.

I want to keep all records on the left (df1). Some drugs like 40 do not have pricing data in df2. Other drug have claims that occur before the first effective price. These should all result in NA when merged.

The real dataset has a few hundred thousand claims and maybe 40k pricing rows, so my tidyverse approaches were slow and not correct. I don't know data.table well, but I kept getting incorrect matches and unexpected NAs for the left hand side.

library(tidyverse)

# claims
df1 <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
              claim = c("a", "b", "c", "d", "e", "f", "g"),
              drug = c(10, 10, 20, 30, 31, 32, 40),
              claim_date = ymd("2024-01-01", "2024-02-01",
                               "2024-01-01",
                               "2024-01-01", "2024-02-01", "2024-02-10",
                               "2024-03-13"))

# price
# no price data for drug 4
df2 <- tibble(drug = c(rep(10, 3), rep(20, 3), rep(30, 3),
                       rep(31, 3), rep(32, 3)),
              price = c(1, 2, 3,     #10
                        4, 5, 6,     #20
                        7, 8, 9,     #30
                        10, 11, 12,  #31
                        13, 14, 15   #32
                        ),
              price_date = ymd("2024-01-01", "2024-02-01", "2024-03-01", #10
                               "2024-01-10", "2024-02-11", "2024-03-12", #20
                               "2024-01-20", "2024-02-21", "2024-03-22", #30
                               "2024-01-20", "2024-02-21", "2024-03-22", #31
                               "2024-01-20", "2024-02-21", "2024-03-22") #32
              )

result <- tibble(id = c(1, 1, 2, 3, 3, 3, 4),
                 claim = c("a", "b", "c", "d", "e", "f", "g"),
                 drug = c(10, 10, 20, 30, 31, 32, 40),
                 claim_date = ymd("2024-01-01", "2024-02-01", #10
                                  "2024-01-01", #20
                                  "2024-01-01", #30
                                  "2024-02-01", #31
                                  "2024-02-10", #32
                                  "2024-03-13"  #40
                                  ),
                 price = c(1, 2,     # 10
                           NA,       # 20; claim before first price date
                           NA,       # 30
                           10,       # 31
                           13,       # 32
                           NA        # 40; no drug price info
                           ))
df1
# A tibble: 7 × 4
     id claim  drug claim_date
  <dbl> <chr> <dbl> <date>    
1     1 a        10 2024-01-01
2     1 b        10 2024-02-01
3     2 c        20 2024-01-01
4     3 d        30 2024-01-01
5     3 e        31 2024-02-01
6     3 f        32 2024-02-10
7     4 g        40 2024-03-13

# A tibble: 15 × 3
    drug price price_date
   <dbl> <dbl> <date>    
 1    10     1 2024-01-01
 2    10     2 2024-02-01
 3    10     3 2024-03-01
 4    20     4 2024-01-10
 5    20     5 2024-02-11
 6    20     6 2024-03-12
 7    30     7 2024-01-20
 8    30     8 2024-02-21
 9    30     9 2024-03-22
10    31    10 2024-01-20
11    31    11 2024-02-21
12    31    12 2024-03-22
13    32    13 2024-01-20
14    32    14 2024-02-21
15    32    15 2024-03-22

> result
# A tibble: 7 × 5
     id claim  drug claim_date price
  <dbl> <chr> <dbl> <date>     <dbl>
1     1 a        10 2024-01-01     1
2     1 b        10 2024-02-01     2
3     2 c        20 2024-01-01    NA
4     3 d        30 2024-01-01    NA
5     3 e        31 2024-02-01    10
6     3 f        32 2024-02-10    13
7     4 g        40 2024-03-13    NA

Solution

  • If df2 is ordered, perhaps you can use a temporary variable i:

    library(dplyr)
    df1 |> 
      left_join(mutate(df2, i=row_number()), by=join_by(drug, claim_date>=price_date)) |> 
      slice_max(i, n=1, by=claim) |>
      select(-price_date, -i) 
    

    giving

    # A tibble: 7 × 5
         id claim  drug claim_date price
      <dbl> <chr> <dbl> <date>     <dbl>
    1     1 a        10 2024-01-01     1
    2     1 b        10 2024-02-01     2
    3     2 c        20 2024-01-01    NA
    4     3 d        30 2024-01-01    NA
    5     3 e        31 2024-02-01    10
    6     3 f        32 2024-02-10    13
    7     4 g        40 2024-03-13    NA