r

Replicate Index/Match in Tidyverse with date range and multiple criteria


I'm trying to match two datasets that have a common unique ID. One has a date and the other has date ranges. How do I get values from one into the other where the unique ID matches and the date is within the date range?

I tried case_when and it does work IF they are the same size, but the two datasets are not, so I'm stumped on what to do.

It doesn't necessarily have to be a tidyverse solution

library(tidyverse)

dates <- c("2024-01-01", "2024-03-27", "2024-05-15")
unique_keys <- c(123, 456, 789)
value1 <- c(10, 30, 20)

data_df <- tibble(dates, unique_keys, value1)
data_df

unique_keys <- c(123, 456)
start_date <- c("2023-12-15", "2024-04-01")
end_date <- c("2024-01-02", "2024-06-01")
value2 <- c(11, 28)

key_df <- tibble(unique_keys, start_date, end_date, value2)
key_df

# Goal: 
# Take value2 from the key_df and bring it into data_df when both the unique_key matches and the date is within the range

# In the example dataset:
# data_df row1 matches up with key_df row 1, so I want a new column in data_df that shows 11
# data_df row2 unique_keys matches with key_df row 2, BUT the date is wrong. Can just show NA

data_df %>% 
  mutate(value2 = case_when(
    data_df$unique_keys == key_df$unique_keys & data_df$dates >= key_df$start_date & data_df$dates <= key_df$end_date ~ key_df$value2
  ))

Solution

  • You are probably looking for an overlap joins (see R4ds). From your example, I changed dates expressed in characters to date representations with lubridate::ymd() and used a full_join() with join_by(unique_keys,between(dates, start_date, end_date)) :

    data_df %>% 
      full_join( key_df, 
                 join_by(unique_keys,between(dates, start_date, end_date)))
    

    You might prefer an inner_join() or a left_join()

    Full reprex:

    library(tidyverse)
    dates <- c("2024-01-01", "2024-03-27", "2024-05-15")
    unique_keys <- c(123, 456, 789)
    value1 <- c(10, 30, 20)
    
    data_df <- tibble(dates, unique_keys, value1) %>% 
      mutate(dates = ymd(dates), 
             dates = ymd(dates))
    data_df
    #> # A tibble: 3 × 3
    #>   dates      unique_keys value1
    #>   <date>           <dbl>  <dbl>
    #> 1 2024-01-01         123     10
    #> 2 2024-03-27         456     30
    #> 3 2024-05-15         789     20
    
    unique_keys <- c(123, 456)
    start_date <- c("2023-12-15", "2024-04-01")
    end_date <- c("2024-01-02", "2024-06-01")
    value2 <- c(11, 28)
    
    key_df <- tibble(unique_keys, start_date, end_date, value2) %>% 
      mutate(start_date = ymd(start_date), 
             end_date = ymd(end_date))
    key_df
    #> # A tibble: 2 × 4
    #>   unique_keys start_date end_date   value2
    #>         <dbl> <date>     <date>      <dbl>
    #> 1         123 2023-12-15 2024-01-02     11
    #> 2         456 2024-04-01 2024-06-01     28
    
    
    # See https://r4ds.hadley.nz/joins#overlap-joins 
    
    data_df %>% 
      full_join( key_df, 
                 join_by(unique_keys,between(dates, start_date, end_date)))
    #> # A tibble: 4 × 6
    #>   dates      unique_keys value1 start_date end_date   value2
    #>   <date>           <dbl>  <dbl> <date>     <date>      <dbl>
    #> 1 2024-01-01         123     10 2023-12-15 2024-01-02     11
    #> 2 2024-03-27         456     30 NA         NA             NA
    #> 3 2024-05-15         789     20 NA         NA             NA
    #> 4 NA                 456     NA 2024-04-01 2024-06-01     28