I'm trying to match two datasets that have a common unique ID. One has a date and the other has date ranges. How do I get values from one into the other where the unique ID matches and the date is within the date range?
I tried case_when and it does work IF they are the same size, but the two datasets are not, so I'm stumped on what to do.
It doesn't necessarily have to be a tidyverse solution
library(tidyverse)
dates <- c("2024-01-01", "2024-03-27", "2024-05-15")
unique_keys <- c(123, 456, 789)
value1 <- c(10, 30, 20)
data_df <- tibble(dates, unique_keys, value1)
data_df
unique_keys <- c(123, 456)
start_date <- c("2023-12-15", "2024-04-01")
end_date <- c("2024-01-02", "2024-06-01")
value2 <- c(11, 28)
key_df <- tibble(unique_keys, start_date, end_date, value2)
key_df
# Goal:
# Take value2 from the key_df and bring it into data_df when both the unique_key matches and the date is within the range
# In the example dataset:
# data_df row1 matches up with key_df row 1, so I want a new column in data_df that shows 11
# data_df row2 unique_keys matches with key_df row 2, BUT the date is wrong. Can just show NA
data_df %>%
mutate(value2 = case_when(
data_df$unique_keys == key_df$unique_keys & data_df$dates >= key_df$start_date & data_df$dates <= key_df$end_date ~ key_df$value2
))
You are probably looking for an overlap joins (see R4ds). From your example, I changed dates expressed in characters to date representations with lubridate::ymd()
and used a full_join()
with join_by(unique_keys,between(dates, start_date, end_date))
:
data_df %>%
full_join( key_df,
join_by(unique_keys,between(dates, start_date, end_date)))
You might prefer an inner_join()
or a left_join()
Full reprex:
library(tidyverse)
dates <- c("2024-01-01", "2024-03-27", "2024-05-15")
unique_keys <- c(123, 456, 789)
value1 <- c(10, 30, 20)
data_df <- tibble(dates, unique_keys, value1) %>%
mutate(dates = ymd(dates),
dates = ymd(dates))
data_df
#> # A tibble: 3 × 3
#> dates unique_keys value1
#> <date> <dbl> <dbl>
#> 1 2024-01-01 123 10
#> 2 2024-03-27 456 30
#> 3 2024-05-15 789 20
unique_keys <- c(123, 456)
start_date <- c("2023-12-15", "2024-04-01")
end_date <- c("2024-01-02", "2024-06-01")
value2 <- c(11, 28)
key_df <- tibble(unique_keys, start_date, end_date, value2) %>%
mutate(start_date = ymd(start_date),
end_date = ymd(end_date))
key_df
#> # A tibble: 2 × 4
#> unique_keys start_date end_date value2
#> <dbl> <date> <date> <dbl>
#> 1 123 2023-12-15 2024-01-02 11
#> 2 456 2024-04-01 2024-06-01 28
# See https://r4ds.hadley.nz/joins#overlap-joins
data_df %>%
full_join( key_df,
join_by(unique_keys,between(dates, start_date, end_date)))
#> # A tibble: 4 × 6
#> dates unique_keys value1 start_date end_date value2
#> <date> <dbl> <dbl> <date> <date> <dbl>
#> 1 2024-01-01 123 10 2023-12-15 2024-01-02 11
#> 2 2024-03-27 456 30 NA NA NA
#> 3 2024-05-15 789 20 NA NA NA
#> 4 NA 456 NA 2024-04-01 2024-06-01 28