I have two dataframes, both have a Last_Name
column. First dataframe has a column Contains_First_Name
and the second has a column called First_Name
. I want to join the two on the exact spelling of Last_Name
and a substring match of Contains_First_Name
and First_Name
(where First_Name
is a substring of Contains_First_Name
.) Please see an example below.
library(dplyr)
library(stringr)
# Create df1
Last_Name <- c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee")
Contains_First_Name <- c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick", "McAdams Jennifer Marie", "Kirk", "Kirk B")
Account_Number <- c("123", "345", "678", "901", "234", "567", "890")
df1 <- data.frame(Last_Name, Contains_First_Name, Account_Number)
# Create df2
Last_Name <- c("Smith", "Jones", "Adams", "Lee", "Lee")
First_Name <- c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")
df2 <- data.frame(Last_Name, First_Name)
Resulting dataframes:
> df1
Last_Name Contains_First_Name Account_Number
1 Smith Kimberly Nicole 123
2 Jones Patrick L 345
3 Adams Johnson Ann 678
4 Rogers Rick 901
5 Lee McAdams Jennifer Marie 234
6 Lee Kirk 567
7 Lee Kirk B 890
> df2
Last_Name First_Name
1 Smith Kimberly
2 Jones Patrick
3 Adams Ann
4 Lee Jennifer
5 Lee Kirk
What I want to end up with:
> df3
Last_Name Contains_First_Name Account_Number First_Name
1 Smith Kimberly Nicole 123 Kimberly
2 Jones Patrick L 345 Patrick
3 Adams Johnson Ann 678 Ann
4 Lee McAdams Jennifer Marie 234 Jennifer
5 Lee Kirk 567 Kirk
6 Lee Kirk B 890 Kirk
I tried this:
df3 <-
filter(df1,
Last_Name %in% df2$Last_Name,
str_detect(Contains_First_Name, paste(df2$First_Name, collapse = "|")))
Getting the following error:
Error in match.arg(method) : 'arg' must be NULL or a character vector
I also explored fuzzyjoin
library but could not figure out how to join on two variables with two different join types (exact and substring.) I saw a similar question which does not appear to have an answer: Merge two dataframes based on an exact match in one column and match within an error in another column in R.
Any advice is greatly appreciated. Thank you.
I'd say you have two options: either use an equi-join on only the first column and filter later or use fuzzyjoin
, as you described:
# Approach 1: Match all, filter later
inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |>
filter(str_detect(Contains_First_Name, First_Name))
#> # A tibble: 6 × 4
#> Last_Name Contains_First_Name Account_Number First_Name
#> <chr> <chr> <chr> <chr>
#> 1 Smith Kimberly Nicole 123 Kimberly
#> 2 Jones Patrick L 345 Patrick
#> 3 Adams Johnson Ann 678 Ann
#> 4 Lee McAdams Jennifer Marie 234 Jennifer
#> 5 Lee Kirk 567 Kirk
#> 6 Lee Kirk B 890 Kirk
# Approach 2: fuzzyjoin
fuzzyjoin::fuzzy_inner_join(
df1,
df2,
by = c("Last_Name" = "Last_Name", "Contains_First_Name" = "First_Name"),
match_fun = list(`==`, \(x, y) str_detect(x, y))
) |>
select(!Last_Name.y) |>
rename(Last_Name = Last_Name.x)
#> # A tibble: 6 × 4
#> Last_Name Contains_First_Name Account_Number First_Name
#> <chr> <chr> <chr> <chr>
#> 1 Smith Kimberly Nicole 123 Kimberly
#> 2 Jones Patrick L 345 Patrick
#> 3 Adams Johnson Ann 678 Ann
#> 4 Lee McAdams Jennifer Marie 234 Jennifer
#> 5 Lee Kirk 567 Kirk
#> 6 Lee Kirk B 890 Kirk
Created on 2024-01-06 with reprex v2.0.2