How to merge based on a string in a column?

I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

My preferred output would be the following:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df3 <- data.frame(year, state, name, versus)

I've tried variations of the following:

library(fuzzyjoin)

stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")

stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")

And they don't seem to get close to what I want.

I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.

Solution

A simple approach, which depends somewhat on the structure of df2$versus, would be this:

library(dplyr)

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))

Output:

  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA

Update/Jul 14 2022:

If name has more complicated pattern, rather than a single word (say Molly Home, Jane Doe), we need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:

Create function (f(n,v)), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. Returns TRUE if this count exceeds 0

f <- function(n,v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

Left join the original frames, and apply f() by row

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(f(name, versus), versus,NA_character_))

Output:

1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA

Input:

df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
-6L))

df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
-2L))