rjoinmergefuzzyjoin

How to merge based on a string in a column?


I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

My preferred output would be the following:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df3 <- data.frame(year, state, name, versus)

I've tried variations of the following:

library(fuzzyjoin)

stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")

stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")

And they don't seem to get close to what I want.

I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.


Solution

  • A simple approach, which depends somewhat on the structure of df2$versus, would be this:

    library(dplyr)
    
    left_join(df1,df2, by=c("year","state")) %>% 
      rowwise() %>% 
      mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
    

    Output:

      year  state name   versus           
      <chr> <chr> <chr>  <chr>            
    1 2002  TN    George @ george v. SALLY
    2 2002  TN    Sally  @ george v. SALLY
    3 1999  AL    David  @laura v. dAvid  
    4 1999  AL    Laura  @laura v. dAvid  
    5 1997  CA    John   NA               
    6 2002  TN    Kate   NA             
    

    Update/Jul 14 2022:

    If name has more complicated pattern, rather than a single word (say Molly Home, Jane Doe), we need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:

    1. Create function (f(n,v)), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. Returns TRUE if this count exceeds 0
    f <- function(n,v) {
      wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
      sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
    }
    

    Left join the original frames, and apply f() by row

    left_join(df1,df2, by=c("year","state")) %>% 
      rowwise() %>% 
      mutate(versus:=if_else(f(name, versus), versus,NA_character_))
    

    Output:

    1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
    2 2002  TN    Sally                 NA              
    3 1999  AL    David                 @laura v. dAvid 
    4 1999  AL    Laura                 @laura v. dAvid 
    5 1997  CA    John                  NA              
    6 2002  TN    Kate                  NA              
    

    Input:

    df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
    "2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
    "Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
    -6L))
    
    df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
    ), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
    -2L))