rdplyrmergefuzzyjoin

dplyr::full_join two data frames with part-match in the "by" argument in R


I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.

mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"), 
             Answer = c(1,2,3,4,5), 
             Avg = rep(-100, length(Item)))

mykey <- tibble(Item = c("ab", "cde"), 
            Avg = c(0 ,10))

The result should be the following:

  Item    Answer   Avg
1 ab_kssv      1     0
2 ab_kd        2     0
3 cde_kh       3    10
4 cde_ksa      4    10
5 cde          5    10

I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:

for (currLine in 1:nrow(mydata)) {
mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}

It does the job, but is not nice to read / understand and I wonder if there is a possibility to make the "by" argument of full_join() from the dplyr package a bit more tolerant with its matching. Any help will be apreciated!


Solution

  • Using a fuzzyjoin::regex_left_join you could do:

    Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by and added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.

    library(fuzzyjoin)
    library(dplyr)
    
    mykey <- mykey %>%
      rename(regex = Item) %>%
      mutate(regex = paste0("^", regex))
    
    mydata %>%
      select(-Avg) %>%
      regex_left_join(mykey, by = c(Item = "regex")) %>%
      select(-regex)
    #> # A tibble: 5 × 3
    #>   Item    Answer   Avg
    #>   <chr>    <dbl> <dbl>
    #> 1 ab_kssv      1     0
    #> 2 ab_kd        2     0
    #> 3 cde_kh       3    10
    #> 4 cde_ksa      4    10
    #> 5 cde          5    10