rstringdataframereplacefuzzy-search

How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?


My Google-fu must be failing as I feel someone must've had this problem before and I can't find a solution.

Let's say I have the following data:

df1 <- data.frame(Name = c("Banana", "Apple", "Sponge", "Donut", "Bleach", "Wine"),
                 Value = c("Fruit", "Fruit", "Cleaner", "Dessert", "Cleaner", "Party"))

df2 <- data.frame(Name = c("Ban", "Ap", "Do", "Wi"),
                  Value = c("F", "F", "D", "P"))

I need to be able to do a fuzzy string match on df1$Name with df2$Name, and where they match replace df1$Value with df2$Value where $Name fuzzy matches. If there is no match, I want to retain the value from df1. Such that, my output would look like this:

Name Value
Banana F
Apple F
Sponge Cleaner
Donut D
Bleach Cleaner
Wine P

I want anything starting with "Ban" to be F. I don't care if it says Bananakin or anything after df2$Name. The number of rows in df1 is in the thousands, df2 is merely 67.

Further, df1 has additional columns that aren't relevant but I would like to keep.

The closest code I've found to what I want is this:

df3 <- df1 %>% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value .,df2$Name)))

However I get the following error:

Error in mutate(): ℹ In argument: across(...). Caused by error in str_detect(): ! Can't recycle string (size 1362) to match pattern (size 67). Run rlang::last_trace() to see where the error occurred.

I get the same error for this bit of code as well:

df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)

I've also tried:

df1 <- df1 %>% mutate(across(everything(), ~deframe(df2[.])))

What am I missing? I have a feeling it's something simple I'm just not seeing. I'm relatively new to R, and would greatly appreciate the help!


Solution

  • df1 %>%
      mutate(Name1 = Name %>%
               str_replace_all(set_names(df2$Value, str_c(df2$Name, ".*"))),
             Value = ifelse(Name == Name1, Value, Name1),
             Name1 = NULL)
    
       Name   Value
    1 Banana       F
    2  Apple       F
    3 Sponge Cleaner
    4  Donut       D
    5 Bleach Cleaner
    6   Wine       P
    

    fuzzyjoin::regex_left_join(df1, df2, 'Name') %>%
       reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))
    
        Name   Value
    1 Banana       F
    2  Apple       F
    3 Sponge Cleaner
    4  Donut       D
    5 Bleach Cleaner
    6   Wine       P