
Partial string matching in R, then trimming the matched values


Here is a data frame and a vector.

library(dplyr)  # provides tibble(), mutate() and %>%
df1 <- tibble(var1 = c("abcd", "efgh", "ijkl", "mnopqr", "qrst"))
vec <- c("ab", "mnop", "ijk")

Now, for every value in var1 that closely matches a value in vec (I would like to match on the first n characters), I want to replace it with at most the first 3 characters of the matching vec value, so that the desired result is:

df2 <- tibble(var1 = c("ab", "efgh", "ijk", "mno", "qrst"))

Since "abcd" matches "ab" in vec most closely, we keep up to 3 characters of "ab" (only 2 in this case) in df2. "efgh" has no match in vec, so it stays as "efgh" in df2, and so on.

Can I use dplyr, stringr, fuzzyjoin, agrep, or fuzzywuzzyR to accomplish this? You may want to build on the following snippet, suggested at https://stackoverflow.com/a/51053674/6762788 (thanks to Psidom).

df1 %>% 
    mutate(var1 = ifelse(var1 %in% vec, substr(var1, 1, 3), var1))
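A quick check of why this snippet is only a starting point: `%in%` compares whole strings, so none of the values in var1 are found in vec and the `mutate()` call changes nothing. The base R sketch below contrasts it with a prefix test built from `startsWith()` and `outer()`; the variable names `exact` and `prefix` are illustrative only.

```r
var1 <- c("abcd", "efgh", "ijkl", "mnopqr", "qrst")
vec  <- c("ab", "mnop", "ijk")

# %in% tests exact equality, so no element of var1 is found in vec
exact <- var1 %in% vec
print(exact)

# startsWith() tests prefixes; outer() builds a var1-by-vec logical matrix
prefix <- outer(var1, vec, startsWith)

# TRUE for every var1 value that begins with at least one vec value
print(rowSums(prefix) > 0)
```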

Solution

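  • # df1 from the question, extended with a sixth value, "mnopr"
    df1 <- tibble(var1 = c("abcd", "efgh", "ijkl", "mnopqr", "qrst", "mnopr"))

    # adist(partial = TRUE) is 0 where a vec value occurs inside a var1 value;
    # `a` holds (row = index into vec, col = index into var1) for those pairs
    a <- which(adist(vec, df1$var1, partial = TRUE, ignore.case = TRUE) == 0, arr.ind = TRUE)

    # Replace each matched var1 with the first 3 characters of its vec match
    df1 %>%
      mutate(var1 = replace(var1, a[, 2], substr(vec[a[, 1]], 1, 3)))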
    # A tibble: 6 x 1
      var1 
      <chr>
    1 ab   
    2 efgh 
    3 ijk  
    4 mno  
    5 qrst 
    6 mno
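One caveat worth noting: `adist(..., partial = TRUE)` reports a distance of 0 when a vec value occurs anywhere inside a var1 value, not only at the start, so a value such as "xxab" would also match "ab". If only the leading characters should count, as the question describes, a prefix-only variant could look like the sketch below. It uses base R only; `trim_to_prefix` is a hypothetical helper name, and preferring the longest matching prefix is an assumption about how ties should be resolved. It also assumes there are no NA values.

```r
var1 <- c("abcd", "efgh", "ijkl", "mnopqr", "qrst", "mnopr")
vec  <- c("ab", "mnop", "ijk")

# trim_to_prefix() is a hypothetical helper, not part of any package.
# For each string in x: if it starts with one of `prefixes`, return the
# first n characters of that prefix; otherwise return the string unchanged.
trim_to_prefix <- function(x, prefixes, n = 3) {
  vapply(x, function(s) {
    hit <- prefixes[startsWith(s, prefixes)]
    if (length(hit) == 0) return(s)        # no prefix matches: keep as is
    best <- hit[which.max(nchar(hit))]      # assumption: longest prefix wins
    substr(best, 1, n)
  }, character(1), USE.NAMES = FALSE)
}

result <- trim_to_prefix(var1, vec)
print(result)
```

With dplyr loaded, the same helper can be dropped into the pipeline as `df1 %>% mutate(var1 = trim_to_prefix(var1, vec))`.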