r string-matching agrep

match two vectors by similar characters/strings in R


I have two vectors, like

v1<-c("yellow", "red", "orange", "blue", "green")
v2<-c("blues", "redx", "grean")

and I want to match them, i.e., to "link" each element of v1 with the most similar element of v2, so that the result is

> df
      v1    v2
1 yellow  <NA>
2    red  redx
3 orange  <NA>
4   blue blues
5  green grean

The following code gives the expected result, but only because it has been manually tailored to do so:

df<-data.frame(v1,v2=rep(NA,5))

for (i in 1:nrow(df)) {
  
  ag<-agrep(df[i,1], v2, ignore.case = T, value = T)
  
  if (length(ag)==0) {df[i,2]<-NA}
  else if (length(ag)==1) {df[i,2]<-ag}
  else {df[i,2]<-ag[1]}
  
}

It turns out that agrep(df[2,1], v2, max.distance = 0.00001, ignore.case = T, value = T) returns "redx" "grean", even with such a small max.distance.

That's why I have the if conditions, but they don't guarantee that the most similar element is selected.

How can I overcome this issue?

Thank you in advance


Solution

  • You could try:

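    # Row/column indices of all pairs whose edit distance is at most 1
    s <- which(adist(v1, v2) <= 1, arr.ind = TRUE) # 1 is the maximum allowed edit distance
    # Start from a full-length NA vector so the result always has length(v1) rows
    data.frame(v1, v2 = replace(rep(NA, length(v1)), s[, 1], v2[s[, 2]]))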
          v1    v2
    1 yellow  <NA>
    2    red  redx
    3 orange  <NA>
    4   blue blues
    5  green grean
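
    The threshold approach above keeps every pair within the allowed distance, so if an element of v1 has several candidates in v2, the one that ends up in the result is not necessarily the closest. A minimal sketch of one way to pick the nearest candidate instead, assuming v2 is non-empty and contains no NA values. The helper name closest_match and its max_dist argument are made up for illustration and are not part of base R:

```r
v1 <- c("yellow", "red", "orange", "blue", "green")
v2 <- c("blues", "redx", "grean")

# Hypothetical helper: for each element of x, return the element of y with the
# smallest edit distance, or NA if even the closest one is farther than max_dist.
# Assumes y is non-empty and has no NA values.
closest_match <- function(x, y, max_dist = 1) {
  d <- adist(x, y, ignore.case = TRUE)       # length(x) by length(y) distance matrix
  best <- apply(d, 1, which.min)             # column of the smallest distance in each row
  best_dist <- d[cbind(seq_along(x), best)]  # the smallest distance itself
  ifelse(best_dist <= max_dist, y[best], NA_character_)
}

df <- data.frame(v1, v2 = closest_match(v1, v2))
df
```

    Ties are resolved in favour of the first candidate in v2, because which.min returns the first minimum. Raising max_dist loosens the match, and setting it to Inf always returns the nearest element.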