rmergestring-matchingfuzzy-searchfuzzy-comparison

merge two data.frame using a column with the same strings but in different order


I am trying to merge two data.frames using a column that contains strings. The strings in the two columns are names, unfortunately, they are not in the same order. In the example below, names in df_1 have the structure "name"+"midname"+"surname1"+"surname2" while in df_2 the structure is "surname1"+"surname2"+"name"+"midname".

I first tried to do a fuzzy merge using the names. However, it doesn't solve the problem since there are still non-zero matches between totally different names. Additionally, it is non-trivial to define a cutting point that can define when a name is totally different from another. I was also expecting a higher degree of similarity between names with reverse order (i.e., (name+midname) + (surname1+surname2) in a different order).

Do you have a better way to merge the two data.frame using these names in a different order? Thanks in advance.

# "name"+"midname"+"surname1"+"surname2
df_1<- read.table(header = T,sep = "\t", text = "
name
Tetsurō Shoyo Hinata Kuroo
Kōtarō Tobio Kageyama Bokuto
Wakatoshi Daichi Sawamura Ushijima
Tōru  Tsukishima Oikawa
Yūji  Azumane Terushima
Kenma Kozume
")
# "surname1"+"surname2"+"name"+"midname".
df_2<- read.table(header = T,sep = "\t", text = "
name
Hinata Kuroo Tetsurō Shoyo
Kageyama Bokuto Kōtarō Tobio
Sawamura Ushijima Wakatoshi Daichi
Tsukishima Oikawa Tōru 
Azumane Terushima Yūji 
Kiyoomi Sakusa
")
library(fuzzyjoin); library(dplyr);
stringdist_join(df_1, df_2, 
                by = "name",
                mode = "inner",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(name.x) %>%
  slice_min(order_by = dist, n = 1)

Result

# A tibble: 6 x 3
# Groups:   name.x [6]
name.x                          name.y                           dist
<chr>                           <chr>                           <dbl>
1 Kenma Kozume                    "Azumane Terushima Yuji "       0.416
2 Kotaro Tobio Kageyama Bokuto    "Kageyama Bokuto Kotaro Tobio"  0.241
3 Tetsuro Shoyo Hinata Kuroo      "Kageyama Bokuto Kotaro Tobio"  0.351
4 Toru  Tsukishima Oikawa         "Tsukishima Oikawa Toru "       0.302
5 Wakatoshi Daichi Sawamura Ushi~ "Sawamura Ushijima Wakatoshi D~ 0.366
6 Yuji  Azumane Terushima         "Azumane Terushima Yuji "       0.283

Solution

  • You can strsplit to individual names, sort them and paste. Then use match.

    x <- sapply(strsplit(df_1$name, " +"), function(x) paste(sort(x), collapse = " "))
    y <- sapply(strsplit(df_2$name, " +"), function(x) paste(sort(x), collapse = " "))
    
    cbind(df_1$name, df_2$name[match(x, y)])
    #     [,1]                                 [,2]                                
    #[1,] "Tetsurō Shoyo Hinata Kuroo"         "Hinata Kuroo Tetsurō Shoyo"        
    #[2,] "Kōtarō Tobio Kageyama Bokuto"       "Kageyama Bokuto Kōtarō Tobio"      
    #[3,] "Wakatoshi Daichi Sawamura Ushijima" "Sawamura Ushijima Wakatoshi Daichi"
    #[4,] "Tōru  Tsukishima Oikawa"            "Tsukishima Oikawa Tōru "           
    #[5,] "Yūji  Azumane Terushima"            "Azumane Terushima Yūji "           
    #[6,] "Kenma Kozume"                       NA