rstringdplyrfuzzyjoin

Table joins with conditional "fuzzy" string matching in R


I'm attempting to join two tables, one is a smaller table with a column of names of common food items (e.g. "Corn", "Peppers", "Squash"...etc...), and the other is a larger table with specific food names (e.g. "Sweet Corn", "Red Corn", "Baby Corn", "Zucchini Squash", "Orange Squash", "Squash , Large"...etc...). I'm trying to join based on a "fuzzy" condition in which I specify to join on the food names and pull the most frequent code in another column of the larger table (the mode) into a new column in the smaller table.

Here is an example of the smaller table:

Food Name Food Code
Corn NA
Squash NA
Peppers NA

Here is an example of the larger table:

Food Name Food Code
Sweet Corn 532
Red Corn 532
Baby Corns 944
Squash 111
Long Squash 123
Red Pepper 654
Green Pepper 655
Red Peppers 654

I've tried the "left_join" function from the dplyr package, but this doesn't seem to work that well with the "fuzzy" string join feature. I know that the tidyverse also has a function to find the mode of grouped variables and I was hoping to use that function, but I am unsure how to incorporate that into the left_join statement. I also discovered the fuzzyjoin package in R, but I am not certain if this is the best option.

My desired output would look like:

Food Name Food Code
Corn 532
Squash 111
Peppers 654

Solution

  • I hope this helps you.

    In stringdist_join, the max_dist argument is used to constrain the degree of fuzziness.

    library(fuzzyjoin)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(knitr)
    
    
    small_tab = data.frame(Food.Name = c('Corn', 'Squash', 'Peppers'), 
                           Food.Code = c(NA, NA, NA))
    
    
    large_tab = data.frame(Food.Name = c('Sweet Corn', 'Red Corn', 'Baby Corns', 
                                         'Squash', 'Long Squash', 'Red Pepper', 
                                         'Green Pepper', 'Red Peppers'), 
                           Food.Code = c(532, 532, 944, 111, 123, 654, 655, 654))
    
    joined_tab = stringdist_join(small_tab, large_tab, by = 'Food.Name',
                                 ignore_case = TRUE, method = 'cosine', 
                                 max_dist = 0.5, distance_col = 'dist') %>%
      
      # Tidy columns 
      select(Food.Name = Food.Name.x, -Food.Name.y, 
             Food.Code = Food.Code.y, -dist) %>%
      
      # Only keep most frequent food code per food name
      group_by(Food.Name) %>% count(Food.Name, Food.Code) %>% 
      slice(which.max(n)) %>% select(-n) %>%
      
      # Order food names as in the small table
      arrange(factor(Food.Name, levels = small_tab$Food.Name))
    
    # Show table with columns renamed
    joined_tab %>%
      rename('Food Name' = Food.Name, 
             'Food Code' = Food.Code) %>%
      kable()
    
    Food Name Food Code
    Corn 532
    Squash 111
    Peppers 654

    Created on 2023-05-31 with reprex v2.0.2