I'm trying to join two tables. The smaller one has a column of common food names (e.g. "Corn", "Peppers", "Squash", etc.), and the larger one has more specific food names (e.g. "Sweet Corn", "Red Corn", "Baby Corn", "Zucchini Squash", "Orange Squash", "Squash , Large", etc.). I'd like to join them on a "fuzzy" match between the food names and, for each name in the smaller table, pull the most frequent code (the mode) from another column of the larger table into a new column in the smaller table.
Here is an example of the smaller table:
Food Name | Food Code |
---|---|
Corn | NA |
Squash | NA |
Peppers | NA |
Here is an example of the larger table:
Food Name | Food Code |
---|---|
Sweet Corn | 532 |
Red Corn | 532 |
Baby Corns | 944 |
Squash | 111 |
Long Squash | 123 |
Red Pepper | 654 |
Green Pepper | 655 |
Red Peppers | 654 |
I've tried the "left_join" function from the dplyr package, but it doesn't seem to support this kind of "fuzzy" string matching. I believe the tidyverse also has a way to find the mode of grouped variables, and I was hoping to use that, but I'm unsure how to incorporate it into the left_join statement. I also came across the fuzzyjoin package for R, but I'm not certain it is the best option.
My desired output would look like:
Food Name | Food Code |
---|---|
Corn | 532 |
Squash | 111 |
Peppers | 654 |
I hope this helps you.
In `stringdist_join`, the `max_dist` argument is used to constrain the degree of fuzziness.
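To get a feel for what a given `max_dist` lets through, it can help to look at the raw distances first. The sketch below is only an illustration: it assumes the stringdist package (which fuzzyjoin relies on) is installed, uses its default `q = 1` for the cosine method, and lower-cases the strings by hand to mimic `ignore_case = TRUE`. The `candidates` vector is just a hand-picked subset of the larger table.

```r
library(stringdist)

# A few names from the larger table, compared against 'Corn'
candidates <- c('Sweet Corn', 'Baby Corns', 'Long Squash', 'Red Pepper')

# Cosine distance on single-character profiles (q = 1 is the stringdist default).
# Lower-casing by hand mirrors ignore_case = TRUE in stringdist_join.
dists <- stringdist(tolower('Corn'), tolower(candidates), method = 'cosine')

# Pair each candidate with its distance and flag whether it passes max_dist = 0.5
data.frame(candidate = candidates,
           dist = round(dists, 3),
           within_0.5 = dists <= 0.5)
```

Lowering `max_dist` makes the join stricter, while raising it lets in looser matches, so checking a handful of pairs like this is a cheap way to pick a threshold for your own data.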
```r
library(fuzzyjoin)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(knitr)

small_tab = data.frame(Food.Name = c('Corn', 'Squash', 'Peppers'),
                       Food.Code = c(NA, NA, NA))

large_tab = data.frame(Food.Name = c('Sweet Corn', 'Red Corn', 'Baby Corns',
                                     'Squash', 'Long Squash', 'Red Pepper',
                                     'Green Pepper', 'Red Peppers'),
                       Food.Code = c(532, 532, 944, 111, 123, 654, 655, 654))

joined_tab = stringdist_join(small_tab, large_tab, by = 'Food.Name',
                             ignore_case = TRUE, method = 'cosine',
                             max_dist = 0.5, distance_col = 'dist') %>%
  # Tidy columns
  select(Food.Name = Food.Name.x, -Food.Name.y,
         Food.Code = Food.Code.y, -dist) %>%
  # Only keep most frequent food code per food name
  # (on a tie, which.max keeps the first row, i.e. the lowest code)
  group_by(Food.Name) %>% count(Food.Name, Food.Code) %>%
  slice(which.max(n)) %>% select(-n) %>%
  # Order food names as in the small table
  arrange(factor(Food.Name, levels = small_tab$Food.Name))

# Show table with columns renamed
joined_tab %>%
  rename('Food Name' = Food.Name,
         'Food Code' = Food.Code) %>%
  kable()
```
Food Name | Food Code |
---|---|
Corn | 532 |
Squash | 111 |
Peppers | 654 |
Created on 2023-05-31 with reprex v2.0.2
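If some names in the smaller table might have no fuzzy match at all, a left-join variant may be closer to what you want, since `stringdist_join` defaults to an inner join and drops such rows. The following is a rough sketch rather than a tested recipe for your real data: it uses fuzzyjoin's `stringdist_left_join`, and the 'Kale' row is a made-up item added only to show what happens to an unmatched name.

```r
library(fuzzyjoin)
library(dplyr)

# 'Kale' is a hypothetical item with no counterpart in the larger table
small_tab <- data.frame(Food.Name = c('Corn', 'Squash', 'Kale'),
                        stringsAsFactors = FALSE)
large_tab <- data.frame(Food.Name = c('Sweet Corn', 'Red Corn', 'Baby Corns',
                                      'Squash', 'Long Squash'),
                        Food.Code = c(532, 532, 944, 111, 123),
                        stringsAsFactors = FALSE)

left_tab <- stringdist_left_join(small_tab, large_tab, by = 'Food.Name',
                                 ignore_case = TRUE, method = 'cosine',
                                 max_dist = 0.5) %>%
  # Count how often each code occurs per name from the small table
  count(Food.Name = Food.Name.x, Food.Code) %>%
  # Keep the most frequent code per name (first one on a tie)
  group_by(Food.Name) %>%
  slice(which.max(n)) %>%
  ungroup() %>%
  select(-n)

left_tab
```

With this variant, a name that finds no match stays in the result with `NA` as its food code instead of disappearing.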