I would like to compare two mixed-type data frames and return the rows that are different between them--but I would like numeric values to only be returned within a certain percentage.
tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
var3 = c(1, 2, 3, 4, 5),
var4 = c('yes', 'no', 'yes', 'yes', 'no'))
tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
var3 = c(2, 3, 1.5, 10, 4.1),
var4 = c('no', 'yes', 'yes', 'no', 'yes'))
I know there is dplyr::anti_join
but that checks for exact matches. So if I was OK with numeric values that were within 20%, then the function would be something like:
tbl1 %>%
antijoin_function(tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'),
pct = 0.2)
And return
var1 | var2 | var3 | var4 |
---|---|---|---|
r1 | apple | 1 | yes |
r5 | lime | 5 | no |
The row with strawberry
would not be returned because the single difference in var3
is less than 20%.
Are there any functions or packages that do this?
library(dplyr)
full_join(tbl1, tbl2, by = c("var2" = "var2"), suffix = c("", ".right")) %>%
filter(abs(var3 - var3.right)/var3 > 0.2 | if_all(contains(".right"), ~ is.na(.))) %>%
select(-contains(".right"))
#> # A tibble: 2 × 4
#> var1 var2 var3 var4
#> <chr> <chr> <dbl> <chr>
#> 1 r1 apple 1 yes
#> 2 r5 lime 5 no
Created on 2023-05-22 with reprex v2.0.2