rdataframeleft-joinmatchingfuzzyjoin

fuzzy_join of two dataframes based on ZIP codes


I am trying to fuzzyjoin two dataframes. Both contain the column with ZIP codes and some other columns. However, in the parental dataframe there are more ZIP codes than in the secondary one. I would like to match in particular based on the first 3 integer values of a ZIP code and then the closest numerical value of the ZIP. For that JaroWinker distance seems to be perfect.

I have tried to use the solution given here using comparator package.

adapted to my case as follows:

library(dplyr)

library(fuzzyjoin)

library(tidyverse)
library(comparator)

f1 <- tribble(
  ~colA, ~colB,
  3000,   1,
  3001,   2,
  3007,   3
)


df2 <- tribble(
  ~colA, ~colC,
  3000,   200,
  3004,   22,
  3012,   55
)


jw <- comparator::JaroWinkler()

df3 <- fuzzyjoin::fuzzy_left_join(
  x = df1, y = df2, by = "colA",
  match_fun = function(x, y) { jw(x, y) > 0.62}
) 

But what I obtain as the output is a df with 9 rows, but I just want a new df that would like this

df3 <- tribble(
  ~colA, ~colB, ~colC,
  3000,  1,  200,
  3001,  2,  200,
  3012, 3,   22
)

i.e. taking into account both the JW distance and also that 3001 is numerically closer to 3000 than to 3004 and that 3007 is closer to 3004 then to 3012.

Any hints how should I modify my code? Thank you!!


Solution

  • With fuzzyjoin::fuzzy_join() & co we can use custom matching functions. In this case we could aim for an exact match for the first digit(s) and numeric difference for the rest.

    There will likely be more than one match, but we could use numeric difference from matching function with slice_min() to keep just a single match for each df1$colA value.

    library(dplyr, warn.conflicts = FALSE)
    library(fuzzyjoin)
    
    df1 <- tribble(
      ~colA, ~colB,
      3000,   1,
      3001,   2,
      3007,   3,
      2820,   4,
    )
    
    df2 <- tribble(
      ~colA, ~colC,
      2999,   100,
      3000,   200,
      3004,   22,
      3012,   55,
    )
    
    # if match_fun returns a data.frame / tibble, it's first column is
    # used as match indicator; fuzzy_join adds extra columns to resulting frame
    zip_match <- function(x, y, match_digits = 1, max_num_dist = 999){
      tibble(
        start_match = stringi::stri_sub(x, to = match_digits) == stringi::stri_sub(y, to = match_digits),
        num_dist    = abs(as.numeric(stringi::stri_sub(x, from = match_digits + 1)) - 
                          as.numeric(stringi::stri_sub(y, from = match_digits + 1))),
        match = start_match & num_dist <= max_num_dist
      ) |> 
      select(match, start_match, num_dist)
    }
    

    Usage examples:

    # call match_fun with defaults (match_digits = 1, max_num_dist = 999),
    # for 4-digit codes only first digit must match, "2820" matches "2999" 
    fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match) |> 
      # keep only single closest match
      slice_min(order_by = num_dist, by = colA.x)
    #> # A tibble: 4 × 6
    #>   colA.x  colB colA.y  colC start_match num_dist
    #>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
    #> 1   3000     1   3000   200 TRUE               0
    #> 2   3001     2   3000   200 TRUE               1
    #> 3   3007     3   3004    22 TRUE               3
    #> 4   2820     4   2999   100 TRUE             179
    
    # match_digits = 2, "2820" does not match "2999"
    fuzzy_left_join(df1, df2, by = "colA", match_fun = zip_match, match_digits = 2) |> 
      slice_min(order_by = num_dist, by = colA.x)
    #> # A tibble: 4 × 6
    #>   colA.x  colB colA.y  colC start_match num_dist
    #>    <dbl> <dbl>  <dbl> <dbl> <lgl>          <dbl>
    #> 1   3000     1   3000   200 TRUE               0
    #> 2   3001     2   3000   200 TRUE               1
    #> 3   3007     3   3004    22 TRUE               3
    #> 4   2820     4     NA    NA NA                NA
    

    Created on 2025-01-18 with reprex v2.1.1


    1st revision with difference_join() for reference - https://stackoverflow.com/revisions/79367476/1