sqlrjoindata-manipulationfuzzy-logic

R: "Fuzzy Match" and "Between" Statements


I am working with the R Programming Language. I have the following tables (note: all variables appear as "Factors"):

table_1 = data.frame(id = c("123", "123", "125", "C125-B"), 
date_1 = c("2010-01-31","2010-01-31", "2016-01-31", "2018-01-31" ))

table_1$id = as.factor(table_1$id)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id = c("5123", "123 A", "125", "125"), 
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id = as.factor(table_2$id)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)


> table_1
      id     date_1
1    123 2010-01-31
2    123 2010-01-31
3    125 2016-01-31
4 C125-B 2018-01-31

 table_2
     id     date_2     date_3
1  5123 2009-01-31 2011-01-31
2 123 A 2010-01-31 2010-01-31
3   125 2010-01-31 2020-01-31
4   125 2010-01-31 2020-01-31

I am trying to "join" (e.g. inner join) this tables on the following conditions:

1) if table_1$id "fuzzy equal" table_2$id

AND

2) if table_1$date BETWEEN(table_2$date_2,table_2$date_3)

I tried to write the following code in R to do this:

library(fuzzyjoin)
stringdist_inner_join(table_1, table_2,
                      by ="id", distance_col = NULL)

Question: But I am not sure if the stringdist_inner_join function can accommodate this kind of "between" logic.

Can someone please show me how to do this? Are there any other methods to accomplish this in R?

Thanks!


Solution

  • How about this? We could do the stringdist_inner_join and filter afterwards if the dates are stored as dates. This should be plenty performant for most data, and if not you should probably use data.table instead of fuzzyjoin.

    library(fuzzyjoin)
    library(dplyr)
    table_1$date_1 = as.Date(table_1$date_1)
    table_2$date_2 = as.Date(table_2$date_2)
    table_2$date_3 = as.Date(table_2$date_3)
    stringdist_inner_join(table_1, table_2, by = "id", max_dist = 2) %>%
      filter(date_1 >= date_2, date_1 <= date_3)
    
    
       id.x     date_1  id.y     date_2     date_3
    1   123 2010-01-31  5123 2009-01-31 2011-01-31
    2   123 2010-01-31 123 A 2010-01-31 2010-01-31
    3   123 2010-01-31   125 2010-01-31 2020-01-31
    4   123 2010-01-31   125 2010-01-31 2020-01-31
    5   123 2010-01-31  5123 2009-01-31 2011-01-31
    6   123 2010-01-31 123 A 2010-01-31 2010-01-31
    7   123 2010-01-31   125 2010-01-31 2020-01-31
    8   123 2010-01-31   125 2010-01-31 2020-01-31
    9   125 2016-01-31   125 2010-01-31 2020-01-31
    10  125 2016-01-31   125 2010-01-31 2020-01-31