rtidyversepurrr

How to most efficiently filter a dataframe conditionally of values in another one, in the tidyverse framework?


I have a dataframe df1 with an ID column and a lubridate time interval column, and I want to filter (subsample) a dataframe df2, which has ID and DateTime columns, so that only df2 rows with DateTime fitting the corresponding ID interval in df1 are kept. I want to do so in a tidyverse framework.

It can easily be done using a join (see example below), but I would like to know whether there would be a more direct solution (maybe purrr-based) that would avoid joining and then removing the time-interval data from the second dataframe. Thanks.

The question posted here Merge two dataframes if timestamp of x is within time interval of y is close to the one asked here but proposed solution were similar to the one I developed and not in a tidyverse framework.

A minimal code to show the problem and my current solution:

library(tibble)  
library(lubridate)

df1 <- tribble(
  ~ID, ~Date1, ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25") 
df1 <- mutate(df1,Interval = interval(ymd(Date1),ymd(Date2)))

df2 <- tribble(
  ~ID, ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12")
df2 <- mutate(df2,DateTime=ymd(DateTime)) 

df1 looks like this

> df1
# A tibble: 2 x 4
  ID    Date1      Date2      Interval                      
  <chr> <chr>      <chr>      <S4: Interval>                
1 ID1   2018-04-16 2018-06-14 2018-04-16 UTC--2018-06-14 UTC
2 ID2   2018-04-20 2018-06-25 2018-04-20 UTC--2018-06-25 UTC

and df2 like this:

> df2
# A tibble: 4 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-04-12
2 ID1   2018-05-05
3 ID2   2018-04-23
4 ID2   2018-07-12

In df2, the second record for ID1 is not within the ID1 interval in df1. The second record for ID2 is also not within the ID2 interval in df1.

My current solution based on joining and the removing the joined column follows:

df_out <- df2 %>%
  left_join(.,df1,by="ID") %>%
  filter(.,DateTime %within% Interval) %>%
  select(.,-Interval)

> df_out
# A tibble: 2 x 4
  ID    DateTime   Date1      Date2     
  <chr> <date>     <chr>      <chr>     
1 ID1   2018-05-05 2018-04-16 2018-06-14
2 ID2   2018-04-23 2018-04-20 2018-06-25

I have the feeling a tidyverse alternative that would avoid joining and then removing the Interval column should exist.


Solution

  • There is a package called fuzzyjoin that can do a semi_join based on an interval. Semi join means that it filters the "left" dataframe depending on match to the "right" dataframe. Try:

    library(fuzzyjoin)
    df2 %>% 
      fuzzy_semi_join(df1, 
                      by=c("DateTime"="Date1", "DateTime"="Date2"),
                      match_fun=list(`>=`, `<=`))
    

    Which gives the result:

    # A tibble: 2 x 2
      ID    DateTime  
      <chr> <date>    
    1 ID1   2018-05-05
    2 ID2   2018-04-23