rdplyrapache-arrow

Alternatives for distinct(.keep_all = TRUE) in arrow?


I have a larger than memory arrow dataset created by open_dataset() from partitioned parquet files that I need to use distinct(.keep_all = TRUE) on. I need to keep the computation on disk, therefore I'm using arrow to speed things up and not crash my R session.

I want to keep rows that have distinct values in columns a and b.

Example

df <- tibble(a = c(1,1,2,2),
             b = c(1,1,2,1),
             c = c("x", "y", "z", "a")) %>%
  arrow_table()


df %>% 
  distinct(a, b, .keep_all = TRUE)

This results in: Error: distinct() with .keep_all = TRUE not supported in Arrow

Desired Output

An arrow dataset with the following values.

  a     b c    
  <dbl> <dbl> <chr>
1     1     1 x    
2     2     2 z    
3     2     1 a  

I see others have had similar questions but it doesn't seem like arrow plans to incorporate use of .keep_all (see closed issue).

base::duplicated() would also work but it's not supported by arrow either. Any thoughts on how to work around this without using collect() (which crashes my R session)? TIA!


Solution

  • Back with an answer after many trials and tribulations. You can use to_duckdb() to use some dplyr verbs not supported by arrow.

    df <- tibble(a = c(1,1,2,2),
                 b = c(1,1,2,1),
                 c = c("x", "y", "z", "a")) %>%
     arrow_table() %>%
     to_duckdb() %>% # swap to duckdb to use distinct
     distinct(a, b, .keep_all = TRUE) %>%
    to_arrow() # back to arrow
    
    # Check that this returns an arrow object
    class(df)
    
    [1] "RecordBatchReader" "ArrowObject"      
    [3] "R6"  
    
    # Print table to make sure that distinct(.keep_all = TRUE) worked
    df %>% collect()
    
    # A tibble: 3 × 3
          a     b c    
      <dbl> <dbl> <chr>
    1     1     1 x    
    2     2     1 a    
    3     2     2 z