I have a larger than memory arrow dataset created by open_dataset() from partitioned parquet files that I need to use distinct(.keep_all = TRUE)
on. I need to keep the computation on disk, therefore I'm using arrow to speed things up and not crash my R session.
I want to keep rows that have distinct values in columns a and b.
Example
df <- tibble(a = c(1,1,2,2),
b = c(1,1,2,1),
c = c("x", "y", "z", "a")) %>%
arrow_table()
df %>%
distinct(a, b, .keep_all = TRUE)
This results in:
Error: distinct() with .keep_all = TRUE not supported in Arrow
Desired Output
An arrow dataset with the following values.
a b c
<dbl> <dbl> <chr>
1 1 1 x
2 2 2 z
3 2 1 a
I see others have had similar questions but it doesn't seem like arrow plans to incorporate use of .keep_all
(see closed issue).
base::duplicated()
would also work but it's not supported by arrow either. Any thoughts on how to work around this without using collect()
(which crashes my R session)? TIA!
Back with an answer after many trials and tribulations. You can use to_duckdb()
to use some dplyr verbs not supported by arrow.
df <- tibble(a = c(1,1,2,2),
b = c(1,1,2,1),
c = c("x", "y", "z", "a")) %>%
arrow_table() %>%
to_duckdb() %>% # swap to duckdb to use distinct
distinct(a, b, .keep_all = TRUE) %>%
to_arrow() # back to arrow
# Check that this returns an arrow object
class(df)
[1] "RecordBatchReader" "ArrowObject"
[3] "R6"
# Print table to make sure that distinct(.keep_all = TRUE) worked
df %>% collect()
# A tibble: 3 × 3
a b c
<dbl> <dbl> <chr>
1 1 1 x
2 2 1 a
3 2 2 z