
Joining Arrow tables in R without overflowing memory or exceeding Acero's "bytes of key data" limit

I'm working with big data, using R and Apache Arrow. My data is split between two datasets, call them vals and meta (they share an id column).

To aggregate vals down to a manageable size I need to first join with meta. I believe that doing this in a memory-efficient, piecewise manner is exactly what Arrow excels at, but I can't make it happen.

Here's my attempt:

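library(arrow)
library(dplyr)

# Open both datasets lazily; nothing is read into memory yet
vals = open_dataset(path_to_vals)
meta = open_dataset(path_to_meta)
# Join on id, then average the vals columns within each meta group
temp = left_join(vals, meta, by='id') %>%
  group_by(grouping_variables_from_meta) %>%
  summarise(across(all_of(variables_from_vals), mean))
# The query is only executed here
write_dataset(temp, 'some_path')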

As soon as I invoke write_dataset my memory usage balloons. It seems to me that Arrow is trying to do the join by holding both tables in memory. Even if I temporarily sidestep this problem by requesting absurd amounts of memory from the cluster (not a long-term solution), the process eventually fails with:

Error: Invalid: There are more than 2^32 bytes of key data. Acero cannot process a join of this magnitude.
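For scale, 2^32 bytes is 4 GiB. The snippet below is only a back-of-the-envelope sketch in base R: `estimate_key_bytes` is a hypothetical helper, and it assumes the key data is roughly the number of rows times a fixed key width, which may not match how Acero actually counts it.

```r
# The ceiling quoted in the error message, in bytes and in GiB.
key_limit_bytes <- 2^32
key_limit_gib <- key_limit_bytes / 1024^3

# Hypothetical helper (not part of arrow): rough key-data size for a
# fixed-width key column, assuming size = rows * bytes per key.
estimate_key_bytes <- function(n_rows, bytes_per_key) {
  n_rows * bytes_per_key
}

# Illustrative numbers only: 600 million rows with an 8-byte integer key.
example_bytes <- estimate_key_bytes(6e8, 8)
exceeds_limit <- example_bytes > key_limit_bytes
```

With these illustrative numbers the estimate is 4.8e9 bytes, which is above the limit, so `exceeds_limit` is TRUE.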

I have used Arrow to summarise much larger datasets in the past with a) very little memory usage and b) no errors about too much key data. On those occasions, however, I always had the grouping variables in a single dataset and didn't need to join them first. So I'm pretty sure the join step is where the problem lies.

Should I be doing this differently? Is there something I've misunderstood?


R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.7 (Green Obsidian)

Matrix products: default
BLAS:   /n/sw/helmod-rocky8/apps/Core/R/4.3.1-fasrc01/lib64/R/lib/libRblas.so 
LAPACK: /n/sw/helmod-rocky8/apps/Core/R/4.3.1-fasrc01/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0

time zone: America/New_York
tzcode source: system (glibc)

  • Because of the way Acero (the C++ execution engine used for joins) stores key data (using uint32_t), joins with more than 4 GB of key data are not supported at the moment.

    The error message you are seeing comes from a check that prevents data from being clobbered and broken results from being returned silently. However, that check appears to have some issues, which are being worked on in https://github.com/apache/arrow/pull/37709, so you could be affected even with less than 4 GB of key data.
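One possible way to stay under the limit in the meantime is to run the join on one slice of ids at a time and combine the partial results afterwards. The following is a minimal sketch, not a confirmed fix from the Arrow developers. `chunked_group_mean` is a hypothetical helper, and it assumes a numeric `id` in both datasets, a single grouping column `group` in meta, a single value column `x` in vals, and user-chosen `breaks` that cover every id.

```r
library(arrow)
library(dplyr)

# Hypothetical helper, not part of arrow. Column names `id`, `group` and `x`
# are assumptions made for this sketch.
chunked_group_mean <- function(vals, meta, breaks) {
  partials <- lapply(seq_len(length(breaks) - 1), function(i) {
    lo <- breaks[i]
    hi <- breaks[i + 1]
    # Each join only sees the ids in [lo, hi), so it handles less key data.
    vals %>%
      filter(id >= lo, id < hi) %>%
      left_join(filter(meta, id >= lo, id < hi), by = "id") %>%
      group_by(group) %>%
      summarise(total = sum(x), n = n()) %>%
      collect()
  })
  # Averaging per-chunk means would be wrong when chunk sizes differ, so
  # combine the per-chunk sums and counts instead.
  bind_rows(partials) %>%
    group_by(group) %>%
    summarise(mean_x = sum(total) / sum(n))
}
```

It could be called as `chunked_group_mean(open_dataset(path_to_vals), open_dataset(path_to_meta), breaks)`. Each chunk returns only one row per group, so the collected partial results should stay small as long as the number of groups is manageable.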