I am analysing a large dataset saved as a hive-partitioned set of Parquet files using the Apache Arrow package. The whole dataset is too large to load into memory, so I would like to process it in chunks using a map function (and eventually future_map, so that I can run it in parallel). Unfortunately, I get an error message whenever I try to use the Arrow dataset functions inside a map call.
Below is a reprex using the diamonds dataset.
library(arrow)
library(tidyverse)
library(duckdb)
# Load the diamonds dataset and get its distinct colours
data(diamonds)
colors = diamonds$color %>% unique()
# Write the data to a hive dataset
diamonds %>%
  write_dataset(path = 'diamonds', partitioning = c('color', 'clarity'))
rm(diamonds)
# Load it back in with Arrow and convert to DuckDB
diamonds = open_dataset('diamonds') %>%
  to_duckdb()
# Check it works (prints a table filtered to E)
diamonds %>% filter(color == "E")
# Get a separate table for each colour
colors %>%
  map(function(x) {
    diamonds %>%
      filter(color == x) %>%
      collect()
  })
This gives the following error message:
Error in `map()`:
ℹ In index: 1.
Caused by error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_execute: Failed to run query
Error: Conversion Error: Could not convert string 'D' to DOUBLE
LINE 3: WHERE (color = x)
If I run it without converting to DuckDB, I get a similar error:
Error in `map()`:
ℹ In index: 1.
Caused by error in `compute.arrow_dplyr_query()`:
! NotImplemented: Function 'equal' has no kernel matching input types (string, double)
Any ideas how I can overcome this issue?
x is the name of a column in diamonds (the dataset includes the numeric columns x, y, and z), so inside the query filter(color == x) compares color against that column rather than against your loop variable, which is why you see the string-to-DOUBLE conversion error. Use .env$x or !!x to force x to be looked up in the calling environment.
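For example, here is the map call from the reprex with that fix applied; .env$x and !!x are interchangeable for a simple scalar like this, so this is just one minimal sketch:

colors %>%
  map(function(x) {
    diamonds %>%
      # .env$x makes dplyr resolve x in the calling environment,
      # not as a column of the lazy diamonds table
      filter(color == .env$x) %>%
      collect()
  })

The !!x form instead injects the value of x when the expression is captured, which has the same effect here.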