I am working with large datasets using {arrow, duckdb, dbplyr}.
I am running into issues after opening several tbl_duckdb_connection objects. I do this because I need to manipulate and join them until I am left with a single tbl_duckdb_connection.
When I finally try to pull some data from the final duckdb object (such as performing a slice and collecting it, or even just counting the number of rows), it takes forever and I get an "Out of Memory" error, which also says that "Database is launched in in-memory mode and no temporary directory is specified".
What am I doing wrong?
My code looks something like this, but there are more joins involved:
library(dplyr)

# Open the CSVs lazily with arrow and register them as DuckDB virtual tables
db1 <- arrow::open_dataset(source = paste0(path, "data1.csv"), format = "csv") %>%
  arrow::to_duckdb()
db2 <- arrow::open_dataset(source = paste0(path, "data2.csv"), format = "csv") %>%
  arrow::to_duckdb()

# Join the two lazy tables and pull the result into R
db3 <- inner_join(db1, db2, by = "id") %>%
  collect()
This is strange because even counting the number of rows in db3 using the following code takes a long time and gives the same error:
db3 %>%
  ungroup() %>%
  summarise(num = n())
arrow::to_duckdb() calls the private arrow_duck_connection() function, which basically calls duckdb::duckdb() with default arguments, creating an in-memory database.
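As a quick illustration (the connection name is mine), the default driver is equivalent to an in-memory database, so every intermediate result of your joins has to fit in RAM:

library(DBI)
library(duckdb)

# duckdb::duckdb() with no dbdir is the same as dbdir = ":memory:"
con_mem <- dbConnect(duckdb::duckdb())
dbGetQuery(con_mem, "SELECT 1 AS ok")  # queries run entirely in RAM
dbDisconnect(con_mem, shutdown = TRUE)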
The error message you get ("Out of Memory") indicates that this in-memory DuckDB database runs out of memory.
You could instead explicitly create a file-backed DuckDB connection, e.g. con <- DBI::dbConnect(duckdb::duckdb(dbdir = "mydir")), and pass it to arrow::to_duckdb() via its con argument.
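A minimal sketch of that approach (the file name mydb.duckdb and the final count are placeholders for your own pipeline), assuming you keep the tables lazy and only collect the final result:

library(dplyr)
library(DBI)
library(duckdb)

# File-backed DuckDB database instead of the default in-memory one,
# so DuckDB can spill intermediate results to disk
con <- dbConnect(duckdb::duckdb(dbdir = "mydb.duckdb"))

db1 <- arrow::open_dataset(paste0(path, "data1.csv"), format = "csv") %>%
  arrow::to_duckdb(con = con, auto_disconnect = FALSE)
db2 <- arrow::open_dataset(paste0(path, "data2.csv"), format = "csv") %>%
  arrow::to_duckdb(con = con, auto_disconnect = FALSE)

# Stay lazy: the join is only planned here, not executed
db3 <- inner_join(db1, db2, by = "id")

# The heavy work now runs inside the file-backed DuckDB database
db3 %>%
  summarise(num = n()) %>%
  collect()

dbDisconnect(con, shutdown = TRUE)

auto_disconnect = FALSE is set because the connection is created and closed manually here rather than being managed by arrow.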