This question is closely related to this one, where I’m using a mixed approach with both arrow and duckdb in R.
I’ve noticed that duckdb tables don’t always “persist” dplyr queries when switching contexts. For instance,
library(tidyverse)
library(duckdb)
library(arrow)
library(dbplyr)
iris_duck <- iris |>
to_duckdb(table_name = 'iris_duck') |>
mutate(new = Sepal.Length)
print(iris_duck)
# Source: SQL [?? x 6]
# Database: DuckDB 0.8.0 [unknown@Linux 4.18.0-513.24.1.el8_9.x86_64:R 4.2.3/:memory:]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5 3.6 1.4 0.2 setosa 5
6 5.4 3.9 1.7 0.4 setosa 5.4
7 4.6 3.4 1.4 0.3 setosa 4.6
8 5 3.4 1.5 0.2 setosa 5
9 4.4 2.9 1.4 0.2 setosa 4.4
10 4.9 3.1 1.5 0.1 setosa 4.9
However, when accessed from the connection,
con <- iris_duck$src$con
tbl(con, sql("SELECT * from iris_duck"))
> tbl(con, sql("SELECT * from iris_duck"))
# Source: SQL [?? x 5]
# Database: DuckDB 0.8.0 [unknown@Linux 4.18.0-513.24.1.el8_9.x86_64:R 4.2.3/:memory:]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
I’m looking for a way to make these queries persist, so that I can pick up in raw SQL for more complex tasks from where I left off in dplyr syntax. Is there a method to achieve this without first calling compute()?
Short answer: Yes, it’s possible.
Long answer: Since dbplyr acts as a dplyr-to-SQL translator, you can interweave dplyr syntax and raw SQL in the same workflow. The important step is to use tbl(con, sql()) rather than DBI::dbGetQuery() or DBI::dbExecute(): those execute immediately and return eager results, whereas tbl() returns a lazy table that keeps the pipeline composable. This lets you keep working lazily without materializing intermediate results via compute().
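To see the difference, here is a minimal, self-contained sketch; the con_demo connection and the cars_tbl table are hypothetical, just for illustration:
library(dplyr)
library(duckdb)
con_demo <- dbConnect(duckdb())
dbWriteTable(con_demo, "cars_tbl", mtcars)                # hypothetical example table
eager <- dbGetQuery(con_demo, "SELECT * FROM cars_tbl")   # eager: returns a plain data.frame, the pipeline ends here
lazy <- tbl(con_demo, sql("SELECT * FROM cars_tbl"))      # lazy: dplyr verbs keep composing into SQL
lazy |> filter(cyl > 4) |> show_query()                   # prints the translated SQL; nothing has executed yet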
By rendering and storing an interim SQL subquery, you can slot raw SQL queries into a dplyr workflow seamlessly:
library(tidyverse)
library(duckdb)
library(dbplyr)
library(arrow)
con <- dbConnect(duckdb())
# Using dplyr syntax with dbplyr
step1 <- iris |>
to_duckdb(con = con, table_name = 'iris_duck', auto_disconnect = FALSE) |>
mutate(new = Sepal.Length + 1)
# Convert the dplyr chain to an SQL query
query1_sql <- sql_render(step1)
# Embed the rendered query as a subquery in a larger raw SQL statement
complex_sql <- str_glue("SELECT * FROM ({query1_sql})")
# Submit the SQL query with `tbl(con, sql())` to DuckDB (lazy execution)
step2 <- tbl(con, sql(complex_sql))
# Continue using dplyr syntax on top of the SQL subquery
step2 <- step2 |> mutate(yet_another = Sepal.Length + 2)
# Final query execution
step2 |> collect()
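As a sanity check, show_query() prints the SQL that dbplyr has accumulated so far, confirming that nothing runs until collect(); dbDisconnect() then cleans up the connection when you are done:
# Inspect the composed SQL without executing it
step2 |> show_query()
# Shut down the database when finished
dbDisconnect(con, shutdown = TRUE)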