r, dplyr, apache-arrow, duckdb

Mixing dplyr syntax and SQL in duckdb (dbplyr)


This question is closely related to another question, where I'm employing a mixed approach using both arrow and duckdb in R.

I’ve noticed that duckdb tables don’t always “persist” dplyr queries when switching contexts. For instance,

library(tidyverse)
library(duckdb)
library(arrow)
library(dbplyr)

iris_duck <-  iris |>
  to_duckdb(table_name = 'iris_duck') |> 
  mutate(new = Sepal.Length)

print(iris_duck)

> iris_duck
# Source:   SQL [?? x 6]
# Database: DuckDB 0.8.0 [unknown@Linux 4.18.0-513.24.1.el8_9.x86_64:R 4.2.3/:memory:]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   new
          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl>
 1          5.1         3.5          1.4         0.2 setosa    5.1
 2          4.9         3            1.4         0.2 setosa    4.9
 3          4.7         3.2          1.3         0.2 setosa    4.7
 4          4.6         3.1          1.5         0.2 setosa    4.6
 5          5           3.6          1.4         0.2 setosa    5  
 6          5.4         3.9          1.7         0.4 setosa    5.4
 7          4.6         3.4          1.4         0.3 setosa    4.6
 8          5           3.4          1.5         0.2 setosa    5  
 9          4.4         2.9          1.4         0.2 setosa    4.4
10          4.9         3.1          1.5         0.1 setosa    4.9

However, when accessed from the connection,

con = iris_duck$src$con
tbl(con, sql("SELECT * from iris_duck"))

> tbl(con, sql("SELECT * from iris_duck"))
# Source:   SQL [?? x 5]
# Database: DuckDB 0.8.0 [unknown@Linux 4.18.0-513.24.1.el8_9.x86_64:R 4.2.3/:memory:]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows

I’m looking for a way to make these queries persist, so that I can continue in raw SQL for more complex tasks from wherever the dplyr pipeline left off.

Is there a way to achieve this without first calling compute()?


Solution

  • Short answer: Yes, it’s possible.

    Long answer:

    Since dbplyr is essentially a dplyr-to-SQL translator, you can interleave dplyr syntax and raw SQL freely in the same workflow.

    The important step is to submit raw SQL with tbl(con, sql()) rather than DBI::dbGetQuery() or DBI::dbExecute(), so that the result remains part of the lazy pipeline.

    This lets you keep working lazily, without materializing intermediate results via compute().

    By rendering the dplyr chain to an SQL string and using it as a subquery, you can slot raw SQL queries into a dplyr workflow seamlessly:

    library(tidyverse)
    library(duckdb)
    library(dbplyr)
    library(arrow)
    
    con <- dbConnect(duckdb())
    
    # Using dplyr syntax with dbplyr
    step1 <- iris |>
      to_duckdb(con = con, table_name = 'iris_duck', auto_disconnect = FALSE) |> 
      mutate(new = Sepal.Length + 1)
    
    # Convert the dplyr chain to an SQL query
    query1_sql <- sql_render(step1)
    
    # Embed the rendered query as a subquery inside a raw SQL statement
    complex_sql <- str_glue("SELECT * FROM ({query1_sql})")
    
    # Submit the SQL query with `tbl(con, sql())` to DuckDB (lazy execution)
    step2 <- tbl(con, sql(complex_sql))
    
    # Continue using dplyr syntax on top of the SQL subquery
    step2 <- step2 |> mutate(yet_another = Sepal.Length + 2)
    
    # Final query execution
    step2 |> collect()
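    
    The round trip can be repeated as often as you like. As a minimal sketch of the same pattern (the aggregate query and the names species_summary, avg_new, and avg_yet_another are illustrative, not part of the pipeline above), you can render the second lazy table back to SQL, wrap it in a more involved raw query such as a GROUP BY, and then pick things up in dplyr again:
    
    # Render the second lazy table back to SQL and use it as a subquery
    query2_sql <- sql_render(step2)
    
    # A more involved raw SQL step: aggregate the derived columns per species
    agg_sql <- str_glue(
      'SELECT "Species",
              AVG("new")         AS avg_new,
              AVG("yet_another") AS avg_yet_another
       FROM ({query2_sql}) AS q
       GROUP BY "Species"'
    )
    
    # Still lazy -- no compute() needed
    species_summary <- tbl(con, sql(agg_sql))
    
    # ...and back to dplyr syntax once more
    species_summary |>
      filter(avg_new > 6) |>
      collect()
    
    Because each step is still a lazy tbl, DuckDB only runs the combined query when collect() is finally called, so nothing is materialized along the way.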