sqlrdplyrapache-arrowduckdb

SQL query on arrow duckdb workflow in R


I'm wondering if it is possible to send an SQL query on duckdb during arrow workflow in R. (https://duckdb.org/2021/12/03/duck-arrow.html)

I know it is intended to use dplyr verbs, but there are some verbs that are not smoothly translated that codes that works on dplyr backend but not on either duckdb or arrow.

I would like to use a direct SQL query during the call, something like below example:

library(duckdb)
library(arrow)
library(dplyr)

# Open dataset using year,month folder partition
ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))

ds %>%
  # Pass off to DuckDB
  to_duckdb() |>
  SQL_QUERY("SELECT * LIMIT 100") |> # <- something like this
  collect()

Solution

  • Depending on what you are trying to accomplish, you can use SQL query fragments with the dplyr verbs:

    ds |>
      to_duckdb() |>
      filter(sql("vendor_name = 'VTS' AND year = 2009 AND month = 1")) |>
      collect()
    

    You can find more information about this approach in one of the dbplyr vignettes.

    This approach doesn't extend to more complex queries. However, you can use the information returned by to_duckdb() to send full SQL queries with the DBI package:

    db_tbl <-  ds |>
          to_duckdb(table_name = "nyc_tbl")
    
    con <- db_tbl$src$con
    DBI::dbGetQuery(con, "SELECT * FROM 'nyc_tbl' LIMIT 100")