r, dbi, duckdb

Fast upsert into duckdb


I have a dataset into which I need to upsert data (on conflict, replace some value columns). As this is the bottleneck of an app, I want it to be fairly optimized, but duckdb is really slow compared to SQLite in this case. What am I doing wrong here?

MWE

A quick example is this table, where the combination of id and name is unique and where val and created_at are the value columns that should be updated on a conflict.

I set up the table like this

con <- DBI::dbConnect(duckdb::duckdb())
q <- "CREATE TABLE main (
  id INTEGER NOT NULL,
  name TEXT NOT NULL,
  val REAL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  UNIQUE(id, name)
)"
DBI::dbExecute(con, q)

Then I simulate some values to insert into the table

# Create a data frame of N rows with an (id, name) key and two value columns
create_data <- function(N, k_letters) {
  data.frame(
    id = sample.int(100, N, replace = TRUE),
    name = replicate(N, paste(sample(letters, k_letters), collapse = "")),
    val = runif(N),
    created_at = Sys.time()
  ) |>
    # keep only one row per (id, name) so the key is unique
    dplyr::group_by(id, name) |> dplyr::slice(1) |> dplyr::ungroup()
}

set.seed(123)
df_orig <- create_data(10000, 2)
nrow(df_orig)
#> [1] 9285

DBI::dbAppendTable(con, "main", df_orig)

Now I simulate new data ...

set.seed(124)
new_data <- create_data(100000, 2)
nrow(new_data)
#> [1] 50902

# how many original rows remain as they are because they are not in the new data?
dplyr::anti_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 1971

# how many original rows are overwritten by the new data?
dplyr::semi_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 7314

# how many rows are newly added?
dplyr::anti_join(new_data, df_orig, by = c("id", "name")) |> nrow()
#> [1] 43588

... and upsert it into the table

upsert_db <- function(con, data) {
  q <- "INSERT INTO main
    VALUES ($id, $name, $val, $created_at)
    ON CONFLICT (id, name)
    DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"

  # bind the data frame columns as parameters; the statement is
  # executed once per row of `data`
  res <- DBI::dbSendStatement(con, q)
  DBI::dbBind(res, params = unname(as.list(data)))
  rr <- DBI::dbGetRowsAffected(res)
  DBI::dbClearResult(res)
  rr
}
t0 <- Sys.time()
upsert_db(con, new_data)
#> [1] 50902
difftime(Sys.time(), t0)
#> Time difference of 16.30046 secs

This works, but it is really slow; similar code with SQLite takes less than a second.
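
For comparison, the SQLite side would look roughly like this (a sketch, assuming RSQLite and an in-memory database; ON CONFLICT ... DO UPDATE needs SQLite >= 3.24):

# sketch of the SQLite comparison, reusing df_orig and new_data from above
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con_sqlite, "CREATE TABLE main (
  id INTEGER NOT NULL,
  name TEXT NOT NULL,
  val REAL NOT NULL,
  created_at TIMESTAMP NOT NULL,
  UNIQUE(id, name)
)")
DBI::dbAppendTable(con_sqlite, "main", df_orig)

q <- "INSERT INTO main
  VALUES (?, ?, ?, ?)
  ON CONFLICT (id, name)
  DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
res <- DBI::dbSendStatement(con_sqlite, q)
DBI::dbBind(res, params = unname(as.list(new_data)))
DBI::dbGetRowsAffected(res)
DBI::dbClearResult(res)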

Is there anything obvious that I am missing?


Solution

  • For duckdb, a looping insert is going to be much slower than using a bulk appender: the parameterized INSERT is executed once for every row of bound data, so the per-statement overhead is paid roughly 50,000 times.

    A bulk operation is possible in this scenario by first loading the data into a temporary table and then merging it into main with a single INSERT ... ON CONFLICT.

    upsert_db <- function(con, data) {
      # create an empty staging table with the same structure as main
      ct <- "CREATE OR REPLACE TEMP TABLE stg AS
        SELECT * FROM main WHERE 1 = 2"
      DBI::dbExecute(con, ct)

      # bulk-load the new data into the staging table
      DBI::dbAppendTable(con, "stg", data)

      # merge the staging table into main in a single statement
      iq <- "INSERT INTO main
        SELECT * FROM stg
        ON CONFLICT (id, name)
        DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
      rr <- DBI::dbExecute(con, iq)

      # drop the staging table
      DBI::dbExecute(con, "DROP TABLE stg")
      rr
    }
    

    By using a temp table and loading it with dbAppendTable, we get much better performance for the UPSERT operation: the conflict resolution happens in one set-based INSERT instead of once per row.

    t0 <- Sys.time()
    upsert_db(con, new_data)
    #> [1] 50902
    difftime(Sys.time(), t0)
    #> Time difference of 0.03464794 secs
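
    A further possible refinement (a sketch only, not benchmarked here): the duckdb R package can expose a data frame as a virtual table via duckdb::duckdb_register, which avoids copying the data into a staging table at all. A hypothetical upsert_db_reg could then merge straight from the data frame:

    upsert_db_reg <- function(con, data) {
      # expose the data frame to duckdb as a read-only virtual table
      duckdb::duckdb_register(con, "stg_df", data)
      on.exit(duckdb::duckdb_unregister(con, "stg_df"), add = TRUE)

      # merge straight from the registered data frame into main
      DBI::dbExecute(con, "INSERT INTO main
        SELECT * FROM stg_df
        ON CONFLICT (id, name)
        DO UPDATE SET val = excluded.val, created_at = excluded.created_at;")
    }

    Either way, a quick sanity check on the result is that main should now hold 1971 + 50902 = 52873 rows (the untouched original rows plus everything in new_data).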