I have a dataset where I need to upsert data (on conflict, replace some value columns). As this is the bottleneck of the app, I want it to be reasonably fast, but DuckDB is much slower than SQLite in this case. What am I doing wrong here?
A quick example is this table, where id and name are unique and where val and created_at are value columns that should be updated on a conflict.
I set up the table like this:
con <- DBI::dbConnect(duckdb::duckdb())
q <- "CREATE TABLE main (
id INTEGER NOT NULL,
name TEXT NOT NULL,
val REAL NOT NULL,
created_at TIMESTAMP NOT NULL,
UNIQUE(id, name)
)"
DBI::dbExecute(con, q)
Then I simulate some values to insert into the table
# Create a data frame of simulated rows
create_data <- function(N, k_letters) {
  data.frame(
    id = sample.int(100, N, replace = TRUE),
    name = replicate(N, paste(sample(letters, k_letters), collapse = "")),
    val = runif(N),
    created_at = Sys.time()
  ) |>
    # keep only one row per (id, name) so the key is unique
    dplyr::group_by(id, name) |> dplyr::slice(1) |> dplyr::ungroup()
}
set.seed(123)
df_orig <- create_data(10000, 2)
nrow(df_orig)
#> [1] 9285
DBI::dbAppendTable(con, "main", df_orig)
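As an optional sanity check (using the con and main table from above), the row count should now match nrow(df_orig):
# optional: confirm the initial load
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM main")
# should report 9285, i.e. nrow(df_orig)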
Now I simulate new data ...
set.seed(124)
new_data <- create_data(100000, 2)
nrow(new_data)
#> [1] 50902
# how many original rows remain untouched because their key is not in the new data?
dplyr::anti_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 1971
# how many original rows are overwritten by the new data?
dplyr::semi_join(df_orig, new_data, by = c("id", "name")) |> nrow()
#> [1] 7314
# how many rows are newly added?
dplyr::anti_join(new_data, df_orig, by = c("id", "name")) |> nrow()
#> [1] 43588
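These counts reconcile: 1971 + 7314 = 9285 matches nrow(df_orig) and 7314 + 43588 = 50902 matches nrow(new_data), so after the upsert the table should hold 1971 + 50902 = 52873 rows.
# quick arithmetic check on the counts above
1971 + 7314    # 9285  == nrow(df_orig)
7314 + 43588   # 50902 == nrow(new_data)
1971 + 50902   # 52873 rows expected in main after the upsert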
... and upsert it into the table
upsert_db <- function(con, data) {
  q <- "INSERT INTO main
        VALUES ($id, $name, $val, $created_at)
        ON CONFLICT (id, name)
        DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
  res <- DBI::dbSendStatement(con, q)
  DBI::dbBind(res, params = unname(as.list(data)))
  rr <- DBI::dbGetRowsAffected(res)
  DBI::dbClearResult(res)
  rr
}
t0 <- Sys.time()
upsert_db(con, new_data)
#> [1] 50902
difftime(Sys.time(), t0)
#> Time difference of 16.30046 secs
This works but is really slow; similar code with SQLite takes less than a second.
Is there anything obvious that I am missing?
For duckdb, a looping insert is going to be much slower than using a bulk appender. A bulk operation is possible in this scenario by first loading the data into a temporary table and then issuing a single INSERT ... ON CONFLICT from it.
upsert_db <- function(con, data) {
  # create an empty staging table with the same schema as main
  ct <- "CREATE OR REPLACE TEMP TABLE stg AS
         SELECT * FROM main WHERE 1 = 2"
  DBI::dbExecute(con, ct)
  # bulk-load the new data into the staging table
  DBI::dbAppendTable(con, "stg", data)
  # merge the data from the staging table into main
  iq <- "INSERT INTO main
         SELECT * FROM stg
         ON CONFLICT (id, name)
         DO UPDATE SET val = excluded.val, created_at = excluded.created_at;"
  rr <- DBI::dbExecute(con, iq)
  # drop the staging table
  dq <- "DROP TABLE stg"
  DBI::dbExecute(con, dq)
  rr
}
By using a temp table and loading it with dbAppendTable, we get much better performance for the UPSERT operation.
t0 <- Sys.time()
upsert_db(con, new_data)
#> [1] 50902
difftime(Sys.time(), t0)
#> Time difference of 0.03464794 secs
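As a final sanity check (same simulated data as above), the table should now contain the 1971 untouched original rows plus the 50902 rows coming from new_data:
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM main")
# should report 52873 (1971 + 50902)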