I'm writing a Django-ORM enchancement that attempts to cache models and postpone model saving until the end of the transaction. It's all almost done, however I came across an unexpected difficulty in SQL syntax.
I'm not much of a DBA, but from what I understand, databases don't really work efficiently for many small queries. Few bigger queries are much better. For example it's better to use large batch inserts (say 100 rows at once) instead of 100 one-liners.
Now, from what I can see, SQL doesn't really supply any statement to perform a batch update on a table. The term seems to be confusing so, I'll explain what I mean by that. I have an array of arbitrary data, each entry describing a single row in a table. I'd like to update certain rows in the table, each using data from its corresponding entry in the array. The idea is very similar to a batch insert.
For example: My table could have two columns "id"
and "some_col"
. Now the array describing the data for a batch update consists of three entries (1, 'first updated')
, (2, 'second updated')
, and (3, 'third updated')
. Before the update the table contains rows: (1, 'first')
, (2, 'second')
, (3, 'third')
.
I came accross this post:
Why are batch inserts/updates faster? How do batch updates work?
which seems to do what I want, however I can't really figure out the syntax at the end.
I could also delete all the rows that require updating and reinsert them using a batch insert, however I find it hard to believe that this would actually perform any better.
I work with PostgreSQL 8.4, so some stored procedures are also possible here. However as I plan to open source the project eventually, any more portable ideas or ways to do the same thing on a different RDBMS are most welcome.
Follow up question: How to do a batch "insert-or-update"/"upsert" statement?
Test results
I've performed 100x times 10 insert operations spread over 4 different tables (so 1000 inserts in total). I tested on Django 1.3 with a PostgreSQL 8.4 backend.
These are the results:
Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces a substantial overhead.
Disclaimer: I didn't introduce any indices apart from default primary key indices, so insert operations could possibly run faster because of that.
I've used 3 strategies for batch transactional work:
flush()
method against the Hibernate Session
, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize
, when fetching associations, Hibernate will use IN
instead of =
, leading to fewer SELECT
statements to load up the collections.