djangopostgresqldjango-modelsdatabase-optimization

Best way to process database in chunks with Django QuerySet?


I am running a batch operation over all rows in a database. This involves selecting every single model and doing something to it. It makes sense to split this into chunks and do it chunk by chunk.

I'm currently using Paginator, because it's convenient. This means that I need an ordering on the values so that they can be paged through in order. This does generate SQL statements which have order and limit clauses, and for each chunk I think Postgres may be sorting the whole table (although I can't claim to have any knowledge about internals). All I know is that the database is at around 50% CPU and I think that's too high just to be doing selects.

What's the best way to iterate over the entire table in a RDMBS/CPU-friendly way?

Assuming that the contents of the database doesn't change during the batch operation.


Solution

  • From your description you don't actually care about the sort order of the rows you process. If you have primary keys in your tables (which I expect!), this crude method of partitioning would be much faster:

    SELECT * FROM tbl WHERE id BETWEEN 0    AND 1000;
    SELECT * FROM tbl WHERE id BETWEEN 1001 AND 2000;
    ...
    

    This performs the same for any offset and (almost) the same for any size of table. Retrieve min and max of your primary key and partition accordingly:

    SELECT min(id), max(id) from tbl; -- then divide in suitable chunks
    

    As opposed to:

    SELECT * FROM tbl ORDER BY id LIMIT 1000;
    SELECT * FROM tbl ORDER BY id LIMIT 1000 OFFSET 1000;
    ...
    

    This is generally slower because all rows have to be sorted and performance degrades additionally with higher offsets and bigger tables.