Simplified scenario:
I want to update one table A (10 million rows) with a value from another table B (20 mil rows). The two tables are linked with ids.
It took more than 7hrs to update the whole thing in one go. (I don't know exactly how long as I stopped the script in the middle.)
So my idea is to update table A in batches using OFFSET
and LIMIT
clauses. So far with no luck.
Wrapped up in a procedure, the code looks like below:
DECLARE
offset_number integer := 0;
batch_size integer := 1000;
BEGIN
LOOP
UPDATE TableA temp1
SET TableA_column_value_to_be_updated = (
SELECT
tableB_column_value
FROM
TableB temp2
WHERE
temp2.id = temp1.id
AND some_other_conditions_in_TableB
)
WHERE
some_other_conditions_in_Table
OFFSET offset_number
LIMIT batch_size ;
COMMIT;
offset_number := offset_number + batch_size;
EXIT WHEN NOT FOUND;
END LOOP;
END;
The engine reports an error with exception:
org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"
I have no idea what it is. Notably, it seems to work without OFFSET
and LIMIT
.
Any ideas why this would happen? Should I use other loop statement?
LIMIT
and OFFSET
are not in the syntax of an SQL UPDATE
statement. You need SELECT
for that.
Also, OFFSET
scales poorly to "paginate" a big table. Remember the upper bound from the last iteration instead.
Something like this could work:
CREATE OR REPLACE PROCEDURE upd_in_batches(_batch_size int = 1000)
LANGUAGE plpgsql AS
$proc$
DECLARE
_id_bound int = 0; -- or whatever?
BEGIN
LOOP
WITH sel AS (
SELECT a.id -- id = PK!
FROM tablea a
WHERE a.id > _id_bound
-- AND <some other conditions in Table A>
ORDER BY a.id
LIMIT _batch_size
FOR UPDATE
)
, upd AS (
UPDATE tablea a
SET target_col = b.b_source_col
FROM sel s
JOIN tableb b USING (id)
WHERE a.id = s.id
AND a.target_col IS DISTINCT FROM b.b_source_col
)
SELECT max(id) -- always returns a row
FROM sel
INTO _id_bound;
IF _id_bound IS NULL THEN
EXIT; -- no more rows found; we're done, exit loop
ELSE
COMMIT;
END IF;
END LOOP;
END
$proc$
Use a SELECT
statement instead to apply your LIMIT
. To avoid race conditions with concurrent writes, throw in a locking clause (FOR UPDATE
). You may or may not need that.
You might be able to iUPDATE
directly and just increment lower & upper bound for the filter on id
instead, which is cheaper. Depends on the details of your setup and requirements. Each has its caveats.
See: