Update with limit and offset

Simplified scenario:

I want to update one table A (10 million rows) with a value from another table B (20 mil rows). The two tables are linked with ids.

It took more than 7hrs to update the whole thing in one go. (I don't know exactly how long as I stopped the script in the middle.)
So my idea is to update table A in batches using OFFSET and LIMIT clauses. So far with no luck.

Wrapped up in a procedure, the code looks like below:

DECLARE 
    offset_number integer := 0;
    batch_size integer := 1000; 

BEGIN
    LOOP 
        UPDATE TableA temp1
        SET TableA_column_value_to_be_updated = (
            SELECT 
                tableB_column_value
            FROM 
                TableB temp2
            WHERE 
                temp2.id = temp1.id 
                AND some_other_conditions_in_TableB    
            ) 
        WHERE    
             some_other_conditions_in_Table
        OFFSET offset_number 
        LIMIT batch_size ; 
            
        COMMIT; 
    
        offset_number := offset_number + batch_size;
        EXIT WHEN NOT FOUND; 
    END LOOP;
END;

The engine reports an error with exception:

org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"

I have no idea what it is. Notably, it seems to work without OFFSET and LIMIT.

Any ideas why this would happen? Should I use other loop statement?

Solution

LIMIT and OFFSET are not in the syntax of an SQL UPDATE statement. You need SELECT for that.
Also, OFFSET scales poorly to "paginate" a big table. Remember the upper bound from the last iteration instead.

Something like this could work:

CREATE OR REPLACE PROCEDURE upd_in_batches(_batch_size int = 1000)
  LANGUAGE plpgsql AS
$proc$
DECLARE
   _id_bound int = 0;  -- or whatever?
BEGIN
   LOOP
      WITH sel AS (
         SELECT a.id  -- id = PK!
         FROM   tablea a
         WHERE  a.id > _id_bound
      -- AND    <some other conditions in Table A>
         ORDER  BY a.id
         LIMIT  _batch_size
         FOR    UPDATE
         )
      , upd AS (
         UPDATE tablea a
         SET    target_col = b.b_source_col
         FROM   sel s
         JOIN   tableb b USING (id)
         WHERE  a.id = s.id
         AND    a.target_col IS DISTINCT FROM b.b_source_col
         )
      SELECT max(id)  -- always returns a row
      FROM   sel
      INTO   _id_bound;

      IF _id_bound IS NULL THEN
          EXIT;  -- no more rows found; we're done, exit loop
      ELSE
         COMMIT;
      END IF;
   END LOOP;
END
$proc$

Use a SELECT statement instead to apply your LIMIT. To avoid race conditions with concurrent writes, throw in a locking clause (FOR UPDATE). You may or may not need that.

You might be able to iUPDATE directly and just increment lower & upper bound for the filter on id instead, which is cheaper. Depends on the details of your setup and requirements. Each has its caveats.

See: