postgresql

Is there any difference between CTE INSERT+UPDATE in one go versus two different queries


Imagine a table like this

create table public.my_table (
    id          uuid primary key,
    category_id uuid not null,
    name        text not null,
    is_active   bool not null,
    created_at  timestamp with time zone not null default now(),
    updated_at  timestamp with time zone not null default now()
);

There is a rule: a category_id must have only one record with is_active=true

I can come up with 2 ways of adding records:

#1 using CTE insert to add a new active record + update to deactivate previous:

WITH new_record AS (
    INSERT INTO my_table (id, category_id, name, is_active) VALUES (?, ?, ?, true)
    RETURNING id, category_id
)
UPDATE my_table
SET is_active  = false,
    updated_at = now()
FROM new_record
WHERE my_table.category_id = new_record.category_id
  AND my_table.id <> new_record.id
  AND my_table.is_active = true

#2 making separate insert & update queries in a transaction:

BEGIN TRANSACTION;

INSERT INTO my_table (id, category_id, name, is_active) VALUES (?, ?, ?, true);

UPDATE my_table
   SET is_active  = false,
       updated_at = now()
   WHERE my_table.category_id = ?
         AND my_table.id <> ?
         AND my_table.is_active = true;

COMMIT;

I like the second way for its simplicity.

Does the first way have any benefits compared to the second?


Solution

  • does the first way have any pros?

    It does: it's one atomic operation.

    By default you operate in read committed transaction isolation mode, which means there's a chance that in your second example one of your workers will (demo at db<>fiddle):

    1. Open the transaction, insert the new row and promote it to active.
    2. Meanwhile, some other worker will insert an even newer active row and commit it.
    3. The first worker will proceed to run the update to make sure all other rows of the category are deactivated. At that point it'll notice the newer row and deactivate it. As a result, once it commits, your active row might not be the latest one, because the slower worker overwrote the work of the faster, more recent one.
    4. If the other worker didn't commit before the first one issued the update, they will both deactivate some older row, and after that they'll both commit their active row. As a result, you'll have two active rows in a category. It can go as high as the number of concurrent clients.
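    The failure mode in steps 3-4 can be sketched as an interleaved timeline of two read-committed sessions. The ids here are hypothetical placeholders standing in for real uuid literals:

    ```sql
    -- Setup: category 'cat1' already has one active row with id 'old1'.

    -- session A:
    BEGIN;
    INSERT INTO my_table (id, category_id, name, is_active)
    VALUES ('a1', 'cat1', 'A', true);

    -- session B:
    BEGIN;
    INSERT INTO my_table (id, category_id, name, is_active)
    VALUES ('b1', 'cat1', 'B', true);

    -- session A: deactivates 'old1'; the uncommitted 'b1' is invisible to A
    UPDATE my_table SET is_active = false, updated_at = now()
    WHERE category_id = 'cat1' AND id <> 'a1' AND is_active;

    -- session B: blocks on the locked 'old1' until A commits, then skips it
    -- (it's already inactive); 'a1' is not in B's snapshot, so it survives
    UPDATE my_table SET is_active = false, updated_at = now()
    WHERE category_id = 'cat1' AND id <> 'b1' AND is_active;

    -- session A:
    COMMIT;
    -- session B:
    COMMIT;

    -- Result: both 'a1' and 'b1' are now is_active=true for 'cat1'.
    ```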

    Your first example does the whole insert-one-row, update-the-others operation in one go, on one snapshot that leaves no window for anyone to slip in any changes. The doc recommended by @Adrian Klaver puts it well:

    when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot (see Chapter 13), so they cannot “see” one another's effects on the target tables. This alleviates the effects of the unpredictability of the actual order of row updates

    If my_table.category_id is really a foreign key pointing at the primary key category.id, it's better to move is_active to the category table. A PK is unique, so it'll be smaller and faster to search for the category and update the pointer to the currently active my_table.id:

    create table category(
      id uuid primary key
     ,active_my_table_id uuid references my_table(id)
    );
    

    To populate it, rather than taking all rows where is_active, grab a distinct on, in case some categories have no active row at all:

    insert into category
    select distinct on (category_id) category_id
         , id
    from my_table
    order by category_id
            ,is_active desc
            ,created_at desc
            ,updated_at desc;
    
    alter table public.my_table 
      drop column is_active cascade--drops the exclusion constraint too
     ,add foreign key(category_id)references category(id);
    

    Then, inserting a new active row for a category:

    with new_record as (
        insert into my_table (id, category_id, name) 
        values ( get_row_uuid(row_)
                ,get_category_uuid(category_)
                ,row_)
        returning id, category_id
    )
    insert into category
    select category_id, id
    from new_record
    on conflict(id)
    do update set active_my_table_id = excluded.active_my_table_id
    

    From a technical standpoint, expressing your business rule as an actual SQL-level constraint is optional. But if you define it, it automatically rejects any violation attempts, throwing a clear error message and logging the event, which prevents the table from entering an invalid state with multiple active records for you to track down and fix. Plus, in the config above, you can use it to run insert..on conflict..do update.
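    For instance, once a guard like the one_active_per_category exclusion constraint shown further down is in place, an attempt to sneak in a second active row fails immediately. The uuid below is a hypothetical placeholder for a category that already has an active row:

    ```sql
    insert into my_table (id, category_id, name, is_active)
    values ( gen_random_uuid()
            ,'00000000-0000-0000-0000-000000000001'::uuid -- hypothetical
            ,'second active row'
            ,true);
    -- fails with an error along the lines of:
    -- conflicting key value violates exclusion constraint "one_active_per_category"
    ```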


    That being said, even if you have the separate category table, adding, maintaining and manipulating a 128-bit uuid fk in there will obviously take more space than the 8-bit boolean in my_table. If you'd rather keep is_active where it is, you can follow @Bergi's suggestion below:
    demo at db<>fiddle

    alter table my_table 
      add constraint one_active_per_category
      exclude (category_id with =)
      where (is_active)
      deferrable initially deferred;
    

    It keeps the is_active untouched and guards the table against inserting more than a single active row per category.

    Making the constraint deferrable initially deferred means it'll only be checked at the end of the transaction, so you can deactivate then activate, or the other way around, as long as the state you commit respects the rule. Otherwise, you need to deactivate first and only then activate, to avoid a window of time when the old and the new entry for the category both have is_active=true.
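    With the deferred variant, for example, an activate-first order is allowed to pass through a temporarily invalid state, because the check only fires at commit (same ? placeholders as in your examples):

    ```sql
    begin;

    -- Two active rows for the category exist after this insert: tolerated,
    -- because the deferred constraint isn't checked yet.
    insert into my_table (id, category_id, name, is_active)
    values (?, ?, ?, true);

    update my_table
       set is_active  = false,
           updated_at = now()
     where category_id = ?
       and id <> ?
       and is_active;

    commit; -- the constraint is verified here, after the update repaired the state
    ```

    Without deferrable initially deferred, the same transaction has to run the deactivating update first and the insert second.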
    If you always take that deactivate-first route, a regular unique index is simpler and works faster:

    create unique index on my_table(category_id)
      where (is_active);
    

    Unique and exclusion constraints are backed by indexes, so setting up just an index instead of a constraint has the same effect. Normally you could tie an existing index to the table so that it becomes a proper constraint (alter table..add constraint..unique using index), but that's not supported for partial indexes.
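    For illustration, trying to promote the partial index from above into a constraint is rejected. The index name here is assumed; Postgres auto-generates one, which you can check with \d my_table:

    ```sql
    alter table my_table
      add constraint one_active_per_category
      unique using index my_table_category_id_idx;
    -- fails: a partial index cannot back a primary key or unique constraint
    ```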

    If multiple concurrent workers try the operation, all but one will hit a lock and get rejected; the exact point and error message depend on whether they run it atomically or as a multi-step transaction, and on whether the constraint is deferred. The advantage of marking what's active in the separate category table, in one step, is that workers instead get queued up and wait - first come, first served.