sqlpostgresqlupsertexclusion-constraint

INSERT or UPDATE based on changing timestamp


I have a table where I record events. These events will come into the system once a minute, per user - on the order of 10's of thousands a minute. But I don't need to save all of them.
If an event comes in within 90 seconds of the previous event, I want to update the previous row. If it's been more than 90 seconds, I want to insert a new row.

Example table:

create table events (
  id serial,
  user_id int references users(id) not null,
  created_at timestamp not null default now(),
  updated_at timestamp not null default now(),
  ...some other event columns here...
);

create index idx_events_user_id_updated_at on events (user_id, updated_at desc);

And the pseudocode would be something like:

  1. Fetch last event for user
  2. If updated_at is within the last 90 seconds:
    1. Update row, including updated_at with now()
  3. Otherwise:
    1. Insert new row

Is there a way to do this with a single Postgres statement?

I'm aware of on conflict, but I don't think it'll work for this use-case. The (user_id, updated_at) pair could be defined as a unique constraint, which could be used to trigger an on conflict, but the timestamps are arbitrary. These events come in "every minute", but not exactly on a minute (or even exactly a minute apart, due to network latency, server latency, etc etc, hence using 90 seconds to give a 30 second buffer). Truncating the timestamps to the minute would reduce the feature's usefulness, so I'd hate to do that just to handle upserts more cleanly.


Solution

  • Is there a way to do this with a single Postgres statement?

    Can be done with an UPSERT command. You need an EXCLUSION constraint on the timestamp range. For the leading integer column user_id, the additional module btree_gist must be installed. See:

    ALTER TABLE events
      ADD CONSTRAINT user_90sec
      EXCLUDE USING gist (user_id WITH =, tsrange(updated_at, updated_at + interval '90 sec') WITH &&);
    

    Notably, the same doesn't work for timestamptz because timestamptz + interval is only STABLE, while timestamp + interval is IMMUTABLE as required for the implied GiST index (or any index for that matter).

    Query:

    WITH input_rows(user_id, data) AS (
       VALUES
          (1, 'foo_new')  -- your input here
        , (2, 'bar_new')
        , (3, 'baz_new')
        -- more?
       )
    , ins AS (
       INSERT INTO events (user_id, data) 
       SELECT user_id, data FROM input_rows
       ON CONFLICT ON CONSTRAINT user_90sec DO NOTHING
       RETURNING user_id
       )
    UPDATE events e
    SET    updated_at = LOCALTIMESTAMP
         , data = i.data
    FROM   input_rows i
    LEFT   JOIN ins USING (user_id)
    WHERE  ins.user_id IS NULL
    AND    e.user_id = i.user_id
    AND    e.updated_at > LOCALTIMESTAMP - interval '90 sec';
    

    fiddle

    This assumes at most one row per user_id in the input.

    And no concurrent, competing writes. Else, there may be race conditions due to the tiny, but unavoidable time gap between INSERT and UPDATE. (ON CONFLICT ... DO UPDATE is not allowed based on an exclusion constraint.)

    Also, no ambiguous data types in the input, or you must add an explicit cast. See:

    Depending on details of your case, there may well be faster solutions. Like, just write all incoming rows (with a substantially faster COPY command), and later consolidate in bulk operations.

    While sticking with this solution, it may be faster to add a generated column for the tsrange and base the exclusion constraint (and its implied GiST index) on that. Bloats the table row a bit, though. See: