Tags: sql, postgresql, concurrency, transactions, locking

PostgreSQL concurrent transaction issues


I'm currently building a crawler. Multiple crawling workers access the same PostgreSQL database. Sadly I'm encountering issues with the main transaction presented here:

BEGIN ISOLATION LEVEL SERIALIZABLE;
    UPDATE webpages
    SET locked = TRUE
    WHERE url IN 
        (
            SELECT DISTINCT ON (source) url
            FROM webpages
            WHERE
                (
                    last IS NULL
                    OR
                    last < refreshFrequency
                )
                AND
                locked = FALSE
            LIMIT limit
        )
    RETURNING *;
COMMIT;

I tried two different transaction isolation levels, but neither solved the problem.

I'm fairly new to PostgreSQL and SQL in general so I'm really not sure what I could do to fix this issue.

Update:
PostgreSQL version is 9.2.x.
webpage table definition:

CREATE TABLE webpages (
  last timestamp with time zone,
  locked boolean DEFAULT false,
  url text NOT NULL,
  source character varying(255) PRIMARY KEY
);

Solution

  • For Postgres 9.5 or later

    Consider FOR UPDATE ... SKIP LOCKED instead: it skips over rows that are already row-locked by other transactions instead of blocking on them.
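
    On 9.5+ a sketch against the question's webpages table could look like this (the interval and the limit of 10 are placeholders; picking at most one URL per source, as in the question, would additionally need DISTINCT ON):

    UPDATE webpages w
    SET    locked = true
    FROM  (
       SELECT url
       FROM   webpages
       WHERE  locked = false
       AND   (last IS NULL OR last < now() - interval '1 day')  -- placeholder refresh rule
       LIMIT  10                  -- placeholder limit
       FOR    UPDATE SKIP LOCKED  -- skip rows other workers have row-locked
       ) u
    WHERE  w.url = u.url
    RETURNING w.*;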

    Original answer for Postgres 9.2

    Clarification

    This is how I understand the task:

    "Lock a maximum of limit URLs which fulfill some criteria and are not locked, yet. To spread out the load on sources, every URL should come from a different source."

    DB design

    Assuming a separate table source: this makes the job faster and easier. If you don't have such a table, create it; it's the proper design anyway:

    CREATE TABLE source (
      source_id serial NOT NULL PRIMARY KEY
    , source    text   NOT NULL
    );
    
    CREATE TABLE webpage (
      source_id int       NOT NULL REFERENCES source
    , url       text      NOT NULL PRIMARY KEY
    , locked    boolean   NOT NULL DEFAULT false        -- may not be needed
    , last      timestamp NOT NULL DEFAULT '-infinity'  -- makes query simpler
    );
    

    Alternatively, you can use a recursive CTE efficiently; see the recursive alternative below.

    Basic solution with advisory locks

    Advisory locks make this safe and cheap even in default read committed isolation level:

    UPDATE webpage w
    SET    locked = true
    FROM  (
        SELECT (SELECT url
                FROM   webpage
                WHERE  source_id = s.source_id
                AND   (last >= refreshFrequency) IS NOT TRUE  -- "refreshFrequency": your cutoff timestamp
                AND    locked = FALSE
                AND    pg_try_advisory_xact_lock(hashtext(url))  -- hash text url to int key; only TRUE if free
                LIMIT  1  -- get 1 URL per source
               ) AS url
     FROM  (
        SELECT source_id  -- the FK column in webpage
        FROM   source
        ORDER  BY random()
        LIMIT  limit      --  random selection of "limit" sources
        ) s
     FOR    UPDATE
     ) l
    WHERE  w.url = l.url
    RETURNING *;
    

    Alternatively, you could work with only advisory locks and not use the table column locked at all. Basically, just run the SELECT statement. Locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep the locks until the end of the session. Only UPDATE once to set last when done (and possibly release the advisory lock).
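
    A rough sketch of that advisory-lock-only workflow (the interval, the limit, and the $1 parameter are placeholders):

    -- 1. Claim some due URLs. pg_try_advisory_lock() holds the lock for the
    --    rest of the session, so no "locked" column is needed.
    SELECT url
    FROM   webpage
    WHERE (last >= now() - interval '1 day') IS NOT TRUE  -- placeholder refresh rule
    AND    pg_try_advisory_lock(hashtext(url))            -- skip URLs held by other workers
    LIMIT  10;  -- caveat: the filter may acquire locks on a few more rows than are returned

    -- 2. After crawling a URL, record the visit and release its lock:
    UPDATE webpage
    SET    last = now()
    WHERE  url = $1;

    SELECT pg_advisory_unlock(hashtext($1));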


    Recursive alternative

    To get the full limit number of rows if available, use a RECURSIVE CTE and iterate all sources until you found enough or no more can be found.

    As I mentioned above, you may not need the column locked at all and operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.

    WITH RECURSIVE s AS (
       SELECT source_id
            , row_number() OVER (ORDER BY random()) AS rn
       FROM   source  -- you might exclude "empty" sources early ...
       )
    , page(source_id, rn, ct, url) AS (
       SELECT 0, 0, 0, ''::text     -- dummy init row
       UNION ALL
       SELECT s.source_id, s.rn
          , CASE WHEN p.url <> ''
                 THEN p.ct + 1
                 ELSE p.ct END      -- only increment if a URL was found last round
          , (SELECT url
             FROM   webpage
             WHERE  source_id = s.source_id
             AND   (last >= refreshFrequency) IS NOT TRUE
             AND    locked = FALSE  -- may not be needed
             AND    pg_try_advisory_xact_lock(hashtext(url))  -- only TRUE if free
             LIMIT  1               -- get 1 URL per source
            ) AS url                -- try, may come up empty
       FROM   page p
       JOIN   s ON s.rn = p.rn + 1
       WHERE  CASE WHEN p.url <> ''
                   THEN p.ct + 1
                   ELSE p.ct END < limit  -- your limit here
       )
    SELECT url
    FROM   page
    WHERE  url <> '';               -- exclude '' and NULL
    

    Alternatively, if you need to manage locked, too, attach this query to the UPDATE from above.
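
    For example (a sketch; the CTE bodies are exactly the ones above, elided here for brevity):

    WITH RECURSIVE s AS (
       ...  -- random ordering of sources, as above
    )
    , page(source_id, rn, ct, url) AS (
       ...  -- iterate sources, as above
    )
    UPDATE webpage w
    SET    locked = true
    FROM   page
    WHERE  w.url = page.url
    AND    page.url <> ''      -- exclude the dummy row and empty tries
    RETURNING w.*;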