sqlpostgresqlconcurrencymany-to-manyupsert

How to insert many rows of tags based on input array?


I have a table of posts, a table of hashtags, and a table linking posts to hashtags, like so:

CREATE TABLE posts(
id SERIAL PRIMARY KEY,
post_data VARCHAR(128) NOT NULL
);

CREATE TABLE hashtags (
id SERIAL PRIMARY KEY,
value VARCHAR(128) NOT NULL
);

CREATE TABLE post_hashtags(
id SERIAL PRIMARY KEY,
post_id INTEGER NOT NULL REFERENCES posts(id),
hashtag_id INTEGER NOT NULL REFERENCES hashtag(id)
);

INSERT INTO posts(post_data) VALUES ('post1');
INSERT INTO hashtags (value) VALUES ('hashtag1'), ('hashtag2'), ('hashtag3');

-- At this point I might want to add links between post1 and existing hashtags as well as possibly new ones

Whenever a user makes a post using several hash tags, I want to:

  1. Create a new row in posts and get the ID
  2. Create new rows in hashtags for hashtags that don't exist and get their IDs
  3. For each hashtag, create a row in post_hashtags with the post and hashtag ids

Right now, I'm capable of handling it on the server side but that's terrible performance, obviously: insert post and get ID; for each hashtag, insert into hashtags if it doesn't exist, get the ID; then insert into the post_hashtags. Lots of calls to DB that I'm guessing could be simplified, but my SQL skills are lacking at the moment.


Solution

  • You can do this safely and efficiently with a single query using data-modifying CTEs, and an UPSERT for the tags.

    Of course, you need a UNIQUE or PRIMARY KEY constraint on hashtags (value). (And you should have one on post_hashtags (post_id, hashtag_id).)

    WITH input(post_data, tags) AS (  -- provide single data row with array of tags
       VALUES ('post2', '{hashtag1, hashtag2, hashtag4}'::text[])  -- single post!
       )
    , tag_set AS (  -- unnest tags - may be empty/missing (?)
       SELECT unnest(i.tags) AS value
       FROM   input i
       )
    , ins_p AS (
       INSERT INTO posts (post_data)
       SELECT i.post_data
       FROM   input i
       RETURNING id AS post_id
       )
    , ins_h AS (
       INSERT INTO hashtags (value)
       SELECT t.value
       FROM   tag_set t
       WHERE  NOT EXISTS (SELECT FROM hashtags h WHERE h.value = t.value)  -- optional to avoid burning lots of serial IDs
       ON CONFLICT (value) DO NOTHING
       RETURNING id AS hashtag_id
       )
    INSERT INTO post_hashtags
          (  post_id,   hashtag_id)
    SELECT p.post_id, t.hashtag_id
    FROM   ins_p p
    CROSS  JOIN (  -- only if actual tags were entered
       TABLE ins_h  -- new tags
       UNION  ALL
       SELECT h.id AS hastag_id  -- pre-existing tags
       FROM   tag_set  t
       JOIN   hashtags h USING (value)
       ) t
    RETURNING *;
    

    fiddle

    This is safe, even under heavy concurrent write load.
    Two possible corner cases remain:

    1. A concurrent transaction might create the same new hashtag, but then roll back, resulting in a missing entry in table hashtags for this query, raising an exception.

    2. A pre-existing hashtag might be deleted by a concurrent transaction between looking it up here, and inserting into post_hashtags. Again, raising an exception. Extremely unlikely, but possible.

    You might just re-run this query if one of these two unlikely issues arises.
    Or you cover both corner cases in the same query. I did not bother to go that far. See instructions and detailed explanation here:

    Related: