I have a table of posts, a table of hashtags, and a table linking posts to hashtags, like so:
CREATE TABLE posts(
id SERIAL PRIMARY KEY,
post_data VARCHAR(128) NOT NULL
);
CREATE TABLE hashtags (
id SERIAL PRIMARY KEY,
value VARCHAR(128) NOT NULL
);
CREATE TABLE post_hashtags(
id SERIAL PRIMARY KEY,
post_id INTEGER NOT NULL REFERENCES posts(id),
hashtag_id INTEGER NOT NULL REFERENCES hashtag(id)
);
INSERT INTO posts(post_data) VALUES ('post1');
INSERT INTO hashtags (value) VALUES ('hashtag1'), ('hashtag2'), ('hashtag3');
-- At this point I might want to add links between post1 and existing hashtags as well as possibly new ones
Whenever a user makes a post using several hash tags, I want to:
posts
and get the IDhashtags
for hashtags that don't exist and get their IDspost_hashtags
with the post and hashtag idsRight now, I'm capable of handling it on the server side but that's terrible performance, obviously: insert post and get ID; for each hashtag, insert into hashtags
if it doesn't exist, get the ID; then insert into the post_hashtags
. Lots of calls to DB that I'm guessing could be simplified, but my SQL skills are lacking at the moment.
You can do this safely and efficiently with a single query using data-modifying CTEs, and an UPSERT for the tags.
Of course, you need a UNIQUE
or PRIMARY KEY
constraint on hashtags (value)
. (And you should have one on post_hashtags (post_id, hashtag_id)
.)
WITH input(post_data, tags) AS ( -- provide single data row with array of tags
VALUES ('post2', '{hashtag1, hashtag2, hashtag4}'::text[]) -- single post!
)
, tag_set AS ( -- unnest tags - may be empty/missing (?)
SELECT unnest(i.tags) AS value
FROM input i
)
, ins_p AS (
INSERT INTO posts (post_data)
SELECT i.post_data
FROM input i
RETURNING id AS post_id
)
, ins_h AS (
INSERT INTO hashtags (value)
SELECT t.value
FROM tag_set t
WHERE NOT EXISTS (SELECT FROM hashtags h WHERE h.value = t.value) -- optional to avoid burning lots of serial IDs
ON CONFLICT (value) DO NOTHING
RETURNING id AS hashtag_id
)
INSERT INTO post_hashtags
( post_id, hashtag_id)
SELECT p.post_id, t.hashtag_id
FROM ins_p p
CROSS JOIN ( -- only if actual tags were entered
TABLE ins_h -- new tags
UNION ALL
SELECT h.id AS hastag_id -- pre-existing tags
FROM tag_set t
JOIN hashtags h USING (value)
) t
RETURNING *;
This is safe, even under heavy concurrent write load.
Two possible corner cases remain:
A concurrent transaction might create the same new hashtag, but then roll back, resulting in a missing entry in table hashtags
for this query, raising an exception.
A pre-existing hashtag might be deleted by a concurrent transaction between looking it up here, and inserting into post_hashtags
. Again, raising an exception. Extremely unlikely, but possible.
You might just re-run this query if one of these two unlikely issues arises.
Or you cover both corner cases in the same query. I did not bother to go that far. See instructions and detailed explanation here:
Related: