I'm currently facing the following real-world use case challenge.
Library admins want to search for books in their library database using book and category names. Their database has over 2 million book records, and each book is associated with 10 categories. There are around 100 categories in total.
The DB schema is the following:
Table creation queries:
-- Create the 'book' table
CREATE TABLE book (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL
);
-- Create the 'category' table
CREATE TABLE category (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL
);
-- Create the join table 'book_categories' to model the many-to-many relationship
CREATE TABLE book_categories (
book_id INT NOT NULL,
category_id INT NOT NULL,
PRIMARY KEY (book_id, category_id),
CONSTRAINT fk_book
FOREIGN KEY (book_id)
REFERENCES book (id)
ON DELETE CASCADE,
CONSTRAINT fk_category
FOREIGN KEY (category_id)
REFERENCES category (id)
ON DELETE CASCADE
);
I created GIN indexes on the name fields of the book and category tables:
CREATE INDEX book_fts_idx ON book USING gin (to_tsvector('english', name));
CREATE INDEX category_fts_idx ON category USING gin (to_tsvector('english', name));
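Note that the 'english' configuration stems search terms, which is why the plans later in this post show 'categori' and 'romanc' rather than the raw words. A quick check:
-- Stemming check: 'category' normalizes to 'categori', 'romance' to 'romanc',
-- matching the tsquery values visible in the EXPLAIN output below.
SELECT to_tsquery('english', 'category');
SELECT to_tsquery('english', 'romance');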
For test purposes, let's use the search term 'category'. It returns only 41 rows, yet takes 9 to 10 seconds, which I consider a huge response time. Since I'm displaying the query results on a UI page, I also have pagination. The count(*) query is faster for this search term.
Using another search term, 'romance' this time, which matches over 1.5 million rows via the book_categories table, I see no delay when querying the results, but now the count(*) is quite slow, taking around 5 seconds. It's clear to me why the two search terms behave so differently.
Here are the queries I'm using:
To query the search term results:
SELECT "book"."id" AS "book_id",
"book"."name" AS "book_name",
"bookCategory"."book_id" AS "bookCategory_book_id",
"bookCategory"."category_id" AS "bookCategory_category_id",
"category"."id" AS "category_id",
"category"."name" AS "category_name"
FROM "book" "book"
LEFT JOIN "book_categories" "bookCategory" ON "bookCategory"."book_id" = "book"."id"
LEFT JOIN "category" "category" ON "category"."id" = "bookCategory"."category_id"
WHERE to_tsvector('english', "book"."name") @@ to_tsquery('english', 'romance')
OR EXISTS (
SELECT 1
FROM book_categories bc
JOIN category c ON bc.category_id = c.id
WHERE bc.book_id = "book"."id"
AND to_tsvector('english', c.name) @@ to_tsquery('english', 'romance')
);
The count query:
SELECT COUNT(*)
FROM (
SELECT id
FROM book
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'romance')
UNION
SELECT b.id
FROM book b
JOIN book_categories bc ON bc.book_id = b.id
JOIN category c ON c.id = bc.category_id
WHERE to_tsvector('english', c.name) @@ to_tsquery('english', 'romance')
) AS matching_books;
Running EXPLAIN ANALYZE on both queries, I see that the GIN indexes aren't used. That makes sense, since my queries use OR and EXISTS operators. However, I need those operators to meet the user requirements.
I'm wondering if there's a way to even out the response times of my queries across search terms. It might be a technical limitation, and I may need to consider other strategies.
If you want to reproduce this scenario, here are a few procedures to populate the DB:
Populate the book table:
CREATE OR REPLACE PROCEDURE public.populate_books()
LANGUAGE plpgsql
AS $procedure$
DECLARE
i INTEGER := 1;
random_index INTEGER;
names TEXT[] := ARRAY[
'The Whispering Fog',
'Echoes of a Shattered Sky',
'Beneath the Crimson Leaves',
'The Algorithm of Souls',
'Lanterns in the Abyss',
'Chronicles of the Forgotten Gate',
'Paper Moons and Glass Hearts',
'The Clockmaker’s Paradox',
'Harvest of the Hollow Moon',
'Voices Beyond the Veil'
];
BEGIN
WHILE i <= 2000000 LOOP
random_index := floor(random() * array_length(names, 1)) + 1;
INSERT INTO book(name) VALUES (names[random_index] || '-' || i);
i := i + 1;
END LOOP;
END;
$procedure$
;
Populate the category table:
CREATE OR REPLACE PROCEDURE public.populate_categories()
LANGUAGE plpgsql
AS $procedure$
DECLARE
i INTEGER := 1;
random_index INTEGER;
categories TEXT[] := ARRAY[
'Romance',
'Mystery & Thriller',
'Science Fiction',
'Fantasy',
'Historical Fiction',
'Biography & Memoir',
'Self-Help',
'Young Adult',
'Horror',
'Non-Fiction'
];
BEGIN
WHILE i <= 100 LOOP
random_index := floor(random() * array_length(categories, 1)) + 1;
INSERT INTO category(name) VALUES (categories[random_index]);
i := i + 1;
END LOOP;
END;
$procedure$
;
Populate the book categories table:
CREATE OR REPLACE PROCEDURE public.populate_book_categories()
LANGUAGE plpgsql
AS $procedure$
DECLARE
v_book_id INT;
cur_books CURSOR FOR SELECT id FROM book;
BEGIN
OPEN cur_books;
LOOP
FETCH cur_books INTO v_book_id;
EXIT WHEN NOT FOUND;
-- For the current book, insert 10 random categories
INSERT INTO book_categories (book_id, category_id)
SELECT v_book_id, c.id
FROM category c
ORDER BY random()
LIMIT 10;
END LOOP;
CLOSE cur_books;
END;
$procedure$
;
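To load the test data, call the procedures in dependency order; populate_book_categories reads from the other two tables, so it must run last:
-- Populate base tables first, then the join table.
CALL public.populate_books();
CALL public.populate_categories();
CALL public.populate_book_categories();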
I'm running these queries on a machine with 64GB of RAM and a 16-core CPU. However, when running them in production, the searches take even longer to complete due to hardware limitations.
EDIT with the EXPLAIN ANALYZE results:
Explain when using 'romance' as the search term. It matches over 1.5 million records; the select query is fast for this one:
Hash Left Join (cost=50.24..35399184.98 rows=10050162 width=62) (actual time=89.004..21577.881 rows=14775686 loops=1)
Hash Cond: ("bookCategory".category_id = category.id)
-> Merge Left Join (cost=45.96..35371692.01 rows=10050162 width=45) (actual time=88.967..20397.038 rows=14775686 loops=1)
Merge Cond: (book.id = "bookCategory".book_id)
-> Index Scan using "PK_a3afef72ec8f80e6e5c310b28a4" on book (cost=0.43..34610771.81 rows=1005003 width=37) (actual time=88.949..18146.927 rows=1477569 loops=1)
Filter: ((to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery) OR (SubPlan 1))
Rows Removed by Filter: 522437
SubPlan 1
-> Nested Loop (cost=8.44..17.01 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=2000006)
-> Bitmap Heap Scan on category c (cost=8.00..12.27 rows=1 width=4) (actual time=0.001..0.001 rows=7 loops=2000006)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
Heap Blocks: exact=2000006
-> Bitmap Index Scan on category_fts_idx (cost=0.00..8.00 rows=1 width=0) (actual time=0.001..0.001 rows=12 loops=2000006)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
-> Index Only Scan using "PK_76506a56b5e205f79d9cdfc39ef" on book_categories bc (cost=0.44..4.46 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=14153182)
Index Cond: ((book_id = book.id) AND (category_id = c.id))
Heap Fetches: 6
-> Index Only Scan using "PK_76506a56b5e205f79d9cdfc39ef" on book_categories "bookCategory" (cost=0.44..607905.27 rows=20000322 width=8) (actual time=0.012..1090.640 rows=20000007 loops=1)
Heap Fetches: 59
-> Hash (cost=3.01..3.01 rows=101 width=17) (actual time=0.023..0.024 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Seq Scan on category (cost=0.00..3.01 rows=101 width=17) (actual time=0.004..0.009 rows=101 loops=1)
Planning Time: 2.448 ms
JIT:
Functions: 23
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.567 ms, Inlining 6.400 ms, Optimization 52.799 ms, Emission 29.622 ms, Total 91.388 ms
Execution Time: 21940.311 ms
Explain for the count query. This one takes around 4 to 5 seconds to complete:
Aggregate (cost=303878.41..303878.42 rows=1 width=8) (actual time=2631.373..2631.518 rows=1 loops=1)
-> Unique (cost=300238.00..301278.12 rows=208023 width=4) (actual time=2183.411..2589.611 rows=1477569 loops=1)
-> Sort (cost=300238.00..300758.06 rows=208023 width=4) (actual time=2183.409..2438.316 rows=2397789 loops=1)
Sort Key: book.id
Sort Method: external merge Disk: 32936kB
-> Gather (cost=1012.71..279017.43 rows=208023 width=4) (actual time=6.317..1441.822 rows=2397789 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=12.71..257215.13 rows=208023 width=4) (actual time=5.865..1457.075 rows=799263 loops=3)
-> Parallel Bitmap Heap Scan on book (cost=109.50..22132.63 rows=4167 width=4) (actual time=0.129..0.129 rows=0 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
-> Bitmap Index Scan on book_fts_idx (cost=0.00..107.00 rows=10000 width=0) (actual time=0.022..0.022 rows=0 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
-> Nested Loop (cost=12.71..231962.15 rows=82510 width=4) (actual time=5.821..1419.794 rows=799263 loops=3)
-> Hash Join (cost=12.28..194637.23 rows=82510 width=4) (actual time=5.772..631.357 rows=799263 loops=3)
Hash Cond: (bc.category_id = c.id)
-> Parallel Seq Scan on book_categories bc (cost=0.00..171831.67 rows=8333468 width=8) (actual time=0.008..287.786 rows=6666680 loops=3)
-> Hash (cost=12.27..12.27 rows=1 width=4) (actual time=5.679..5.680 rows=12 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Bitmap Heap Scan on category c (cost=8.00..12.27 rows=1 width=4) (actual time=5.661..5.666 rows=12 loops=3)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
Heap Blocks: exact=1
-> Bitmap Index Scan on category_fts_idx (cost=0.00..8.00 rows=1 width=0) (actual time=0.009..0.009 rows=12 loops=3)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''romanc'''::tsquery)
-> Index Only Scan using "PK_a3afef72ec8f80e6e5c310b28a4" on book b (cost=0.43..0.45 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=2397789)
Index Cond: (id = bc.book_id)
Heap Fetches: 2397789
Planning Time: 0.867 ms
JIT:
Functions: 60
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.370 ms, Inlining 0.000 ms, Optimization 0.961 ms, Emission 16.041 ms, Total 19.372 ms
Execution Time: 2635.360 ms
Explain results when using the search term 'category'. This search term only has 40 to 50 matching records in the DB; the select query is slow for this one:
Hash Left Join (cost=50.24..35399184.98 rows=10050162 width=62) (actual time=10596.360..10596.443 rows=41 loops=1)
Hash Cond: ("bookCategory".category_id = category.id)
-> Merge Left Join (cost=45.96..35371692.01 rows=10050162 width=45) (actual time=10596.297..10596.376 rows=41 loops=1)
Merge Cond: (book.id = "bookCategory".book_id)
-> Index Scan using "PK_a3afef72ec8f80e6e5c310b28a4" on book (cost=0.43..34610771.81 rows=1005003 width=37) (actual time=8981.994..8982.060 rows=6 loops=1)
Filter: ((to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery) OR (SubPlan 1))
Rows Removed by Filter: 2000000
SubPlan 1
-> Nested Loop (cost=8.44..17.01 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=2000006)
-> Bitmap Heap Scan on category c (cost=8.00..12.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=2000006)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
Heap Blocks: exact=2000006
-> Bitmap Index Scan on category_fts_idx (cost=0.00..8.00 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=2000006)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
-> Index Only Scan using "PK_76506a56b5e205f79d9cdfc39ef" on book_categories bc (cost=0.44..4.46 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=2000006)
Index Cond: ((book_id = book.id) AND (category_id = c.id))
Heap Fetches: 6
-> Index Only Scan using "PK_76506a56b5e205f79d9cdfc39ef" on book_categories "bookCategory" (cost=0.44..607905.27 rows=20000322 width=8) (actual time=0.031..1018.744 rows=20000041 loops=1)
Heap Fetches: 93
-> Hash (cost=3.01..3.01 rows=101 width=17) (actual time=0.039..0.039 rows=101 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Seq Scan on category (cost=0.00..3.01 rows=101 width=17) (actual time=0.012..0.016 rows=101 loops=1)
Planning Time: 1.552 ms
JIT:
Functions: 23
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.632 ms, Inlining 5.905 ms, Optimization 35.354 ms, Emission 23.646 ms, Total 66.537 ms
Execution Time: 10598.281 ms
Explain for the count query. This one is fast:
Aggregate (cost=303878.41..303878.42 rows=1 width=8) (actual time=550.233..554.165 rows=1 loops=1)
-> Unique (cost=300238.00..301278.12 rows=208023 width=4) (actual time=550.196..554.132 rows=6 loops=1)
-> Sort (cost=300238.00..300758.06 rows=208023 width=4) (actual time=550.195..554.127 rows=6 loops=1)
Sort Key: book.id
Sort Method: quicksort Memory: 25kB
-> Gather (cost=1012.71..279017.43 rows=208023 width=4) (actual time=550.033..554.113 rows=6 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Append (cost=12.71..257215.13 rows=208023 width=4) (actual time=449.180..534.723 rows=2 loops=3)
-> Parallel Bitmap Heap Scan on book (cost=109.50..22132.63 rows=4167 width=4) (actual time=0.135..0.136 rows=0 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
-> Bitmap Index Scan on book_fts_idx (cost=0.00..107.00 rows=10000 width=0) (actual time=0.021..0.021 rows=0 loops=1)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
-> Nested Loop (cost=12.71..231962.15 rows=82510 width=4) (actual time=449.134..534.675 rows=2 loops=3)
-> Hash Join (cost=12.28..194637.23 rows=82510 width=4) (actual time=449.112..534.651 rows=2 loops=3)
Hash Cond: (bc.category_id = c.id)
-> Parallel Seq Scan on book_categories bc (cost=0.00..171831.67 rows=8333468 width=8) (actual time=0.012..269.890 rows=6666680 loops=3)
-> Hash (cost=12.27..12.27 rows=1 width=4) (actual time=7.562..7.563 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Bitmap Heap Scan on category c (cost=8.00..12.27 rows=1 width=4) (actual time=7.546..7.551 rows=1 loops=3)
Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
Heap Blocks: exact=1
-> Bitmap Index Scan on category_fts_idx (cost=0.00..8.00 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=3)
Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''categori'''::tsquery)
-> Index Only Scan using "PK_a3afef72ec8f80e6e5c310b28a4" on book b (cost=0.43..0.45 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=6)
Index Cond: (id = bc.book_id)
Heap Fetches: 6
Planning Time: 1.114 ms
JIT:
Functions: 60
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.302 ms, Inlining 0.000 ms, Optimization 1.216 ms, Emission 21.443 ms, Total 24.961 ms
Execution Time: 555.442 ms
The COUNT query can't be improved; it's already using the FTS indexes in both cases.
The other query can be improved by changing the OR to a UNION. It's also dealing with a huge join on category, and would probably be helped by aggregating category into a subquery.
So add the category after unioning, in an aggregated JSON subquery:
SELECT
    b.id AS book_id,
    b.name AS book_name,
    (
        SELECT jsonb_agg(to_jsonb(c.*))
        FROM book_categories bc
        JOIN category c ON c.id = bc.category_id
        WHERE bc.book_id = b.id
    ) AS categories
FROM (
    SELECT b.id, b.name
    FROM book b
    WHERE to_tsvector('english', b.name) @@ to_tsquery('english', 'romance')
    UNION
    SELECT b.id, b.name
    FROM book b
    WHERE EXISTS (
        SELECT 1
        FROM book_categories bc
        JOIN category c ON c.id = bc.category_id
        WHERE bc.book_id = b.id
          AND to_tsvector('english', c.name) @@ to_tsquery('english', 'romance')
    )
) b;
Or remove the left join on category and don't return it at all:
SELECT
    b.id AS book_id,
    b.name AS book_name
FROM book b
WHERE to_tsvector('english', b.name) @@ to_tsquery('english', 'romance')
UNION
SELECT
    b.id AS book_id,
    b.name AS book_name
FROM book b
WHERE EXISTS (
    SELECT 1
    FROM book_categories bc
    JOIN category c ON c.id = bc.category_id
    WHERE bc.book_id = b.id
      AND to_tsvector('english', c.name) @@ to_tsquery('english', 'romance')
);
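Since the results feed a paginated UI page, the same UNION can be wrapped per page. A minimal sketch; the ordering column, page size (20), and offset (0) are placeholder choices I'm adding, not part of the queries above:
-- Pagination sketch: ORDER BY column, LIMIT, and OFFSET are placeholders;
-- substitute the page parameters your UI actually uses.
SELECT book_id, book_name
FROM (
    SELECT b.id AS book_id, b.name AS book_name
    FROM book b
    WHERE to_tsvector('english', b.name) @@ to_tsquery('english', 'romance')
    UNION
    SELECT b.id, b.name
    FROM book b
    WHERE EXISTS (
        SELECT 1
        FROM book_categories bc
        JOIN category c ON c.id = bc.category_id
        WHERE bc.book_id = b.id
          AND to_tsvector('english', c.name) @@ to_tsquery('english', 'romance')
    )
) AS page_source
ORDER BY book_id
LIMIT 20 OFFSET 0;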