In a project, we are using PostgreSQL 12.11 (upgrading is sadly not an option at the moment).
Consider the following relations:
document (
id uuid primary key
)
page (
id uuid primary key,
document_id uuid not null references document
)
word (
id uuid primary key,
page_id uuid not null references page,
text text not null
)
I had to implement a feature that quickly finds words across given documents by a given similarity threshold, and after some research I chose trigram matching with a GIN index.
These are my indexes:
CREATE INDEX word_page_id_idx ON word (page_id);
CREATE INDEX word_text_idx ON word USING GIN (text gin_trgm_ops);
CREATE INDEX page_document_id ON page (document_id);
This is the query I came up with:
SELECT w.*, word_similarity(:searchString, w.text) similarity
FROM document d
INNER JOIN page p ON p.document_id = d.id
INNER JOIN word w ON w.page_id = p.id
WHERE d.id IN (:documentIds)
AND :searchString <% w.text
ORDER BY similarity DESC, w.text;
It worked very well in the beginning, but now, with ~20 million rows in the word table, it is getting slower, especially with multiple documents.
My knowledge of indexes and the query planner is not deep enough to understand what exactly is going on here, or why the query is getting slower.
Here is what EXPLAIN ANALYZE gives me for a query with 18 document ids. I wonder why the plan considers 185700 word rows, although the given documents in total have only around 9k words associated with them. Where do I proceed from here? An AI suggested creating a combined GIN index on (page_id, text), which does not work...
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sort (cost=11351.77..11351.79 rows=8 width=63) (actual time=3825.241..3825.247 rows=74 loops=1) |
| Sort Key: (word_similarity('05.06.2025'::text, w.text)) DESC, w.text |
| Sort Method: quicksort Memory: 35kB |
| -> Nested Loop (cost=248.89..11351.65 rows=8 width=63) (actual time=130.456..3825.106 rows=74 loops=1) |
| -> Nested Loop (cost=0.83..138.23 rows=45 width=16) (actual time=0.517..4.050 rows=28 loops=1) |
| -> Index Only Scan using document_pkey on document d (cost=0.41..31.98 rows=18 width=16) (actual time=0.010..0.217 rows=18 loops=1) |
| Index Cond: (id = ANY ('{470d35da-a65e-4eb5-aae1-1ce9334ff617,44f05ab8-94f7-4e90-9008-a224b0f2c458,c9b8161d-2975-4902-884f-11c658960ca0,9315c0ec-b460-4044-b4a3-79b77f40faea,b6e0a7ef-37c3-4540-aeb9-424567e3c51f,0b0adbbe-7d70-484d-a589-f5952dd9c4ae,5f51a47e-11cc-4bf4-a3f6-d1452262d82f,099dfc29-1803-4a87-8820-891697b26047,3e6a8d9c-a40b-4980-a49e-8285eee4dedc,4078b105-aff4-478d-97ae-2b785b1bdd08,30b7a7f6-b67a-4f0c-8cfb-e686a01b8e89,7265f939-7f47-4264-8a72-43f71312ba74,3e03b9b5-2c24-4d7a-b063-5bec34ad6e1e,23f69244-81a9-4dbc-984d-21bfd2bd0147,39837f1c-491e-4b43-9c86-713c7b16ec7a,5ba1dce8-97e0-4e22-8685-e018c4dbbf31,15e34009-1ed7-470a-9b16-34fc416595eb,9fb7604b-eafd-49be-b8cf-d9f2c5d663bf}'::uuid[]))|
| Heap Fetches: 0 |
| -> Index Scan using page_document_id on page p (cost=0.42..5.86 rows=4 width=32) (actual time=0.203..0.209 rows=2 loops=18) |
| Index Cond: (document_id = d.id) |
| -> Bitmap Heap Scan on word ow (cost=248.06..249.18 rows=1 width=59) (actual time=136.266..136.449 rows=3 loops=28) |
| Recheck Cond: (page_id = p.id) |
| Filter: ('05.06.2025'::text <% text) |
| Rows Removed by Filter: 3 |
| Heap Blocks: exact=50 |
| -> BitmapAnd (cost=248.06..248.06 rows=1 width=0) (actual time=135.254..135.254 rows=0 loops=28) |
| -> Bitmap Index Scan on word_page_id_idx (cost=0.00..9.16 rows=761 width=0) (actual time=0.021..0.021 rows=251 loops=28) |
| Index Cond: (page_id = p.id) |
| -> Bitmap Index Scan on word_text_idx (cost=0.00..233.26 rows=21568 width=0) (actual time=135.229..135.229 rows=185700 loops=28) |
| Index Cond: (text %> '05.06.2025'::text) |
|Planning Time: 26.187 ms |
|Execution Time: 3825.320 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you're interested in top-N searches, use a GiST index, not GIN. From F.33.4. Index Support:
This will return all values in the text column that are sufficiently similar to
word, sorted from best match to worst. The index will be used to make this a fast operation even over very large data sets.
A variant of the above query is (...)
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.
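The variant the docs refer to is the KNN ("order by distance") form. Using the documentation's own sample table test_trgm and column t (not your schema), it looks roughly like this:

```sql
-- A GiST trigram index on t can execute this lazily, walking the index
-- from best match to worst and stopping after 10 rows, instead of
-- computing the distance for every row and sorting.
SELECT t, '05.06.2025' <<-> t AS dist
FROM test_trgm
ORDER BY dist
LIMIT 10;
```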
You can also enable pg_trgm to work in tandem with btree_gist to combine your indexes:
demo at db<>fiddle
DROP INDEX IF EXISTS word_text_idx;
DROP INDEX IF EXISTS word_page_id_idx;
CREATE INDEX word_text_pageid_idx ON word
USING GiST (text gist_trgm_ops, page_id)--worth testing with flipped order
INCLUDE(id)--enables index-only scan at the cost of larger index footprint
WITH(fillfactor=100);--only if documents are static or batch-loaded
This should simplify the bottom/innermost part of your plan. An analogous btree_gin extension is also available, if you'd like to test the same with a GIN index.
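For reference, a sketch of the GIN equivalent. Without btree_gin there is no GIN operator class for uuid, which is presumably why the combined index you tried failed to create:

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- btree_gin supplies the uuid operator class, so page_id can sit next to
-- the trigram column. Note that GIN supports neither INCLUDE nor the
-- KNN-style ORDER BY ... <<-> that GiST can serve.
CREATE INDEX word_pageid_text_gin_idx ON word
USING GIN (page_id, text gin_trgm_ops);
```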
I don't think you need to involve document at all. The document.id is already present in page.document_id, which you need to scan anyway, and you're not selecting anything else from document. If you're really only interested in the contents of word, cut document out:
SELECT w.*, :searchString <<-> w.text AS word_difference
FROM page p
JOIN word w ON w.page_id = p.id
WHERE p.document_id IN (:documentIds)
AND :searchString <% w.text
ORDER BY word_difference ASC, w.text
LIMIT 20;--if you don't really need all 74
An important note is that trigrams are built only from [:alnum:] class characters (whitespace and punctuation are ignored), so the . dots in '05.06.2025' only act as word boundaries but aren't part of any trigram themselves. With the default pg_trgm.word_similarity_threshold of 0.6, you might be catching surprising matches:
| v | word_similarity | t<<->v | is caught by t <% v |
|---|---|---|---|
| 06.2025.05 | 1 | 0 | t |
| 05.06.2025 | 1 | 0 | t |
| 06.05.2025 | 1 | 0 | t |
| 05.2025.06 | 1 | 0 | t |
| 025.20.05.06 | 0.8181818 | 0.18181819 | t |
| 07.06.2025 | 0.8 | 0.19999999 | t |
| 00.2025.06 | 0.8 | 0.19999999 | t |
| 05.2025 | 0.8 | 0.19999999 | t |
| 06.07.2025 | 0.6666667 | 0.3333333 | t |
| 25.06.2005 | 0.6 | 0.39999998 | t |
| 25.05.2006 | 0.6 | 0.39999998 | t |
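The table above can be reproduced with a query along these lines (the VALUES list below holds a few of the sample strings; t is the search string, v the stored value):

```sql
-- word_similarity, the <<-> word distance (1 - similarity), and whether
-- the <% operator catches the value at the default 0.6 threshold.
SELECT v,
       word_similarity('05.06.2025', v) AS word_similarity,
       '05.06.2025' <<-> v              AS dist,
       '05.06.2025' <% v                AS is_caught
FROM (VALUES ('06.2025.05'), ('07.06.2025'), ('25.06.2005')) AS s(v)
ORDER BY word_similarity DESC;
```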
As already pointed out by the commenters, none of this will help much unless your search is selective enough.
If you're looking for a date, you could filter your word table on pg_input_is_valid("text", 'date'), which lets you safely cast "text"::date on the subset and then compare against a date(multi)range. Note that pg_input_is_valid only exists as of PostgreSQL 16 (and multiranges as of 14), so on 12.11 you'd need a plpgsql helper that traps the cast error instead. The same applies to any data type: you could search documents for an IP address in a given network, a timestamp between specific ::times of day, a date on a given weekday, a long enough interval, or a numeric of a desired scale.
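On 12.11, a minimal stand-in (hypothetical helper name try_date) could trap the cast error in plpgsql:

```sql
-- Returns NULL instead of raising when the text is not a valid date.
-- STABLE rather than IMMUTABLE, because text-to-date casting depends on
-- the DateStyle setting.
CREATE FUNCTION try_date(t text) RETURNS date AS $$
BEGIN
    RETURN t::date;
EXCEPTION WHEN others THEN
    RETURN NULL;
END
$$ LANGUAGE plpgsql STABLE;

-- Filter words that parse as a date in a given range:
SELECT w.*
FROM word w
WHERE try_date(w.text) BETWEEN DATE '2025-06-01' AND DATE '2025-06-30';
```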
Worth mentioning, but to make this fast, you'd have to extract and index the data for each target type, at which point you're stepping into semantic analysis and might as well add pgvector to the mix.