Are there any additional steps I can take to speed up query execution?
I have a table with more than 100m rows and I need to do search for matching strings. For that I checked two options:
@@
(to_tsquery or plainto_tsquery)I found that I can use indexes to improve performance.
For my GiST index I tried to increase siglen
from small number to 2024, but for some reason Postgres uses 512
and not higher.
CREATE INDEX trgm_idx_512_gg ON table USING GIST (name gist_trgm_ops(siglen=512));
Query:
SELECT name, similarity(name, 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086') as sm
FROM table
WHERE name % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'
EXPLAIN
output:
Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126)
Recheck Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)
-> Bitmap Index Scan on trgm_idx_512_gg (cost=0.00..1629.57 rows=9737 width=0)
Index Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)
Execution time was about 120 sec.
How can I improve or speed up query? Maybe I need to use a different approach or just add something else?
Output for EXPLAIN (ANALYZE, BUFFERS)
(searching for a different name so that the search is completely new and not from the cache):
Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126) (actual time=159119.258..159960.251 rows=5645 loops=1)
Recheck Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)
Heap Blocks: exact=3795
Buffers: shared read=1289378
-> Bitmap Index Scan on trgm_idx_512_gg (cost=0.00..1629.57 rows=9737 width=0) (actual time=159118.616..159118.616 rows=5645 loops=1)
Index Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)
Buffers: shared read=1285583
Planning:
Buffers: shared read=5
Planning Time: 4.063 ms
Execution Time: 159961.121 ms
I also created a GIN index (but Postgres kept using the GiST):
CREATE INDEX gin_gg ON table USING GIN (name gin_trgm_ops);
Size: 12 GB.
GIST index: 31GB
A trigram GiST index with siglen=512
on 100 million rows is very large, and will probably never be cached efficiently. Default is siglen=12
i.e. 12 bytes. What makes you think this large signature would be a good choice? The manual:
Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index.
Looks like you went overboard with the size.
I have better experience with trigram GIN indexes, especially in current versions of Postgres. If the query planner is confused by the existence of an additional GiST index, remove that one to test results with the GIN index.
But first, to get a size comparison, look at the output of:
SELECT i.indexrelid::regclass::text AS idx
, pg_get_indexdef(i.indexrelid) AS idx_def
, pg_size_pretty(pg_relation_size(i.indexrelid)) AS idx_size
FROM pg_class t
JOIN pg_index i ON i.indrelid = t.oid
WHERE t.oid = 'public.tbl'::regclass -- your table name here!
ORDER BY 1;
Your query plan shows vast amounts of Buffers: shared read
for index and main relation (heap). So nothing was found in cache. The key to better performance will be to read fewer data pages to satisfy your queries, and more of them from cache: hit
instead of read
in the query plan.
Reducing the size of table and indexes helps in this reagard.
The selectivity of the trigram similarity operator %
is set by the customized option pg_trgm.similarity_threshold
. The default 0.3
is rather lax and allows many hits. A higher similarity threshold will filter fewer (better matching) result rows. What do you do with rows=5645
result rows anyway? Try:
SET pg_trgm.similarity_threshold = 0.5; -- or higher
Then retry your query.
See:
The latest Postgres version, better server configuration, and more RAM can also help in this regard. You disclosed no information there.