postgresqldatabase-indexespostgresql-10pg-trgm

Trigram Index ORDER BY optimization


I am trying to implement a search function and after some investigation (see this interesting read by Yorick Peterse at GitLab) I decided I would opt for the trigram approach using the pg_trgm extension.

I'd like to return the 10 most relevant rows.

Here are a couple of queries I tested (following the doc) against a table with 110868 rows:

SELECT name, similarity(name, 'search query') AS sml
FROM table
ORDER BY sml DESC, name;
Time: 701.814 ms

SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name;
Time: 376.692 ms

SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 378.921 ms

With a GiST index:

CREATE INDEX trigram_index ON table USING GIST (name gist_trgm_ops);
SELECT name, similarity(name, 'search query') AS sml
FROM table
WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 36.877 ms

With a GIN index:

CREATE INDEX trigram_index ON table USING GIN (name gin_trgm_ops);
SELECT name, similarity(name, 'search query') AS sml
FROM table WHERE name % 'search query'
ORDER BY sml DESC, name LIMIT 10;
Time: 18.992 ms

With EXPLAIN ANALYZE:

 Limit  (cost=632.37..632.39 rows=10 width=25) (actual time=22.202..22.204 rows=10 loops=1)
   ->  Sort  (cost=632.37..632.64 rows=111 width=25) (actual time=22.201..22.201 rows=10 loops=1)
         Sort Key: (similarity((name)::text, 'search query'::text)) DESC, name
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on table  (cost=208.86..629.97 rows=111 width=25) (actual time=6.900..22.157 rows=134 loops=1)
               Recheck Cond: ((name)::text % 'search query'::text)
               Rows Removed by Index Recheck: 2274
               Heap Blocks: exact=2257
               ->  Bitmap Index Scan on trigram_index  (cost=0.00..208.83 rows=111 width=0) (actual time=6.532..6.532 rows=2408 loops=1)
                     Index Cond: ((name)::text % 'World of Warcraft'::text)
 Planning time: 0.073 ms
 Execution time: 18.521 ms

Using a GIN index considerably improves the performance. Limiting the result to 10 rows doesn't seem, however, to have any impact.

Is there still room for improvement I haven't considered? I am especially interested in suggestions that would harness the fact I only need a small subset of the whole table.


Solution

  • As the documentation says, a GIN index won't help to optimize the ORDER BY clause:

    A variant of the above query is

    SELECT t, t <-> 'word' AS dist
    FROM test_trgm
    ORDER BY dist LIMIT 10;
    

    This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

    On the other hand, GIN indexes often perform better than GiST indexes for larger tables.

    So I think you should try both and just use the one that is faster on a realistically sized test table.

    I don't think that you can improve that much more, except by using more RAM to cache the data.