postgresqlpg-trgm

Will trigram index on multiple columns make search faster and how to make such search properly?


Let's assume I have table with multiple columns. For instance:

id int
name text
surname text
cars json

example record would be

+----+------+---------+------------------------------------+
| id | name | surname |              cars                  |
+----+------+---------+------------------------------------+
|  1 | John | Doe     | {"values":["Ford", "BMW", "Fiat"]} |
+----+------+---------+------------------------------------+

I want to search all this table data for relevance like this:

select *,
       similarity(
          'Malcolm Joe likes Ferrary, but hates BMW',
          (name || (cars ->> 'values') || surname)
       ) sim
from public.test_table
where similarity(
         'Malcolm Joe likes Ferrary, but hates BMW',
         (name || (cars ->> 'values') || surname)
      ) > 0.05
order by sim desc;

Is there any way to speed up this search? Creating a trigram index? If so - how to create it better? on one column, on each column, on a concatenation expression? Also, I haven't understand which type of index is better - GIN or GiST. I've read that GIN is usually better for regular full text search, but GiST is better for trigram search. Is that correct?

I also wanted to ask if there is better way to write the above query?

If anyone wonders why I chose trigram, and not regular full text search - it's because search strings will come from processing some user input, so there can be mistakes or even cases when english 'o' or 'c' is replaced by cirillic letters. my database records or search can also contain alphanumeric data, which is also better processed with trigram.


Solution

  • In this case you need a GiST index, because only that can be used with ORDER BY queries using the trigram distance operator:

    CREATE INDEX ON public.test_table USING gist
       ((name || (cars ->> 'values') || surname) gist_trgm_ops);
    

    The query should then be rewritten to:

    SELECT *,
           similarity(
              'Malcolm Joe likes Ferrary, but hates BMW',
              (name || (cars ->> 'values') || surname)
           ) sim
    FROM public.test_table
    WHERE ((name || (cars ->> 'values') || surname)
           <->
           'Malcolm Joe likes Ferrary, but hates BMW')
          < 0.95
    ORDER BY (name || (cars ->> 'values') || surname)
             <->   /* trigram distance */
             'Malcolm Joe likes Ferrary, but hates BMW'
    LIMIT 50;
    

    The query had to be rewritten, because there is index support for <->, but not for similarity() in ORDER BY expressions.

    I added the LIMIT to hint the optimizer, put in a limit as appropriate.

    I think that in general GIN indexes perform better for large tables, but I am not certain. You have no choice with this query anyway, because GIN indexes won't support that ORDER BY clause.