postgresqlindexingtrigram

Trigram search with operators not selecting the correct result set


I have a use case where a substring search is required within a string provided by the client. The substrings to match are present in a match table within the value column.

CREATE TABLE match (
    value varchar not null,
    type  varchar not null,
);

A GiST trigram index has been created on the value column.

CREATE INDEX trgm_idx ON match USING GIST (value gist_trgm_ops(siglen=32));

For the purpose of this question let us assume that match contains the following two rows:

value type
some_str type_a
12345678 type_b

I would like to understand how exactly the <<% works for strict word similarity.

The query below:

SELECT value,strict_word_similarity(value, 'file some_str 01') as sml
FROM match
WHERE 'file some_str 01' <<% value
ORDER BY sml DESC, value;

returns the following:

value sml
some_str 1

Similarly, the query with 12345678 returns the correct match row:

SELECT value,strict_word_similarity(value, 'file 12345678 01') as sml
FROM match
WHERE 'file 12345678 01' <<% value
ORDER BY sml DESC, value;
value sml
12345678 1

However, if both the search terms are included in the query, no results are returned:

SELECT value,strict_word_similarity(value, 'file some_str 12345678 01') as sml
FROM match
WHERE 'file some_str 12345678 01' <<% value
ORDER BY sml DESC, value;

No rows returned

As per my expectations both the matching rows should be returned.

I would like to know what exactly is preventing the matching from happening.

Interestingly, if strict_word_similarity(value, 'file some_str 12345678 01') is used in the WHERE clause instead of <<%, both rows are indeed returned. However, the GiST index is no longer used by the query.

Is it possible to have multiple matches while also using the index as defined above


Solution

  • Your function and your operator work in opposite directions. The operator that matches strict_word_similarity argument ordering is %>>, not <<%.

    With the operator you are using, neither of the rows are returned because they have scores which don't meet the default pg_trgm.strict_word_similarity_threshold value. If you use the other operator, both will be returned.