postgresql, indexing, trigram, unaccent

Multi-column index with unaccent and pg_trgm (matching dirty data)


I have a table with customer data that has 12M+ records. I want to query it based on a few fields, for example first_name, last_name and birth_place. The data is really dirty, so I also want records that don't match completely. I'm using the unaccent and pg_trgm modules for that.
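Both modules are already installed as extensions, roughly like this (assuming the default schema):

CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;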

I followed this question to be able to use unaccent in an index, hence f_unaccent() instead of unaccent() in the query.
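For reference, f_unaccent() is the IMMUTABLE SQL wrapper around unaccent() from that answer; mine looks roughly like this (the exact definition may differ slightly):

CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary so the result does not depend on search_path
$func$;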

Index:

CREATE INDEX first_name_idx ON customer USING gist(f_unaccent(coalesce(first_name, '')) gist_trgm_ops);
CREATE INDEX last_name_idx ON customer USING gist(f_unaccent(coalesce(last_name, '')) gist_trgm_ops);
CREATE INDEX birthplace_idx ON customer USING gist(f_unaccent(coalesce(birthplace, '')) gist_trgm_ops);

SELECT:

WITH t AS (
    SELECT id,
        first_name, f_unaccent(coalesce(first_name, '')) <-> unaccent('Oliver') AS first_name_distance,
        last_name, f_unaccent(coalesce(last_name, '')) <-> unaccent('Twist') AS last_name_distance,
        birthplace, f_unaccent(coalesce(birthplace, '')) <-> unaccent('London') AS birthplace_distance
    FROM customer
),
s AS (
    SELECT t.id, t.first_name_distance + t.last_name_distance + t.birthplace_distance AS total FROM t
)
SELECT * FROM t JOIN s ON (t.id = s.id);

When I run EXPLAIN ANALYZE on it, it does a sequential scan and uses no index. I know the first SELECT runs over the whole table, so maybe that's expected. I'm using <-> rather than the similarity(text, text) function because I also want records where some field has 0 similarity; the SUM of the similarities is what I care about.
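As far as I understand, <-> is just the trigram distance, i.e. 1 - similarity(a, b), which can be checked directly:

SELECT similarity('Oliver', 'Olivier') AS sim,
       'Oliver' <-> 'Olivier'          AS dist;  -- dist = 1 - sim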

On real data this query (with 6 fields in the condition, not 3) takes around 12 minutes. That is without indexes; I didn't create them because I saw on the test data that they aren't even used...

How can I make this query run faster? Thank you.


Solution

  • Since the query fetches all rows from customer, using a sequential scan is the fastest option; an index only pays off when the query can skip most of the table (see the sketch below for a query shape that would actually use one of the trigram indexes).
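For illustration only: a GiST gist_trgm_ops index can serve <-> when a query orders by the indexed distance expression and limits the result, as in this hypothetical single-column nearest-neighbour search (the sum of distances over several columns in the original query cannot be served this way):

-- hypothetical: the ORDER BY expression matches first_name_idx, so the planner
-- can use the GiST index for a k-nearest-neighbour scan instead of a seq scan
SELECT id, first_name
FROM customer
ORDER BY f_unaccent(coalesce(first_name, '')) <-> unaccent('Oliver')
LIMIT 50;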