I have defined the following index:
CREATE INDEX
users_search_idx
ON
auth_user
USING
gin(
username gin_trgm_ops,
first_name gin_trgm_ops,
last_name gin_trgm_ops
);
I am performing the following query:
PREPARE user_search (TEXT, INT) AS
SELECT
username,
email,
first_name,
last_name,
( -- would probably do per-field weightings here
s_username + s_first_name + s_last_name
) rank
FROM
auth_user,
similarity(username, $1) s_username,
similarity(first_name, $1) s_first_name,
similarity(last_name, $1) s_last_name
WHERE
username % $1 OR
first_name % $1 OR
last_name % $1
ORDER BY
rank DESC
LIMIT $2;
The auth_user
table has 6.2 million rows.
The speed of the query seems to depend very heavily on the number of results potentially returned by the similarity
query.
Increasing the similarity threshold via set_limit
helps, but reduces usefulness of results by eliminating partial matches.
Some searches return in 200ms, others take ~ 10 seconds.
We have an existing implementation of this feature using Elasticsearch that returns in < 200ms for any query, while doing more complicated (better) ranking.
I would like to know if there is any way we could improve this to get more consistent performance?
It's my understanding that GIN index (inverted index) is the same basic approach used by Elasticsearch so I would have thought there is some optimization possible.
An EXPLAIN ANALYZE EXECUTE user_search('mel', 20)
shows:
Limit (cost=54099.81..54099.86 rows=20 width=52) (actual time=10302.092..10302.104 rows=20 loops=1)
-> Sort (cost=54099.81..54146.66 rows=18739 width=52) (actual time=10302.091..10302.095 rows=20 loops=1)
Sort Key: (((s_username.s_username + s_first_name.s_first_name) + s_last_name.s_last_name)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop (cost=382.74..53601.17 rows=18739 width=52) (actual time=118.164..10293.765 rows=8380 loops=1)
-> Nested Loop (cost=382.74..53132.69 rows=18739 width=56) (actual time=118.150..10262.804 rows=8380 loops=1)
-> Nested Loop (cost=382.74..52757.91 rows=18739 width=52) (actual time=118.142..10233.990 rows=8380 loops=1)
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380loops=1)"
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text %'mel'::text))"
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
-> BitmapOr (cost=382.74..382.74 rows=18757 width=0) (actual time=107.436..107.436 rows=0 loops=1)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=40.200..40.200rows=88908 loops=1)"
Index Cond: ((username)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=43.847..43.847rows=102028 loops=1)"
Index Cond: ((first_name)::text % 'mel'::text)
-> Bitmap Index Scan on users_search_idx (cost=0.00..122.89 rows=6252 width=0) (actual time=23.387..23.387rows=58740 loops=1)"
Index Cond: ((last_name)::text % 'mel'::text)
-> Function Scan on similarity s_username (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=8380)
-> Function Scan on similarity s_first_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
-> Function Scan on similarity s_last_name (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=8380)
Execution time: 10302.559 ms
Server is Postgres 9.6.1 running on Amazon RDS
Shortly after posting the question I found this info: https://www.postgresql.org/message-id/464F3C5D.2000700@enterprisedb.com
So I tried
-> SHOW work_mem;
4MB
-> SET work_mem='12MB';
-> EXECUTE user_search('mel', 20);
(results returned in ~1.5s)
This made a big improvement (previously > 10s)!
1.5s is still way slower than ES for similar query so I would still like to hear any suggestions for optimising the query.
2.In response to comments, and after seeing this question (Postgresql GIN index slower than GIST for pg_trgm), I tried exactly the same set up with a GIST index in place of the GIN one.
Trying the same search above, it returned in ~3.5s, using default work_mem='4MB'
. Increasing work_mem
made no difference.
From this I conclude that GIST index is more memory efficient (did not hit pathological case like GIN did) but is slower than GIN when GIN is working properly. This is inline with what's described in the docs recommending GIN index.
3.I still don't understand why so much time is spent in:
-> Bitmap Heap Scan on auth_user (cost=382.74..52383.13 rows=18739 width=48) (actual time=118.128..10186.816 rows=8380loops=1)"
Recheck Cond: (((username)::text % 'mel'::text) OR ((first_name)::text % 'mel'::text) OR ((last_name)::text %'mel'::text))"
Rows Removed by Index Recheck: 2434523
Heap Blocks: exact=49337 lossy=53104
I don't understand why this step is needed or what it's doing.
There are the three Bitmap Index Scan
beneath it for each of the username % $1
clauses... these results then get combined with a BitmapOr
step. These parts are all quite fast.
But even in the case where we don't run out of work mem, we still spend nearly a whole second in Bitmap Heap Scan
.
I expect much faster results with this approach:
Create a GiST index with 1 column holding concatenated values:
CREATE INDEX users_search_idx ON auth_user
USING gist((username || ' ' || first_name || ' ' || last_name) gist_trgm_ops);
This assumes all 3 columns to be defined NOT NULL
(you did not specify). Else you need to do more.
Why not simplify with concat_ws()
?
Use a proper nearest-neighbor query, matching above index:
SELECT username, email, first_name, last_name
, similarity(username , $1) AS s_username
, similarity(first_name, $1) AS s_first_name
, similarity(last_name , $1) AS s_last_name
, row_number() OVER () AS rank -- greatest similarity first
FROM auth_user
WHERE (username || ' ' || first_name || ' ' || last_name) % $1 -- !!
ORDER BY (username || ' ' || first_name || ' ' || last_name) <-> $1 -- !!
LIMIT $2;
Expressions in WHERE
and ORDER BY
must match index expression!
In particular ORDER BY rank
(like you had it) will always perform poorly for a small LIMIT
picking from a much larger pool of qualifying rows, because it cannot use an index directly: The sophisticated expression behind rank
has to be calculated for every qualifying row, then all have to be sorted before the small selection of best matches can be returned. This is much, much more expensive than a true nearest-neighbor query that can pick the best results from the index directly without even looking at the rest.
row_number()
with empty window definition just reflects the ordering produced by the ORDER BY
of the same SELECT
.
Related answers:
As for your item 3.
, I added an answer to the question you referenced, that should explain it: