Text searches in a Postgres table with a simple GIN index, ordered by creation timestamp, have wildly different latencies depending on the keyword being searched. While searches for frequent words run in milliseconds, searches for rare words can take up to 10 minutes to execute.
I found a similar issue discussed here, but no solution was offered that works in every case. In the observations/experiments section below we go through some of the solutions we tried; while these improve the situation, we are not confident they eliminate the problem entirely.
The ask here is two-fold: a) feedback on these ideas and what else can be done to remedy the situation; it seems easy to optimize the query for one case, but we haven't found an approach that works for both frequent and rare words. b) a better understanding of how the Postgres query planner decides whether to use the GIN index first or the created_at index; important inputs here seem to be the statistics collected about each search term and the requested page size (LIMIT).
We're also thinking of other architectural changes such as using ElasticSearch for this kind of search, but I'd like this question limited to whether it can be solved using SQL/Postgres.
We run a community site where members can post content which, after being curated by staff, is distributed to other members in their neighborhood, mostly via email. It's a Rails app using Postgres to persist the data. The queries are executed by our staff, and we cannot predict which keywords they will search for. An important constraint is that the results absolutely need to be ordered by creation date and cannot be restricted to a specific creation date range.
PG version: PostgreSQL 12.14 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
Relevant parts of table in question (posts):
frontporchforum=> \d posts
Table "public.posts"
Column | Type | Collation | Nullable | Default
------------------------------+-----------------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('posts_id_seq'::regclass)
user_id | integer | | |
title | character varying(255) | | |
content | text | | |
created_at | timestamp without time zone | | |
...
Indexes:
"posts_pkey" PRIMARY KEY, btree (id)
"index_posts_on_created_at" btree (created_at)
"posts_title_simple_gin" gin (to_tsvector('simple'::regconfig, title::text))
...
The table is pretty small with less than 3 million rows.
Fast query:
frontporchforum=> EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'dog') ORDER BY "posts"."created_at" DESC LIMIT 25 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..2281.59 rows=25 width=718) (actual time=0.252..62.398 rows=25 loops=1)
Buffers: shared hit=825
-> Index Scan Backward using index_posts_on_created_at on posts (cost=0.43..6968199.09 rows=76367 width=718) (actual time=0.250..62.383 rows=25 loops=1)
Filter: (to_tsvector('simple'::regconfig, content) @@ '''dog'''::tsquery)
Rows Removed by Filter: 935
Buffers: shared hit=825
Planning Time: 0.205 ms
Execution Time: 62.435 ms
The reason this is fast is that there are a lot of posts about dogs (lost, found, etc.) and they are fairly evenly distributed over time. PG has pretty up-to-date stats on 'dog' for the GIN-indexed expression (76367 estimated rows), so it is confident it can find 25 hits quickly by scanning the created_at index backwards.
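A rough back-of-the-envelope version of that reasoning, with the ~2.7M row count taken from the table size above (this is our simplification of the planner's uniformity assumption, not its exact math):
-- Under a uniform distribution, a match is expected roughly every
-- 2,700,000 / 76,367 ≈ 35 rows, so about 880 rows need to be scanned
-- backwards to satisfy LIMIT 25. The actual scan read 25 + 935 = 960 rows,
-- so the assumption holds up well for 'dog'.
SELECT ceil(25 * 2700000.0 / 76367) AS expected_rows_scanned;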
So far so good. But what happens when searching for a rare keyword, or a keyword that is not well distributed over time? E.g. 'wombat' is only mentioned 7 times in our database, since we only serve neighborhoods in Vermont at this time and wombats are not exactly a hot topic there. An example of uneven distribution is 'Irene', a tropical storm that caused a lot of destruction in Vermont when it hit, but has hardly been talked about in the last 10 years.
The statistics only cover about 1,000 search terms, and the ones I can see all fall into the frequent 'dog' category:
SELECT stavalues1 from pg_statistic WHERE starelid = (SELECT oid FROM pg_class WHERE relname = 'posts_content_simple_gin');
...
{1,2,3,4,5,6,7,8,9,a,d,e,i,...,dog,...vermont.craigslist.org}
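A more readable way to inspect the same statistics is the pg_stats view; for a tsvector expression the collected lexemes show up as "most common elements" under the index's name (these are standard pg_stats columns, nothing custom):
-- Human-readable form of the statistics gathered for the expression index:
-- the lexemes ANALYZE saw most often and their estimated frequencies.
SELECT most_common_elems, most_common_elem_freqs
FROM pg_stats
WHERE tablename = 'posts_content_simple_gin';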
Slow query (uneven distribution):
EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'irene') ORDER BY "posts"."created_at" DESC LIMIT 25 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..14391.59 rows=25 width=718) (actual time=1416.468..12792.213 rows=25 loops=1)
Buffers: shared hit=39287 read=14389
I/O Timings: read=8175.420
-> Index Scan Backward using index_posts_on_created_at on posts (cost=0.43..6968199.09 rows=12105 width=718) (actual time=1416.466..12792.183 rows=25 loops=1)
Filter: (to_tsvector('simple'::regconfig, content) @@ '''irene'''::tsquery)
Rows Removed by Filter: 53070
Buffers: shared hit=39287 read=14389
I/O Timings: read=8175.420
Planning Time: 0.192 ms
Execution Time: 12792.261 ms
There are only ~3k posts mentioning Irene, almost all from 2011, yet PG estimates 12105 matches. Not knowing anything about the distribution, it is a fair assumption to start with the created_at index and hope to pull 25 records off the heap that match the text search filter and exit early. However, it takes 12 seconds in this case, mostly because so many rows need to be read from the heap and thrown away during filter evaluation.
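A quick diagnostic that makes the skew visible is to group the actual matches by year (this query evaluates the tsquery for every row, so it is itself slow; it is only meant to illustrate the clustering):
-- How the 'irene' matches are really distributed: in our data they sit
-- almost entirely in 2011, so a backwards created_at scan has to wade
-- through roughly ten years of non-matching posts before it finds 25 hits.
SELECT date_trunc('year', created_at) AS year, count(*) AS matches
FROM posts
WHERE to_tsvector('simple', content) @@ websearch_to_tsquery('simple', 'irene')
GROUP BY 1
ORDER BY 1 DESC;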
Slow query (rare keyword):
frontporchforum=> EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'wombat') ORDER BY "posts"."created_at" DESC LIMIT 25 OFFSET 0;
NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..14391.59 rows=25 width=718) (actual time=89154.539..607701.864 rows=7 loops=1)
Buffers: shared hit=1596257 read=821352
I/O Timings: read=359206.736
-> Index Scan Backward using index_posts_on_created_at on posts (cost=0.43..6968199.09 rows=12105 width=718) (actual time=89154.537..607701.851 rows=7 loops=1)
Filter: (to_tsvector('simple'::regconfig, content) @@ '''wombat'''::tsquery)
Rows Removed by Filter: 2667579
Buffers: shared hit=1596257 read=821352
I/O Timings: read=359206.736
Planning Time: 0.438 ms
Execution Time: 607702.931 ms
A solid 10 minutes here. Since there are only 7 matching rows, the scan can never exit early and basically reads every record in the table. The problem seems to be that the planner estimates 12105 matching records and therefore thinks the backward index scan will be faster than using the GIN index first.
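Comparing the estimate with reality is straightforward (the count below is itself slow, for the reasons already described). The identical 12105 estimate for 'irene' and 'wombat' suggests the planner falls back to some default selectivity for lexemes that are absent from the statistics:
-- Estimate vs. reality for a lexeme the statistics know nothing about:
-- the planner guessed 12105 rows, the true count is 7.
SELECT count(*) AS actual_matches
FROM posts
WHERE to_tsvector('simple', content) @@ websearch_to_tsquery('simple', 'wombat');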
Force GIN index
If we duplicate the search term, Postgres does prioritize the GIN index and then re-establishes the sort order with an explicit sort on top of the bitmap heap scan, presumably because the planner multiplies the per-term selectivities, so the estimate for 'wombat & wombat' drops from 12105 to 55 rows and the bitmap plan suddenly looks cheap. This works really well for rare words (a few milliseconds instead of 10 minutes):
frontporchforum=> EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'wombat wombat') ORDER BY "posts"."created_at" DESC LIMIT 25 OFFSET 0;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=285.94..286.00 rows=25 width=718) (actual time=6.469..6.472 rows=7 loops=1)
Buffers: shared hit=10 read=11
I/O Timings: read=5.251
-> Sort (cost=285.94..286.08 rows=55 width=718) (actual time=6.466..6.468 rows=7 loops=1)
Sort Key: created_at DESC
Sort Method: quicksort Memory: 34kB
Buffers: shared hit=10 read=11
I/O Timings: read=5.251
-> Bitmap Heap Scan on posts (cost=52.43..284.39 rows=55 width=718) (actual time=3.860..6.411 rows=7 loops=1)
Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''wombat'' & ''wombat'''::tsquery)
Heap Blocks: exact=7
Buffers: shared hit=7 read=11
I/O Timings: read=5.251
-> Bitmap Index Scan on posts_content_simple_gin (cost=0.00..52.41 rows=55 width=0) (actual time=3.368..3.369 rows=7 loops=1)
Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''wombat'' & ''wombat'''::tsquery)
Buffers: shared hit=5 read=6
I/O Timings: read=2.310
Planning Time: 10.760 ms
Execution Time: 6.547 ms
The downside is that the query becomes very slow for frequent words, since so many records need to be read in the bitmap heap scan before the sort order can be established:
frontporchforum=> EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'dog dog') ORDER BY "posts"."created_at" DESC LIMIT 25 OFFSET 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8787.92..8787.99 rows=25 width=718) (actual time=61820.290..61820.297 rows=25 loops=1)
Buffers: shared hit=26398 read=71113
I/O Timings: read=23958.101
-> Sort (cost=8787.92..8793.38 rows=2184 width=718) (actual time=61820.288..61820.292 rows=25 loops=1)
Sort Key: created_at DESC
Sort Method: top-N heapsort Memory: 59kB
Buffers: shared hit=26398 read=71113
I/O Timings: read=23958.101
-> Bitmap Heap Scan on posts (cost=68.93..8726.29 rows=2184 width=718) (actual time=80.124..61604.023 rows=76544 loops=1)
Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''dog'' & ''dog'''::tsquery)
Rows Removed by Index Recheck: 345396
Heap Blocks: exact=31843 lossy=33722
Buffers: shared hit=26398 read=71113
I/O Timings: read=23958.101
-> Bitmap Index Scan on posts_content_simple_gin (cost=0.00..68.38 rows=2184 width=0) (actual time=73.247..73.247 rows=76544 loops=1)
Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''dog'' & ''dog'''::tsquery)
Buffers: shared hit=85 read=42
I/O Timings: read=32.140
Planning Time: 0.210 ms
Execution Time: 61820.690 ms
Note that this is not improved much by disabling heap scans or by setting random_page_cost to 1 (we're on SSDs).
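For reference, the cost knob can be tried per session before changing anything globally; this sketch only covers the setting explicitly named above:
-- Session-level experiment only: tell the planner that random page reads
-- cost the same as sequential ones, a common assumption for SSDs.
SET random_page_cost = 1;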
Increase page size
We noticed that increasing the page size (the LIMIT) can sway the query planner towards prioritizing the GIN index. E.g. asking for 100 wombats instead of 25 makes the query perform well:
frontporchforum=> EXPLAIN (ANALYZE, BUFFERS, COSTS) SELECT * FROM "posts" WHERE to_tsvector('simple', posts.content) @@ websearch_to_tsquery('simple', 'wombat') ORDER BY "posts"."created_at" DESC LIMIT 100 OFFSET 0;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=42114.35..42126.02 rows=100 width=718) (actual time=19.234..23.760 rows=7 loops=1)
Buffers: shared hit=19 read=9
I/O Timings: read=6.840
-> Gather Merge (cost=42114.35..43291.37 rows=10088 width=718) (actual time=19.233..23.756 rows=7 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=19 read=9
I/O Timings: read=6.840
-> Sort (cost=41114.33..41126.94 rows=5044 width=718) (actual time=2.489..2.490 rows=2 loops=3)
Sort Key: created_at DESC
Sort Method: quicksort Memory: 34kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
Buffers: shared hit=19 read=9
I/O Timings: read=6.840
-> Parallel Bitmap Heap Scan on posts (cost=141.81..40921.55 rows=5044 width=718) (actual time=1.651..2.451 rows=2 loops=3)
Recheck Cond: (to_tsvector('simple'::regconfig, content) @@ '''wombat'''::tsquery)
Heap Blocks: exact=7
Buffers: shared hit=5 read=9
I/O Timings: read=6.840
-> Bitmap Index Scan on posts_content_simple_gin (cost=0.00..138.79 rows=12105 width=0) (actual time=4.488..4.488 rows=7 loops=1)
Index Cond: (to_tsvector('simple'::regconfig, content) @@ '''wombat'''::tsquery)
Buffers: shared hit=4 read=3
I/O Timings: read=4.179
Planning Time: 0.232 ms
Execution Time: 23.810 ms
It seems counterintuitive, but the requested page size (the LIMIT) apparently is a very important input for the query planner when a search result needs to be ordered.
The downsides of this approach are two-fold. We're leaning towards a strategy of loading only the post IDs of 250 records in the first query and then reloading the full records for just the first 25, roughly as sketched below. First, it's a fairly awkward approach, although in testing against a subset of rare and frequent terms it has been the best compromise so far. Second, we are not confident it will work for all search terms; we can imagine, for example, frequent words that produce results quickly under the current planner decisions but become slow with the proposed higher LIMIT. We are not sure how to determine this, and we would generally prefer a cleaner solution, hence this question.
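A minimal sketch of that two-step shape, assuming the application feeds the IDs from the first query back into the second (the LIMIT of 250 is the value from our testing; the id list in step 2 is a placeholder):
-- Step 1: over-fetch only the IDs with a larger LIMIT, which in our tests
-- makes the planner prefer the GIN index even for rare terms.
SELECT id
FROM posts
WHERE to_tsvector('simple', content) @@ websearch_to_tsquery('simple', 'wombat')
ORDER BY created_at DESC
LIMIT 250;
-- Step 2: reload the full rows for just the first 25 IDs from step 1
-- (the values below are placeholders filled in by the application).
SELECT *
FROM posts
WHERE id IN (101, 102, 103)
ORDER BY created_at DESC;
Whether the planner actually switches to the GIN plan for a given term still depends on its estimates, which is exactly the uncertainty described above.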
Any ideas, suggestions for a better solution and/or explanations/insights about the query planner are much appreciated!
You can instruct Postgres to collect larger statistics, so that the frequencies of more words are known. The frequency of "uncommon" words (i.e. those not in the list of most common values) will then also be estimated better.
The doc says
The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable. The default limit is presently 100 entries. Raising the limit might allow more accurate planner estimates to be made, particularly for columns with irregular data distributions, at the price of consuming more space in pg_statistic and slightly more time to compute the estimates. Conversely, a lower limit might be sufficient for columns with simple data distributions.
But because you are not using a column but rather an expression, you set the statistics target at the index level; the expression column of the index cannot be referenced by name, so it is addressed by its ordinal position.
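A sketch of what that looks like for the content index above; 5000 is only an illustrative target (the default is 100, the maximum 10000), so tune it to your data:
-- Raise the statistics target for the first (and only) column of the
-- expression index, then re-analyze so the larger list of common lexemes
-- is actually collected.
ALTER INDEX posts_content_simple_gin ALTER COLUMN 1 SET STATISTICS 5000;
ANALYZE posts;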