I've been executing the same query multiple times over the same database connection. After a few iterations, query performance drops substantially. This is consistently reproducible across operating systems (Arch Linux, Ubuntu, and the official Docker image) and across PostgreSQL versions (15 and 16).
I've also noticed that opening a new connection resolves the issue, but the slowdown comes back after running the query a few times. The dataset doesn't change between runs. I've tried tweaking shared_buffers and work_mem, but it didn't change anything.
Here is the query:
select a.id
from answer a
inner join question q on q.id = a.question_id
inner join response r on r.id = a.response_id
inner join campaign c on c.id = r.campaign_id
where q.project_indicator_id = any($1)
and (c.start_at between $2 and $3 or c.end_at between $2 and $3)
and (c.tag_id = any($4) or cardinality($4) = 0)
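The query is sent as a prepared statement with bound parameters ($1 to $4). For anyone trying to reproduce, this is a minimal equivalent in psql (the statement name stmt is mine, and the parameter types are guesses inferred from the plans below):

prepare stmt (text[], date, date, text[]) as
select a.id
from answer a
inner join question q on q.id = a.question_id
inner join response r on r.id = a.response_id
inner join campaign c on c.id = r.campaign_id
where q.project_indicator_id = any($1)
and (c.start_at between $2 and $3 or c.end_at between $2 and $3)
and (c.tag_id = any($4) or cardinality($4) = 0);

-- fast for the first few executions, then slow when repeated:
execute stmt('{a577b5d0-844d-4207-91fe-87855900f63e}', '2018-01-01', '2025-06-01', '{}');

Re-running the execute line a few times on the same connection is enough to trigger the slowdown.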
After installing auto_explain, I observed that the query plan varies between runs, even though the query itself remains unchanged.
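For reference, roughly the auto_explain configuration used to capture the plans below (the zero threshold is an assumption; any value low enough to catch this query works):

load 'auto_explain';
set auto_explain.log_min_duration = 0; -- log the plan of every statement
set auto_explain.log_analyze = on;     -- include actual times and row counts
set auto_explain.log_buffers = on;     -- include the Buffers lines
set auto_explain.log_verbose = on;     -- include the Output lines
set auto_explain.log_settings = on;    -- include the Settings line at the end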
When the query is fast (https://www.pgexplain.dev/plan/d8d5ffa4-6544-4e16-8b9b-f159582323a0#plan):
Query Parameters: $1 = '{a577b5d0-844d-4207-91fe-87855900f63e}', $2 = '2018-01-01', $3 = '2025-06-01', $4 = '{}'
Nested Loop  (cost=6.16..412.52 rows=39 width=37) (actual time=0.131..0.198 rows=2 loops=1)
  Output: a.id
  Inner Unique: true
  Buffers: shared hit=22
  ->  Nested Loop  (cost=5.88..400.46 rows=39 width=74) (actual time=0.110..0.164 rows=2 loops=1)
        Output: a.id, r.campaign_id
        Inner Unique: true
        Buffers: shared hit=16
        ->  Nested Loop  (cost=5.47..383.34 rows=39 width=74) (actual time=0.087..0.097 rows=2 loops=1)
              Output: a.id, a.response_id
              Buffers: shared hit=8
              ->  Index Scan using idx_b6f7494e6ea56129 on public.question q  (cost=0.29..8.30 rows=1 width=37) (actual time=0.035..0.038 rows=1 loops=1)
                    Output: q.id, q.survey_id, q.project_indicator_id, q.required, q.created_at, q."position", q.disabled_at, q.condition_question_id
                    Index Cond: (q.project_indicator_id = ANY ('{a577b5d0-844d-4207-91fe-87855900f63e}'::text[]))
                    Buffers: shared hit=3
              ->  Bitmap Heap Scan on public.answer a  (cost=5.18..374.06 rows=98 width=111) (actual time=0.040..0.046 rows=2 loops=1)
                    Output: a.id, a.response_id, a.question_id, a.type, a.scale, a.number, a.percent, a.text, a.choice_id
                    Recheck Cond: (a.question_id = q.id)
                    Heap Blocks: exact=2
                    Buffers: shared hit=5
                    ->  Bitmap Index Scan on idx_dadd4a251e27f6bf  (cost=0.00..5.16 rows=98 width=0) (actual time=0.027..0.027 rows=2 loops=1)
                          Index Cond: (a.question_id = q.id)
                          Buffers: shared hit=3
        ->  Index Scan using response_pkey on public.response r  (cost=0.41..0.44 rows=1 width=74) (actual time=0.030..0.030 rows=1 loops=2)
              Output: r.id, r.campaign_id, r.created_at, r.code
              Index Cond: (r.id = a.response_id)
              Buffers: shared hit=8
  ->  Index Scan using data_collect_pkey on public.campaign c  (cost=0.28..0.31 rows=1 width=37) (actual time=0.013..0.013 rows=1 loops=2)
        Output: c.id, c.survey_id, c.start_at, c.title, c.created_at, c.end_at, c.population_size, c.deactivated_at, c.tag_id
        Index Cond: (c.id = r.campaign_id)
        Filter: (((c.start_at >= '2018-01-01'::date) AND (c.start_at <= '2025-06-01'::date)) OR ((c.end_at >= '2018-01-01'::date) AND (c.end_at <= '2025-06-01'::date)))
        Buffers: shared hit=6
Settings: work_mem = '512MB', temp_buffers = '512MB', max_parallel_workers = '16', max_parallel_workers_per_gather = '4', parallel_leader_participation = 'off', geqo_effort = '7'
And when the query is slow (https://www.pgexplain.dev/plan/dfac4e20-949a-4760-bd5d-17d7e3c8a7f8#plan):
Query Parameters: $1 = '{a577b5d0-844d-4207-91fe-87855900f63e}', $2 = '2018-01-01', $3 = '2025-06-01', $4 = '{}'
Hash Join  (cost=88.33..384.05 rows=1 width=37) (actual time=965.921..2169.968 rows=2 loops=1)
  Output: a.id
  Inner Unique: true
  Hash Cond: (a.question_id = q.id)
  Buffers: shared hit=229624 read=15859
  ->  Nested Loop  (cost=0.71..295.47 rows=366 width=74) (actual time=16.525..2074.728 rows=616974 loops=1)
        Output: a.id, a.question_id
        Buffers: shared hit=229621 read=15859
        ->  Nested Loop  (cost=0.29..231.75 rows=27 width=37) (actual time=11.336..108.081 rows=46372 loops=1)
              Output: r.id
              Buffers: shared hit=9162 read=744
              ->  Seq Scan on public.campaign c  (cost=0.00..97.49 rows=1 width=37) (actual time=3.685..6.152 rows=1687 loops=1)
                    Output: c.id, c.survey_id, c.start_at, c.title, c.created_at, c.end_at, c.population_size, c.deactivated_at, c.tag_id
                    Filter: ((((c.start_at >= $2) AND (c.start_at <= $3)) OR ((c.end_at >= $2) AND (c.end_at <= $3))) AND ((c.tag_id = ANY ($4)) OR (cardinality($4) = 0)))
                    Rows Removed by Filter: 6
                    Buffers: shared hit=1 read=33
              ->  Index Scan using idx_3e7b0bfbf639f774 on public.response r  (cost=0.29..133.79 rows=47 width=74) (actual time=0.016..0.056 rows=27 loops=1687)
                    Output: r.id, r.campaign_id, r.created_at, r.code
                    Index Cond: (r.campaign_id = c.id)
                    Buffers: shared hit=9161 read=711
        ->  Index Scan using idx_dadd4a25fbf32840 on public.answer a  (cost=0.42..2.13 rows=23 width=111) (actual time=0.022..0.040 rows=13 loops=46372)
              Output: a.id, a.response_id, a.question_id, a.type, a.scale, a.number, a.percent, a.text, a.choice_id
              Index Cond: (a.response_id = r.id)
              Buffers: shared hit=220459 read=15115
  ->  Hash  (cost=87.46..87.46 rows=13 width=37) (actual time=0.037..0.039 rows=1 loops=1)
        Output: q.id
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        Buffers: shared hit=3
        ->  Bitmap Heap Scan on public.question q  (cost=42.93..87.46 rows=13 width=37) (actual time=0.032..0.033 rows=1 loops=1)
              Output: q.id
              Recheck Cond: (q.project_indicator_id = ANY ($1))
              Heap Blocks: exact=1
              Buffers: shared hit=3
              ->  Bitmap Index Scan on idx_b6f7494e6ea56129  (cost=0.00..42.93 rows=13 width=0) (actual time=0.025..0.025 rows=1 loops=1)
                    Index Cond: (q.project_indicator_id = ANY ($1))
                    Buffers: shared hit=2
Settings: work_mem = '512MB', temp_buffers = '512MB', max_parallel_workers = '16', max_parallel_workers_per_gather = '4', parallel_leader_participation = 'off', geqo_effort = '7'
After running the query a few times (five, by default), PostgreSQL decides that the generic plan is just as good as the custom plan, so it switches to the generic plan. Apparently it is wrong about that. You can see the switch in the plans above: the fast plan has the parameter values folded in as literals (a custom plan), while the slow plan still references $1 through $4 in its filter conditions (a generic plan). You can force PostgreSQL to always use the custom plan by setting plan_cache_mode.
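For example, at the session level (plan_cache_mode exists since PostgreSQL 12; it can also be set per role or in postgresql.conf):

set plan_cache_mode = force_custom_plan;

-- Optional sanity check on PostgreSQL 14+: pg_prepared_statements reports
-- how often each kind of plan was chosen for the current session.
select name, generic_plans, custom_plans from pg_prepared_statements;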