PostgreSQL v12. Database cluster size: 12+ TB (30+ databases, 100-600 GB each). Server: 40 cores, 128 GB RAM.
For some reason, autovacuum stopped running. How can I figure out why? Yesterday the workers were running every minute for each database (in line with the settings below), but then they stopped and have not run since.
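For context, autovacuum processes can be observed directly: since PostgreSQL 10, pg_stat_activity also lists background processes, so a query along these lines (a minimal sketch using only standard views) shows whether the launcher and any workers are alive at all:

SELECT pid, datname, backend_start, query
FROM pg_stat_activity
WHERE backend_type IN ('autovacuum launcher', 'autovacuum worker');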
Settings:
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.10
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 1000
For example, here is a table in one of the databases:
relname: ods_storage_sync_queue
n_live_tup: 8
n_dead_tup: 180542
autovacuum_count: 74167
last_autovacuum: 2023-09-26 18:48:11.868
last_autoanalyze: 2023-09-26 18:48:11.869
The table has accumulated a huge number of dead tuples, but autovacuum does not clean them up.
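These per-table figures come from the cumulative statistics views; a query along these lines against pg_stat_user_tables reproduces them:

SELECT relname, n_live_tup, n_dead_tup,
       autovacuum_count, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;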
Another example:
relname: ods_storage_sync_queue
n_live_tup: 11434
n_dead_tup: 2064
autovacuum_count: 16727
last_autovacuum: 2023-09-26 18:13:21.612
last_autoanalyze: 2023-09-26 18:47:11.530
That is about 18% dead tuples. With autovacuum_vacuum_scale_factor = 0.10 (10%), autovacuum should have kicked in long ago, but a day has passed and nothing happens, although yesterday it ran once a minute, as set by autovacuum_naptime.
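For reference, the documented trigger condition is n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. With the settings above that works out to 50 + 0.10 * 11434 ≈ 1193, and 2064 dead tuples is well past it. A sketch that computes the threshold per table (assuming the global settings apply, i.e. no per-table overrides):

SELECT s.relname,
       s.n_dead_tup,
       -- autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
       50 + 0.10 * c.reltuples AS vacuum_threshold,
       s.n_dead_tup > 50 + 0.10 * c.reltuples AS should_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC;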
What else can I check? How can I track down the cause?
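One thing worth ruling out is a per-table storage parameter that disables or throttles autovacuum; such overrides live in pg_class.reloptions, so a query like this (illustrative) surfaces them:

SELECT relname, reloptions
FROM pg_class
WHERE reloptions::text LIKE '%autovacuum%';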
The described problem was solved by upgrading PostgreSQL from 12.1 to 12.16. We found in the release notes that somewhere between 12.1 and 12.16 a bug was fixed that could cause autovacuum to get stuck.
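For anyone hitting the same thing: it is worth confirming which minor version is actually running before chasing other causes, e.g.:

SHOW server_version;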