postgresqlamazon-auroraautovacuum

Long Aurora autovacuum on postgres system tables


We have had an incredibly long running autovacuum process running on one of our smaller database machines that we believe has been using a lot of Aurora:StorageIOUsage: enter image description here We determined this by running SELECT * FROM pg_stat_activity WHERE wait_event_type = 'IO'; and seeing the below results repeatedly.

 datid  |  datname |  pid  | usesysid |  usename  | application_name |  client_addr   | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |  wait_event  | state  | backend_xid | backend_xmin |                                                                                query                                                                                 |   backend_type    
--------+----------------------------+-------+----------+-----------+------------------+----------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+--------------+--------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------
 398954 | postgres | 17582 |          |           |                  |                |                 |             | 2022-09-29 18:45:55.364654+00 | 2022-09-29 18:46:20.253107+00 | 2022-09-29 18:46:20.253107+00 | 2022-09-29 18:46:20.253108+00 | IO              | DataFileRead | active |             |     66020718 | autovacuum: VACUUM pg_catalog.pg_depend                                                                                                                              | autovacuum worker
 398954 | postgres | 17846 |          |           |                  |                |                 |             | 2022-09-29 18:46:04.092536+00 | 2022-09-29 18:46:29.196309+00 | 2022-09-29 18:46:29.196309+00 | 2022-09-29 18:46:29.19631+00  | IO              | DataFileRead | active |             |     66020732 | autovacuum: VACUUM pg_toast.pg_toast_2618                                                                                                                            | autovacuum worker

As you can see from the screenshot it has been going for well over a month, and is mainly for the pg_depend, pg_attribute, and pg_toast_2618 tables which are not all that large. I haven't been able to find any reason why these tables would need so much vacuuming other than maybe a database restore from our production environment (this is one of our lower environments). Here are the pg_stat_sys_tables entries for those tables and the pg_rewrite which is the table that pg_toast_2618 is associated with:

 relid | schemaname |    relname    | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum |        last_autovacuum        | last_analyze |       last_autoanalyze        | vacuum_count | autovacuum_count | analyze_count | autoanalyze_count 
-------+------------+---------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------+------------+------------+---------------------+-------------+-------------------------------+--------------+-------------------------------+--------------+------------------+---------------+-------------------
  1249 | pg_catalog | pg_attribute  |   185251 |     12594432 | 31892996 |     119366792 |   1102817 |      3792 |   1065737 |          1281 |     543392 |    1069529 |               23584 |             | 2022-09-29 18:53:25.227334+00 |              | 2022-09-28 01:12:47.628499+00 |            0 |          1266763 |             0 |                36
  2608 | pg_catalog | pg_depend     |     2429 |    369003445 | 14152628 |      23494712 |   7226948 |         0 |   7176855 |             0 |     476267 |    7176855 |                   0 |             | 2022-09-29 18:52:34.523257+00 |              | 2022-09-28 02:02:52.232822+00 |            0 |           950137 |             0 |                71
  2618 | pg_catalog | pg_rewrite    |       25 |       155083 |  1785288 |       1569100 |     64127 |    314543 |     62472 |         59970 |       7086 |     377015 |               13869 |             | 2022-09-29 18:53:11.288732+00 |              | 2022-09-23 18:54:50.771969+00 |            0 |          1280018 |             0 |                81
  2838 | pg_toast   | pg_toast_2618 |        0 |            0 |  1413436 |       3954640 |    828571 |         0 |    825143 |             0 |      15528 |     825143 |             1653714 |             | 2022-09-29 18:52:47.242386+00 |              |                               |            0 |           608881 |             0 |                 0

I'm pretty new to Postgres and I'm wondering what could possibly cause this level of records to need to be cleaned up, and why it would take well over a month to accomplish considering we always have autovacuum set to TRUE. We are running Postgres version 10.17 on a single db.t3.medium, and the only thing I can think of at this point is to try increasing the instance size. Do we simply need to increase our database instance size on our aurora cluster so that this can be done more in memory? I'm at a bit of a loss for how to reduce this huge sustained spike in Storage IO costs.

Additional information for our autovaccum settings:

=> SELECT * FROM pg_catalog.pg_settings WHERE name LIKE '%autovacuum%';
                name                 |  setting  | unit |              category               |                                        short_desc                                         |                                                             extra_desc                                                             |  context   | vartype |       source       | min_val |  max_val   |                                        enumvals                                         | boot_val  | reset_val |            sourcefile             | sourceline | pending_restart 
-------------------------------------+-----------+------+-------------------------------------+-------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------+---------+--------------------+---------+------------+-----------------------------------------------------------------------------------------+-----------+-----------+-----------------------------------+------------+-----------------
 autovacuum                          | on        |      | Autovacuum                          | Starts the autovacuum subprocess.                                                         |                                                                                                                                    | sighup     | bool    | configuration file |         |            |                                                                                         | on        | on        | /rdsdbdata/config/postgresql.conf |         78 | f
 autovacuum_analyze_scale_factor     | 0.05      |      | Autovacuum                          | Number of tuple inserts, updates, or deletes prior to analyze as a fraction of reltuples. |                                                                                                                                    | sighup     | real    | configuration file | 0       | 100        |                                                                                         | 0.1       | 0.05      | /rdsdbdata/config/postgresql.conf |         55 | f
 autovacuum_analyze_threshold        | 50        |      | Autovacuum                          | Minimum number of tuple inserts, updates, or deletes prior to analyze.                    |                                                                                                                                    | sighup     | integer | default            | 0       | 2147483647 |                                                                                         | 50        | 50        |                                   |            | f
 autovacuum_freeze_max_age           | 200000000 |      | Autovacuum                          | Age at which to autovacuum a table to prevent transaction ID wraparound.                  |                                                                                                                                    | postmaster | integer | default            | 100000  | 2000000000 |                                                                                         | 200000000 | 200000000 |                                   |            | f
 autovacuum_max_workers              | 3         |      | Autovacuum                          | Sets the maximum number of simultaneously running autovacuum worker processes.            |                                                                                                                                    | postmaster | integer | configuration file | 1       | 262143     |                                                                                         | 3         | 3         | /rdsdbdata/config/postgresql.conf |         45 | f
 autovacuum_multixact_freeze_max_age | 400000000 |      | Autovacuum                          | Multixact age at which to autovacuum a table to prevent multixact wraparound.             |                                                                                                                                    | postmaster | integer | default            | 10000   | 2000000000 |                                                                                         | 400000000 | 400000000 |                                   |            | f
 autovacuum_naptime                  | 5         | s    | Autovacuum                          | Time to sleep between autovacuum runs.                                                    |                                                                                                                                    | sighup     | integer | configuration file | 1       | 2147483    |                                                                                         | 60        | 5         | /rdsdbdata/config/postgresql.conf |          9 | f
 autovacuum_vacuum_cost_delay        | 5         | ms   | Autovacuum                          | Vacuum cost delay in milliseconds, for autovacuum.                                        |                                                                                                                                    | sighup     | integer | configuration file | -1      | 100        |                                                                                         | 20        | 5         | /rdsdbdata/config/postgresql.conf |         73 | f
 autovacuum_vacuum_cost_limit        | -1        |      | Autovacuum                          | Vacuum cost amount available before napping, for autovacuum.                              |                                                                                                                                    | sighup     | integer | default            | -1      | 10000      |                                                                                         | -1        | -1        |                                   |            | f
 autovacuum_vacuum_scale_factor      | 0.1       |      | Autovacuum                          | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.            |                                                                                                                                    | sighup     | real    | configuration file | 0       | 100        |                                                                                         | 0.2       | 0.1       | /rdsdbdata/config/postgresql.conf |         22 | f
 autovacuum_vacuum_threshold         | 50        |      | Autovacuum                          | Minimum number of tuple updates or deletes prior to vacuum.                               |                                                                                                                                    | sighup     | integer | default            | 0       | 2147483647 |                                                                                         | 50        | 50        |                                   |            | f
 autovacuum_work_mem                 | -1        | kB   | Resource Usage / Memory             | Sets the maximum memory to be used by each autovacuum worker process.                     |                                                                                                                                    | sighup     | integer | default            | -1      | 2147483647 |                                                                                         | -1        | -1        |                                   |            | f
 log_autovacuum_min_duration         | -1        | ms   | Reporting and Logging / What to Log | Sets the minimum execution time above which autovacuum actions will be logged.            | Zero prints all actions. -1 turns autovacuum logging off.                                                                          | sighup     | integer | default            | -1      | 2147483647 |                                                                                         | -1        | -1        |                                   |            | f
 rds.force_autovacuum_logging_level  | disabled  |      | Customized Options                  | Emit autovacuum log messages irrespective of other logging configuration.                 | Each level includes all the levels that follow it.Set to disabled to disable this feature and fall back to using log_min_messages. | sighup     | enum    | default            |         |            | {debug5,debug4,debug3,debug2,debug1,info,notice,warning,error,log,fatal,panic,disabled} | disabled  | disabled  |                                   |            | f

EDIT: The issue ended up being because we created a replication slot in all of our environments for replicating the data to an external source, but we were only consuming the replication slot in a subset of environments. So after a while this issue would crop up in environments where we weren't consuming the replication the slot, since the system had to hold on to the dead tuples. Once we removed the replication slots on the envs that weren't consuming them, the issue went away.


Solution

  • I would say you have some very long-lived snapshot being held. These tables need to be vacuumed, but the vacuum doesn't accomplish anything because the dead tuples can't be removed as some old snapshot still can see them. So immediately after being vacuumed, they are still eligible to be vacuumed again. So it tries again every 5 seconds (autovacuum_naptime), because autovacuum doesn't have a way to say "Don't bother until this snapshot which blocked me from accomplishing anything last time goes away"

    Check pg_stat_activity for very old 'idle in transaction' and for any prepared transactions.