postgresqlarchivingpgbackrest

Solutions for restarting Postgres when pg_wal full


This question is a follow up on a previous question I posted about efficient ways to max up Postgres pg_wal directory (for training purposes) : How to bloat pg_wal?.

I'm now wondering what are the possible solutions for getting a Postgres server up and running once the partition is full because pg_wal has filled up.

I'm interesting in solutions that do not involve adding extra disk space to the partition.

Here are a the 2 other solutions I've come accross while discussing with colleagues :

  1. Move the entire pg_wal directory to another partition with sufficient available disk space and point to this new location from $PGDATA (ok this is kind of an additional disk space solution)
  2. Regain disk space by removing all WALs that have already been archived (a file with the same name and suffixed by .done should be present in pg_wal/archive_status)

Using pgBackRest I ran the archive-push command manually and then removed the WALs from pg_wal directory but I got the following error when starting Postgres :

2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=LOG:  invalid primary checkpoint record
2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=PANIC:  could not locate a valid checkpoint record

Obviously I've removed too many files, but I'm wondering if a clean solution based on the same idea could be used.

Question

Is there a way to clean up pg_wal without having to restart Postgres once it has shutdown because pg_wal has filled up ?


Solution

  • I'm interesting in solutions that do not involve adding extra disk space to the partition.

    Hmm, then you are excluding the proper solution, which is exactly to increase the disk space on the WAL file system.

    Your first solution (move pg_wal and put a symbolic link into the data directory) is entirely feasible. But as you say, that requires additional disk space, so why not extend the actual WAL file system instead?

    Your second idea is not commendable. True, WAL segments that are marked as .done can be removed, but PostgreSQL will do that automatically at the next checkpoint anyway, so there should not be many of those lying around at any given time. And as you noticed, manually messing with the data directory is not a good idea; the danger of breaking your database is just too high.

    Hands off from pg_resetwal. This executable, when run on a crashed data directory, will remove WAL and put PostgreSQL into a state where you can start it, but it will cause data corruption. pg_resetwal is intended as a desperate measure to get a corrupted server to start so you can salvage some of the data.