postgresql kubernetes

Invalid resource manager ID in primary checkpoint record


I've updated my Airbyte image from 0.35.2-alpha to 0.35.37-alpha (running in Kubernetes).

During the rollout the db pod wouldn't terminate, so I deleted the pod (a terrible mistake). When it came back up, I got this error:

PostgreSQL Database directory appears to contain a database; Skipping initialization

2022-02-24 20:19:44.065 UTC [1] LOG:  starting PostgreSQL 13.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-24 20:19:44.071 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-24 20:19:44.079 UTC [21] LOG:  database system was shut down at 2022-02-24 20:12:55 UTC
2022-02-24 20:19:44.079 UTC [21] LOG:  invalid resource manager ID in primary checkpoint record
2022-02-24 20:19:44.079 UTC [21] PANIC:  could not locate a valid checkpoint record
2022-02-24 20:19:44.530 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2022-02-24 20:19:44.530 UTC [1] LOG:  aborting startup due to startup process failure
2022-02-24 20:19:44.566 UTC [1] LOG:  database system is shut down

Pretty sure the WAL file is corrupted, but I'm not sure how to fix this.


Solution

  • Warning - this can lose data: pg_resetwal discards the write-ahead log, so recent transactions may be lost

    This is a test system, so I wasn't concerned with keeping the latest transactions, and had no backup.

    First, I overrode the container command so that the container keeps running without trying to start Postgres.

    ...
        spec:
          containers:
            - name: airbyte-db-container
              image: airbyte/db
              command: ["sh"]
              args: ["-c", "while true; do echo $(date -u) >> /tmp/run.log; sleep 5; done"]
    ...
    

    Then I opened a shell in the pod:

    kubectl exec -it -n airbyte airbyte-db-xxxx -- sh
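
    Since I had no backup, copying the data directory before touching it would have been a cheap safety net. A minimal sketch, assuming the volume has enough free space for a second copy and that Postgres is not running (it isn't, with the overridden command). The helper name backup_pgdata is hypothetical; the path in the usage comment is the one used in the commands in this answer.

    ```shell
    # Hypothetical helper: copy a PostgreSQL data directory to a timestamped
    # sibling directory so the original files can be put back if the reset
    # makes things worse. Prints the path of the copy on success.
    backup_pgdata() {
        src="$1"
        dest="${src%/}.bak-$(date -u +%Y%m%dT%H%M%SZ)"
        # -a preserves ownership, permissions, and timestamps
        cp -a "$src" "$dest" && echo "$dest"
    }

    # Usage inside the pod:
    # backup_pgdata /var/lib/postgresql/data/pgdata
    ```

    The copy lands next to the original, so it sits on the same volume and survives a pod restart.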
    

    Then I ran pg_resetwal:

    # dry-run first
    pg_resetwal --dry-run /var/lib/postgresql/data/pgdata
    

    The dry run looked fine, so I ran it for real. Success!

    pg_resetwal /var/lib/postgresql/data/pgdata
    Write-ahead log reset
    

    Then I removed the temporary command override from the container spec, and Postgres started up correctly!
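
    One follow-up I'd consider on a system that matters: the PostgreSQL documentation for pg_resetwal notes that the database may contain inconsistent data after a reset, and recommends dumping the data, running initdb, and restoring. A minimal sketch of the dump step only. The helper name dump_all, the superuser name, and the output path are placeholders, not values from the Airbyte setup.

    ```shell
    # Hypothetical helper: dump every database in the cluster to one SQL file.
    # $1 = PostgreSQL superuser name, $2 = output file (both supplied by you).
    dump_all() {
        pg_dumpall -U "$1" -f "$2"
    }

    # Usage, with placeholder values:
    # dump_all <superuser> /tmp/all-databases.sql
    ```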