I have a small hobby website that I want to release on my own server. I chose Kubernetes for this because I partly use it at work and would like to get more familiar with it. I bought a Hetzner Debian server and installed k3s. Now I am deploying a PostgreSQL container (version 15.2, not a cluster) as per this tutorial. (I made a few minor changes I saw in other tutorials; they should not be relevant.)
It ran fine and I was happy with it. But then I restarted the deployment several times to make sure data is not lost if the server goes down for some reason. After a few restarts, the database was corrupted.
Once I saw:
PANIC: invalid magic number 0000 in log segment 000000010000000000000000, offset 0
another time:
invalid contrecord length 1174 (expected 48430224) at 0/195BC90
another time:
PANIC: could not locate a valid checkpoint record
When I tried to google how to recover from this, I did not find any safe options; most suggestions were to restore from a backup.
So my question is, how do I safely restart/shutdown PostgreSQL container? Am I missing some shutdown config for PostgreSQL pod in k8s?
Update 1:
I was restarting the deployment from k9s with the r command. I think the UI made it look like it was rotated right away, but it probably takes some time. So I think I triggered multiple restarts every 10 seconds, and that might have corrupted the DB. Anyway, I added terminationGracePeriodSeconds: 60 and used the preStop hook from the answer. Thanks.
Update 2: I imported the DB again, did a restart, and hit the same issue:
could not locate a valid checkpoint record
Update 3:
I replaced the Deployment with a StatefulSet and it seems to handle restarts better. I tried over 10 restarts with no issues, whereas before it crashed around the 4th restart.
Of course, the best practice is to use an operator like cloudnative-pg or postgres-operator, but they are pretty big and probably have far more features than a simple workload needs. Here is a simple solution for your problem.
Add the following under the container's lifecycle field in your pod spec:
preStop:
  exec:
    # exec does not run a shell, so each argument must be a separate list item
    command: ["/usr/local/bin/pg_ctl", "stop", "-D", "/var/lib/postgresql/data", "-w", "-t", "60", "-m", "fast"]
Basically, when you kill a pod, Kubernetes sends SIGTERM and gives the pod 30 seconds (by default) to exit; after that it sends SIGKILL. When Postgres receives SIGTERM it stops accepting new connections but does not terminate existing ones, so any connected client blocks the shutdown, and after 30 seconds the pod receives SIGKILL, which is very bad for Postgres (doc). So you need to shut Postgres down safely before that happens, and a preStop hook lets you do so.
This is the exact chronological order of events for your pod:
1. The Pod controller sets state=Terminating
2. The terminationGracePeriodSeconds timer starts (default is 30 seconds)
3. The preStop hook runs: pg_ctl ...
4. SIGTERM is sent: Postgres won't accept new connections
5. terminationGracePeriodSeconds (configurable from yaml) runs out
6. SIGKILL is sent
Also, you need to set .spec.strategy.type==Recreate in the Deployment.
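Putting the pieces above together, a Deployment might look roughly like the sketch below. The resource names, labels, Secret, and PersistentVolumeClaim are placeholders that do not come from the question, and the pg_ctl path is copied from the hook above and may differ between images. Note also that pg_ctl refuses to run as root, so depending on which user the container runs as, the hook may need to be executed as the postgres user.

```yaml
# Sketch only: names, labels, Secret and PVC below are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  strategy:
    type: Recreate                      # on rollout, terminate the old pod before creating the new one
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      # The preStop hook and the SIGTERM handling share this budget,
      # so it should be at least as long as the pg_ctl timeout (-t).
      terminationGracePeriodSeconds: 60
      containers:
        - name: postgres
          image: postgres:15.2
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret   # hypothetical Secret
                  key: password
          lifecycle:
            preStop:
              exec:
                # Fast shutdown: disconnect clients and stop cleanly before SIGTERM arrives.
                command: ["/usr/local/bin/pg_ctl", "stop", "-D", "/var/lib/postgresql/data", "-w", "-t", "60", "-m", "fast"]
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-pvc       # hypothetical PVC
```

The grace period here matches the 60 seconds the asker used; picking a value somewhat larger than the pg_ctl timeout leaves headroom for the container to exit after the hook returns.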
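For the pg_ctl commands you can refer to this summary; the most useful one for you looks like -m fast.
SIGTERM: Smart Shutdown mode (pg_ctl -m smart). Disallows new connections and lets existing sessions end normally.
SIGINT: Fast Shutdown mode (pg_ctl -m fast). Disallows new connections and sends SIGTERM to existing server processes, so they exit promptly.
SIGQUIT: Immediate Shutdown mode (pg_ctl -m immediate). Sends SIGQUIT to all child processes; if they don't terminate within 5 seconds, sends SIGKILL.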
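EDIT:
Apparently Recreate only guarantees that the old pod is gone before the new one is created during an update (old ReplicaSet to new ReplicaSet); it does not guarantee one pod at a time if a pod dies at random or is deleted manually. While the new pod is being created, the old one may still be in the Terminating phase, and because of this race condition two Postgres instances can touch the same data directory and corrupt it. Relevant doc: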
This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
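As a rough illustration of the StatefulSet route that the quoted documentation recommends (and that the asker's Update 3 reports working), a minimal manifest could look like the following. The names, the headless Service, the Secret, and the storage size are assumptions made for this sketch, not values from the question.

```yaml
# Sketch only: names, Service, Secret and storage size are hypothetical placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres                 # hypothetical headless Service governing this set
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60  # give Postgres time to shut down cleanly
      containers:
        - name: postgres
          image: postgres:15.2
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret   # hypothetical Secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    # Each replica gets its own claim, named data-postgres-0 for the first pod.
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi                  # placeholder size
```

With a StatefulSet, the replacement for postgres-0 is not created until the old postgres-0 has fully terminated, which is the "at most one" behaviour a single-instance database needs.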