I run parallel like this, abstracting out some of the details:
generate_job_list | parallel -j10 -q bash -c 'echo -n "running {}" ; dostuff {}'
I've noticed that sometimes the child processes that parallel spawns die after receiving SIGKILL (I know because dostuff is a psql command that runs a vacuum, and the Postgres logs tell me the command received SIGKILL). I don't have a timeout set, so it's not clear to me what could be sending that signal. This happens after the child process has been running for hours.
Does parallel have a default timeout (the docs don't seem to suggest it does)? If not, any other ideas on what could be causing this?
ETA: I'm adding some of the details that helped me track this down to the body of the question, because they might help others with the same problem find it.
In your Postgres logs you should find some messages like this:
LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
FATAL: the database system is shutting down
that will have been generated despite you not asking Postgres to shut down.
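A quick way to check whether the kernel's OOM killer was involved is to search the kernel log for its messages. This is only a rough sketch, assuming a Linux host: find_oom_events is a made-up helper name, and the patterns are the usual OOM-killer phrases, which may vary between kernel versions.

```shell
# Filter kernel log lines on stdin for typical OOM-killer messages.
# find_oom_events is a hypothetical helper name, not part of any tool.
# The patterns are common kernel phrases; exact wording varies by kernel version.
find_oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Usage on a Linux host (reading the kernel log may require root):
#   dmesg -T | find_oom_events
#   journalctl -k | find_oom_events
```

If this prints lines naming postgres or psql around the time the job died, the OOM killer is a likely culprit.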
So as mentioned in comments, the problem was the OOM killer. I fixed it by doing a couple things: