crash elixir phoenix-framework fault-tolerance erlang-supervisor

Killing a supervised process in Phoenix Framework causes the entire application to shut down

I have a Phoenix application which creates the following supervision tree (taken from the Erlang Observer):

The restart strategy of the supervisor is :one_for_one. The expectation is that if I kill any of the supervised processes, that individual process will be restarted immediately by the supervisor. This is only the case with the Telemetry process. If any of the highlighted processes is killed (using either the Observer or Process.exit in IEx), the following occurs:

And here's the updated application tree afterwards. It seems the entire Phoenix application has crashed, and API requests can no longer reach the server:

Any ideas as to why this is the case? How can I implement the expected behaviour?

Solution

This is more of an issue of using Process.exit(pid, :kill) on a supervisor. :kill is a last-resort signal that causes the supervisor to terminate immediately, without telling its children to terminate properly. The supervisor exits first, and only then do the children notice that their parent is dead, clean up, and terminate.

Meanwhile, the application is restarting part of the tree at the same time, which ends up conflicting with the old one still shutting down, causing another failure. This eventually triggers max_restarts and the application shuts down.
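To make the max_restarts part concrete, here is a minimal sketch of a supervisor giving up once its restart budget is exhausted. It does not reproduce the name conflict described above; it only shows the last step. The module name RestartLimitDemo, the Agent worker, and the deliberately low limit of 1 restart in 5 seconds are all assumptions chosen for illustration (the defaults are 3 restarts in 5 seconds).

```elixir
defmodule RestartLimitDemo do
  # Hypothetical demo module; nothing here comes from the original application.

  def run do
    children = [%{id: :worker, start: {Agent, :start_link, [fn -> :ok end]}}]

    # Allow at most 1 restart within 5 seconds (illustrative values).
    {:ok, sup} =
      Supervisor.start_link(children,
        strategy: :one_for_one,
        max_restarts: 1,
        max_seconds: 5
      )

    # Unlink so the supervisor's exit does not take the calling process down too.
    Process.unlink(sup)
    ref = Process.monitor(sup)

    first = current_worker(sup)
    Process.exit(first, :kill)
    second = await_restart(sup, first)

    # A second crash inside the same window exceeds max_restarts.
    Process.exit(second, :kill)

    receive do
      {:DOWN, ^ref, :process, ^sup, reason} -> {:supervisor_down, reason}
    after
      1_000 -> :still_running
    end
  end

  defp current_worker(sup) do
    [{:worker, pid, _type, _modules}] = Supervisor.which_children(sup)
    pid
  end

  # Poll until the supervisor reports a replacement pid for the worker.
  defp await_restart(sup, old_pid) do
    case current_worker(sup) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(sup, old_pid)
    end
  end
end
```

Calling RestartLimitDemo.run() should return {:supervisor_down, :shutdown}: the first kill is absorbed by a restart, the second one makes the supervisor itself exit. In a real application that exit propagates upwards, which is how a repeated failure deep in the tree can end up stopping the whole application.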

Overall:

Use :kill only as a last resort, when the process has not responded to any other exit signal. This especially applies to supervisors, as their only job is to ensure processes start and terminate properly, and sending a :kill bypasses that.
If you want to simulate stopping a supervisor, Supervisor.stop will at least go through the usual shutdown flow.
Run iex --logger-sasl-reports true -S mix phx.server to get precise logging from supervisors (which is what I used to debug this).
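The difference between the two ways of stopping a supervisor can be sketched roughly as follows. This is only an illustration under assumptions: KillVsStop and its functions are hypothetical names, and a single Agent stands in for the real children.

```elixir
defmodule KillVsStop do
  # Hypothetical helper: starts a small tree that is not linked to the caller.
  def start_tree do
    children = [%{id: :worker, start: {Agent, :start_link, [fn -> 0 end]}}]
    {:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

    # Unlink so that killing the supervisor does not kill the calling process.
    Process.unlink(sup)

    [{:worker, worker, _type, _modules}] = Supervisor.which_children(sup)
    {sup, worker}
  end

  # Supervisor.stop/1 runs the normal shutdown: children are terminated first,
  # and the call only returns once the supervisor itself is gone.
  def graceful do
    {sup, worker} = start_tree()
    :ok = Supervisor.stop(sup)
    {Process.alive?(sup), Process.alive?(worker)}
  end

  # :kill cannot be trapped, so the supervisor dies without shutting anything down.
  # The child only dies afterwards, because it is linked to the dead supervisor.
  def brutal do
    {sup, worker} = start_tree()
    sup_ref = Process.monitor(sup)
    worker_ref = Process.monitor(worker)

    Process.exit(sup, :kill)

    {await_down(sup_ref), await_down(worker_ref)}
  end

  defp await_down(ref) do
    receive do
      {:DOWN, ^ref, :process, _pid, reason} -> reason
    after
      1_000 -> :timeout
    end
  end
end
```

KillVsStop.graceful() should return {false, false}, since both processes are already down when Supervisor.stop returns. KillVsStop.brutal() should return {:killed, :killed}: the worker was never asked to shut down, it simply received the exit signal from its dead parent.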

One question that may arise from this is: could my supervisor fail in a way that triggers the same behaviour as the :kill signal? Supervisors trap exits, which means no other process (linked or not) can make them crash with an ordinary exit signal. Therefore, this can only happen if there is a bug in the supervisor itself. And if there is a bug in the supervisor, then it indeed cannot guarantee its fault-tolerance properties anyway. That's why supervisors rarely change: they have been heavily tested for decades and are an essential piece of your application.
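You can check the trap-exit behaviour yourself with something like the sketch below. TrapExitDemo is a hypothetical name and the empty supervisor is only there for illustration; the exit signal is sent from a separate, unrelated process, because a supervisor does shut down (gracefully) when its own parent exits.

```elixir
defmodule TrapExitDemo do
  # Hypothetical demo module, not part of the original application.
  def run do
    {:ok, sup} = Supervisor.start_link([], strategy: :one_for_one)
    Process.unlink(sup)

    # Supervisors set the trap_exit flag when they start.
    {:trap_exit, trapping?} = Process.info(sup, :trap_exit)

    # Send a non-:kill exit signal from an unrelated process and wait for it to finish.
    {sender, ref} = spawn_monitor(fn -> Process.exit(sup, :boom) end)

    receive do
      {:DOWN, ^ref, :process, ^sender, _reason} -> :ok
    end

    # A synchronous call shows that the supervisor is still serving requests.
    %{active: 0} = Supervisor.count_children(sup)
    survived? = Process.alive?(sup)

    :ok = Supervisor.stop(sup)
    {trapping?, survived?}
  end
end
```

TrapExitDemo.run() should return {true, true}: the :boom signal is turned into a message, and the supervisor ignores it because it does not come from one of its children or from its parent.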