I'm building a distributed system running on Azure Container Apps, where graceful shutdowns of containers (e.g., to flush state, finalize jobs, etc.) are important for ensuring data integrity and consistency.
In this setup:
Scaling is handled via KEDA.
Data Ingress is disabled, meaning the app has no HTTP endpoints or health probes configured.
The container does not receive any Kubernetes liveness or readiness probes.
I understand that Azure Container Apps (backed by Kubernetes) typically tries to shut containers down gracefully by sending SIGTERM and allowing them to exit cleanly within a grace period (terminationGracePeriodSeconds or similar).
However, I'm trying to assess how reliable this behavior is in practice:
Are non-graceful shutdowns (e.g., SIGKILL, container crashes, node failures) a common occurrence and something I should defensively design for?
Or are they rare events, only expected in exceptional cases such as hardware faults, power outages, or severe node-level failures?
Is it safe to assume graceful shutdowns are the default and reliable behavior, or should I treat non-graceful termination as the norm?
Any input or real-world experience with Azure Container Apps, KEDA, or similar Kubernetes-based environments (e.g., AKS, GKE, EKS) would be greatly appreciated.
In production, it is best to design for the worst case. Unexpected failures can happen at any time and for many reasons. Azure Container Apps is generally stable, but things can still go wrong, especially during upgrades or system changes.
For example, I recently faced an issue with my cluster. I upgraded the master node and three worker nodes without any problems. But when I started working on the fourth node, something went wrong. The drain command didn’t work, and all my pods got stuck in a terminating state.
I followed the same steps as before, but for some reason, this time it failed. This showed me that even when everything seems fine, unexpected issues can happen.
That’s why it’s important to plan for failures. If you prepare for forced shutdowns, crashes, and node failures, you save time and effort and keep your system running smoothly. It does not matter whether these events happen often or rarely; you should be ready for them.
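To make "be ready for it" concrete, here is a minimal Python sketch of a worker that stops cleanly on SIGTERM but also survives a SIGKILL or node failure, because it records progress after every item instead of only at shutdown. The names (run_worker, save_checkpoint, load_checkpoint, the next_index field) are hypothetical and only for illustration. The checkpoint is a local JSON file here; in a real container app I would assume it lives in durable external storage, since a local file disappears with the replica.

```python
import json
import os
import signal
import tempfile
import threading

# Set by the SIGTERM handler; the work loop checks it between items.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only set a flag, do the real cleanup in the main loop."""
    stop_requested.set()


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, fsync, rename.

    A kill at any point leaves either the old checkpoint or the new one,
    never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_index": 0}


def run_worker(items, checkpoint_path, handle):
    """Process items from the last checkpoint onward; return the next index.

    Progress is saved after every item, so a SIGKILL loses at most the item
    in flight. That item is retried on restart (at-least-once processing),
    which means handle() must be idempotent.
    """
    state = load_checkpoint(checkpoint_path)
    for i in range(state["next_index"], len(items)):
        if stop_requested.is_set():
            break  # graceful path: stop between items, state is already saved
        handle(items[i])
        state["next_index"] = i + 1
        save_checkpoint(checkpoint_path, state)
    return state["next_index"]


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, request_stop)
    with tempfile.TemporaryDirectory() as tmp:
        done = run_worker(["a", "b", "c"], os.path.join(tmp, "checkpoint.json"), print)
        print(f"processed {done} items")
```

The point of this design is that the SIGTERM handler is only an optimization: it lets the worker stop between items instead of mid-item. Correctness does not depend on it, because a restart after a hard kill resumes from the last saved checkpoint and re-runs at most one item.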