I did the following to deploy a Helm chart (you can copy and paste my sequence of commands to reproduce this error).
$ flux --version
flux version 0.16.1
$ kubectl create ns traefik
$ flux create source helm traefik --url https://helm.traefik.io/traefik --namespace traefik
$ cat values-6666.yaml
ports:
  traefik:
    healthchecksPort: 6666 # !!! Deliberately wrong port number !!!
$ flux create helmrelease my-traefik --chart traefik --source HelmRepository/traefik --chart-version 9.18.2 --namespace traefik --values=./values-6666.yaml
✚ generating HelmRelease
► applying HelmRelease
✔ HelmRelease created
◎ waiting for HelmRelease reconciliation
✔ HelmRelease my-traefik is ready
✔ applied revision 9.18.2
So Flux reports it as a success, which can be confirmed like this:
$ flux get helmrelease --namespace traefik
NAME        READY  MESSAGE                           REVISION  SUSPENDED
my-traefik  True   Release reconciliation succeeded  9.18.2    False
But in fact, as shown above, values-6666.yaml contains a deliberately wrong port number (6666) for the pod's readiness probe (as well as its liveness probe):
$ kubectl -n traefik describe pod my-traefik-8488cc49b8-qf5zz
...
Type     Reason     ...  From     Message
----     ------     ...  ----     -------
Warning  Unhealthy  ...  kubelet  Liveness probe failed: Get "http://172.31.61.133:6666/ping": dial tcp 172.31.61.133:6666: connect: connection refused
Warning  Unhealthy  ...  kubelet  Readiness probe failed: Get "http://172.31.61.133:6666/ping": dial tcp 172.31.61.133:6666: connect: connection refused
Warning  BackOff    ...  kubelet  Back-off restarting failed container
My goal is to have FluxCD automatically detect the above error. But, as shown above, FluxCD deems it a success.
Either of the following deployment methods would have detected that failure:
$ helm upgrade --wait ...
or
$ argocd app sync ... && argocd app wait ...
So, is there something similar in FluxCD to achieve the same effect?
====================================================================
P.S. The Flux docs here seem to suggest that the equivalent of helm --wait is already the default behaviour in FluxCD. My test above shows that it isn't. Furthermore, in the following example I explicitly set disableWait: false, but the result is the same.
$ cat helmrelease.yaml
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-traefik
  namespace: traefik
spec:
  chart:
    spec:
      chart: traefik
      sourceRef:
        kind: HelmRepository
        name: traefik
      version: 9.18.2
  install:
    disableWait: false # !!! Explicitly set this flag !!!
  interval: 1m0s
  values:
    ports:
      traefik:
        healthchecksPort: 6666
$ kubectl -n traefik create -f helmrelease.yaml
helmrelease.helm.toolkit.fluxcd.io/my-traefik created
## Again, Flux deems it a success:
$ flux get hr -n traefik
NAME        READY  MESSAGE                           REVISION  SUSPENDED
my-traefik  True   Release reconciliation succeeded  9.18.2    False
## Again, the pod actually failed:
$ kubectl -n traefik describe pod my-traefik-8488cc49b8-bmxnv
...  ## Same error as earlier
Helm considers a Deployment with one replica and a rollingUpdate strategy with maxUnavailable: 1 to be ready as soon as it has been deployed, even while its single pod is unavailable. If you test Helm itself, I believe you will find the same behavior in the Helm CLI / Helm SDK package upstream.
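One way to check that upstream behavior directly is to run the same install with the Helm CLI and --wait; this is a minimal sketch (not from the original report), assuming the Traefik repo is added locally and the same values-6666.yaml is used:

$ helm repo add traefik https://helm.traefik.io/traefik
$ helm upgrade --install --wait my-traefik traefik/traefik \
    --namespace traefik --version 9.18.2 --values ./values-6666.yaml

If Helm's own readiness logic treats the Deployment as ready under these conditions, this command also returns success despite the failing probes.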
(Even if the Deployment's one and only pod has entered CrashLoopBackOff and its readiness and liveness probes have all failed... with maxUnavailable: 1 and replicas: 1, the Deployment technically has no more than the allowed number of unavailable pods, so it is considered ready.)
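To illustrate the rollout arithmetic, this is roughly the shape of the Deployment spec in question (an illustrative excerpt using standard Kubernetes fields and the values described above, not the chart's actual rendered manifest):

spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # 1 replica minus 1 allowed-unavailable = 0 pods required to be ready

With that budget, zero ready pods still counts as a successful rollout, which is why helm --wait (and hence the helm-controller) reports the release as ready; lowering maxUnavailable to 0, or raising replicas, is typically what makes the wait actually block on at least one ready pod.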
This question was re-raised recently at: https://github.com/fluxcd/helm-controller/issues/355 and I provided more in-depth feedback there.
Anyway, as for the source of this behavior, which is clearly not what the user wanted (even if it is, arguably, exactly what the user asked for): on the Helm side, this appears to be the same issue reported on GitHub here: