I have an EKS cluster that uses the external-dns controller to create DNS records in Route53 for ingresses. This had been working seamlessly until recently, when it started deleting and recreating sets of records, causing the apps to go offline and come back online every minute.
Here's an example of my ingress manifest:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-ingress
  namespace: test
  annotations:
    external-dns.alpha.kubernetes.io/hostname: stg.test.domain.com
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: "staging-external"
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
spec:
  ingressClassName: alb
  rules:
    - host: "stg.test.domain.com"
      http:
        paths:
          - pathType: Prefix
            path: /
            backend:
              service:
                name: test-service  # service name
                port:
                  number: 80
Edit: external-dns pod logs:
time="2025-01-10T08:51:45Z" level=debug msg="Refreshing zones list cache"
time="2025-01-10T08:51:45Z" level=debug msg="Considering zone: /hostedzone/<hostedzonename> (domain: domain.com.)"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/service-name"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service flux-system/notification-controller"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service flux-system/source-controller"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service kube-system/metrics-server"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service kube-system/aws-load-balancer-webhook-service"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service external-secrets/external-secrets-webhook"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service flux-system/webhook-receiver"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service default/external-dns"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service default/kubernetes"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service kube-system/eks-extension-metrics-api"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service kube-system/kube-dns"
time="2025-01-10T08:51:46Z" level=debug msg="No endpoints could be generated from service namespace/servicename"
time="2025-01-10T08:51:46Z" level=debug msg="Endpoints generated from ingress: namespace/service-name-ingress: [app1.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app1.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []]"
time="2025-01-10T08:51:46Z" level=debug msg="Endpoints generated from ingress: namespace/servicename-ingress: [app2.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app2.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []]"
time="2025-01-10T08:51:46Z" level=debug msg="Endpoints generated from ingress: namespace/servicename-ingress: [app3.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app3-backend.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app3.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app3-backend.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []]"
time="2025-01-10T08:51:46Z" level=debug msg="Endpoints generated from ingress: namespace/servicename-ingress: [app4.domain.com 300 IN CNAME alb-FQDN.amazonaws.com [] app4.domain.com 300 IN CNAME alb-FQDN.amazonaws.com []]"
time="2025-01-10T08:51:46Z" level=debug msg="Endpoints generated from ingress: namespace/servicename-ingress: [app5.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [] app5.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []]"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app1.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app2.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app3.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app3-backend.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app4.domain.com 300 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Removing duplicate endpoint app5.domain.com 0 IN CNAME alb-FQDN.amazonaws.com []"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app1.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app2.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app3.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app3-backend.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app4.domain.com 300 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app4.domain.com 300 IN A alb-FQDN.amazonaws.com [{alias true}], setting ttl=300"
time="2025-01-10T08:51:46Z" level=debug msg="Modifying endpoint: app5.domain.com 0 IN CNAME alb-FQDN.amazonaws.com [], setting alias=true"
time="2025-01-10T08:51:46Z" level=debug msg="Refreshing zones list cache"
time="2025-01-10T08:51:46Z" level=debug msg="Considering zone: /hostedzone/<hostedzonename> (domain: domain.com.)"
time="2025-01-10T08:51:46Z" level=info msg="Applying provider record filter for domains: [domain.com. .domain.com.]"
time="2025-01-10T08:51:46Z" level=debug msg="Refreshing zones list cache"
time="2025-01-10T08:51:46Z" level=debug msg="Considering zone: /hostedzone/<hostedzoneId> (domain: domain.com.)"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app1.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app1-backend.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app2.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app3.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app4.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app5.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app1.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app1.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app1-backend.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app1-backend.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app2.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app2.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app3.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app3.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app4.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app4.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding app5.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=debug msg="Adding cname-app5.domain.com. to zone domain.com. [Id: /hostedzone/<hostedzoneId>]"
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app3.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app3.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app2.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app2.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app3.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app2.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app1-backend.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app1.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app4.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE cname-app5.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app1-backend.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app1-backend.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app1.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app1.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app4.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app4.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app5.domain.com A" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="Desired change: CREATE app5.domain.com TXT" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
time="2025-01-10T08:51:46Z" level=info msg="18 record(s) were successfully updated" profile=default zoneID=/hostedzone/<hostedzoneId> zoneName=domain.com.
It just keeps repeating these actions.
I figured out what was causing the problem.
I have two almost identical clusters (staging and production), and the external-dns controllers in both use the same hosted zone on Route53, so each has access to all the records there. The logs I hadn't been checking were those of the external-dns controller on the production cluster, which was logging the DELETE events; the staging controller then kept recreating the records.
This was fixed by adding the following argument to the external-dns deployment manifest, so that each external-dns instance only manages the records it created:
containers:
  - name: external-dns
    ## other config ...
    args:
      - --txt-owner-id=unique.staging.cluster.string.id
      ## other args ...
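The --txt-owner-id argument sets the owner ID that external-dns writes into the TXT record accompanying each DNS record it creates. An instance only updates or deletes records whose TXT record carries its own owner ID, so with a distinct ID per cluster the two controllers no longer conflict. Before the change, both instances presumably ran with the same owner ID, so each treated the other's records as its own.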
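For completeness, here is a rough sketch of what the matching args on the production cluster might look like. The owner ID string is a made-up placeholder, and the other flags are assumptions about a typical AWS setup (the domain filter is inferred from the "Applying provider record filter" log line), not a copy of my actual deployment:

```yaml
# Hypothetical production counterpart of the staging fragment.
# The owner ID value is illustrative; any string works as long as it
# differs from the one used by the staging cluster.
containers:
  - name: external-dns
    args:
      - --source=ingress             # assumed: records come from Ingress objects
      - --provider=aws               # assumed: Route53
      - --registry=txt               # ownership is tracked in TXT records
      - --domain-filter=domain.com   # assumed from the record filter in the logs
      - --txt-owner-id=unique.production.cluster.string.id
```

With this in place, the ownership TXT record for a staging hostname should contain something like external-dns/owner=unique.staging.cluster.string.id, and the production instance should skip it because the owner does not match its own ID.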
Thanks to everyone for their time and suggestions.