elasticsearch kubernetes network-programming kubernetes-networking

Pods in Kubernetes can't see each other (Temporary failure in name resolution, even for kubernetes.default.svc.cluster.local)


I deployed a Fluent Bit DaemonSet and an Elasticsearch StatefulSet (exposed via a NodePort Service) using these manifests:

fluent-bit-daemonset.yml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
      version: v1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.2
        ports:
        - containerPort: 2020
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch-0.elasticsearch.default.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT 
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule

elasticsearch.yml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: elasticsearch
    component: elasticsearch
    release: elasticsearch
  name: elasticsearch
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch
      component: elasticsearch
      release: elasticsearch
  serviceName: elasticsearch
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: elasticsearch
        component: elasticsearch
        release: elasticsearch
    spec:
      containers:
      - env:
        - name: cluster.name
          value: dev-cluster
        - name: discovery.type
          value: single-node
        - name: ES_JAVA_OPTS
          value: -Xms512m -Xmx512m
        - name: bootstrap.memory_lock
          value: "false"
        - name: xpack.security.enabled
          value: "false"
        image: elasticsearch:8.12.0
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        resources:
          limits:
            cpu: 1
            memory: 3Gi
          requests:
            cpu: 250m
            memory: 512Mi
        securityContext:
          privileged: true
          runAsUser: 1000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-data
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        # run all steps as one script: `sh -c` only executes its first argument
        - |
          chown -R 1000:1000 /usr/share/elasticsearch/data
          sysctl -w vm.max_map_count=262144
          chmod 777 /usr/share/elasticsearch/data
          chmod 777 /usr/share/elasticsearch/data/node
          chmod g+rwx /usr/share/elasticsearch/data
          chgrp 1000 /usr/share/elasticsearch/data
        image: busybox:1.29.2
        imagePullPolicy: IfNotPresent
        name: set-dir-owner
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 10
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: elasticsearch-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

elastic-service.yml:

---
apiVersion: v1
kind: Service
metadata:
  name: eks-srv
spec:
  selector:
    app: elasticsearch
    component: elasticsearch
  ports:
    - name: db
      protocol: TCP
      port: 9200
      targetPort: 9200
    - name: monitoring
      protocol: TCP
      port: 9300
      targetPort: 9300
  type: NodePort


But when I check the Fluent Bit pod's logs (k logs pods/fluent-bit-bwcl6), I get the following:

[2024/01/31 04:58:05] [ info] [engine] started (pid=1)
[2024/01/31 04:58:06] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2024/01/31 04:58:06] [ info] [filter_kube] local POD info OK
[2024/01/31 04:58:06] [ info] [filter_kube] testing connectivity with API server...
[2024/01/31 04:58:06] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:06] [error] [filter_kube] upstream connection error
[2024/01/31 04:58:06] [ warn] [filter_kube] could not get meta for POD fluent-bit-bwcl6
[2024/01/31 04:58:06] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2024/01/31 04:58:11] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:11] [error] [filter_kube] upstream connection error
[2024/01/31 04:58:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:21] [error] [filter_kube] upstream connection error
[2024/01/31 04:58:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:21] [error] [filter_kube] upstream connection error
[2024/01/31 04:58:31] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:31] [error] [filter_kube] upstream connection error
getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:41] [error] [filter_kube] upstream connection error
[2024/01/31 04:58:51] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:58:51] [error] [filter_kube] upstream connection error
[2024/01/31 04:59:16] [ warn] net_tcp_fd_connect: getaddrinfo(host='elasticsearch-0.elasticsearch.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:21] [ warn] net_tcp_fd_connect: getaddrinfo(host='elasticsearch-0.elasticsearch.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:27] [ warn] net_tcp_fd_connect: getaddrinfo(host='elasticsearch-0.elasticsearch.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:37] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:37] [error] [filter_kube] upstream connection error
[2024/01/31 04:59:37] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:37] [error] [filter_kube] upstream connection error
[2024/01/31 04:59:47] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:47] [error] [filter_kube] upstream connection error
[2024/01/31 04:59:47] [ warn] net_tcp_fd_connect: getaddrinfo(host='kubernetes.default.svc.cluster.local'): Temporary failure in name resolution
[2024/01/31 04:59:47] [error] [filter_kube] upstream connection error
[2024/01/31 04:59:57] [ warn] net_tcp_fd_connect: getaddrinfo(host='elasticsearch-0.elasticsearch.default.svc.cluster.local'): Temporary failure in name resolution

...

Why can't it see the Elasticsearch?

Both are deployed in the same namespace (default).
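
To rule out Fluent Bit itself, cluster DNS can be checked from a throwaway pod (a sketch; the busybox tag is arbitrary):

  # If this also fails, the problem is cluster DNS (CoreDNS / node-local-dns),
  # not Fluent Bit:
  kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
    nslookup kubernetes.default.svc.cluster.local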


Solution

  • The problem was twofold:

    1. My cluster's DNS (or maybe my whole VM's DNS) was broken. So I:

      1. Deleted the node-local-dns pods, because they had cached the wrong upstream DNS servers for my node (I run a single-node cluster, so I don't need node-local-dns as a per-node cache).
      2. Set the correct new upstream DNS servers in the CoreDNS manifest, roughly as sketched below. This solved the failure to resolve kubernetes.default.svc.cluster.local.
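
      In kubectl terms, roughly (a sketch; the node-local-dns label and the coredns object names are the upstream defaults, adjust to your cluster):

        # Drop the node-local-dns pods that cached stale upstream servers
        # (or delete the DaemonSet entirely if you don't need the cache):
        kubectl -n kube-system delete pods -l k8s-app=node-local-dns

        # Fix the upstream resolvers on the `forward` line of the Corefile,
        # then restart CoreDNS so it picks up the change:
        kubectl -n kube-system edit configmap coredns
        kubectl -n kube-system rollout restart deployment coredns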
    2. I had another problem, which kept Elasticsearch from being resolved: as you can see in my Fluent Bit manifest, I had set:

    env:
    - name: FLUENT_ELASTICSEARCH_HOST
      value: "elasticsearch-0.elasticsearch.default.svc.cluster.local"
    

    This is WRONG. It should be the Elasticsearch Service's name (in my case eks-srv, as defined in elastic-service.yml above). The per-pod name elasticsearch-0.elasticsearch.default.svc.cluster.local would only exist if a headless Service named elasticsearch (matching the StatefulSet's serviceName) existed; my only Service is the NodePort eks-srv, so the correct value is:

    value: "eks-srv.default.svc.cluster.local"
    

    And now my Fluent Bit can see Elasticsearch through the eks-srv Service.
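
    For completeness: if you do want the stable per-pod hostname elasticsearch-0.elasticsearch.default.svc.cluster.local, the StatefulSet's serviceName needs a matching headless Service, which I never created. A minimal sketch, reusing the labels from my manifests:

      apiVersion: v1
      kind: Service
      metadata:
        name: elasticsearch   # must match the StatefulSet's serviceName
      spec:
        clusterIP: None       # headless: this is what creates the per-pod DNS records
        selector:
          app: elasticsearch
          component: elasticsearch
        ports:
        - name: http
          port: 9200
        - name: transport
          port: 9300

    Either way, the fix can be verified with the same kind of throwaway pod as above:

      kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
        nslookup eks-srv.default.svc.cluster.local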