kubernetes, apache-kafka, kubectl, amazon-eks, kube-proxy

Kubernetes: Kafka pod reachability issue from another pod


I know the information below is not enough to trace the issue, but I still want some guidance.

We have an Amazon EKS cluster.

Currently, we are facing a reachability issue with the Kafka pod.

Environment:

Working:

Problem:

The issue seems to be related to kube-proxy. We need help to resolve it.

Can anyone guide me? Can I restart kube-proxy? Will it affect other pods/deployments?


Solution

  • I believe this problem is caused by the TCP-only nature of AWS's NLB (as mentioned in the comments).

    In a nutshell, your pod-to-pod communication fails when hairpin is needed.

    To confirm this is the root cause, verify that when the telnet works, the Kafka pod and the client pod are on different EC2 nodes, and that when they are on the same EC2 node, the telnet fails.

    There are (at least) two approaches to tackle this issue:

    1. Use K8s internal networking - refer to the k8s Service's DNS name

    Every K8s Service has its own DNS FQDN for internal use (meaning the traffic stays on the k8s network only, without going out to the LoadBalancer and back into k8s). You can just telnet this name instead of the NodePort via the LB. E.g. let's assume your Kafka Service is named kafka and lives in the default namespace. Then you can just telnet kafka.default.svc.cluster.local (on the port exposed by the kafka Service).
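    As a quick sketch of how that in-cluster name is built and probed (the Service name `kafka`, namespace `default`, and port 9092 are assumptions; adjust to your cluster):

    ```shell
    # k8s Service DNS names follow <service>.<namespace>.svc.<cluster-domain>
    SERVICE=kafka
    NAMESPACE=default
    FQDN="${SERVICE}.${NAMESPACE}.svc.cluster.local"
    echo "$FQDN"   # kafka.default.svc.cluster.local

    # From inside any pod in the cluster, test reachability without
    # ever leaving the k8s network (no LB hairpin involved):
    # nc -vz "$FQDN" 9092
    ```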

    2. Use K8s anti-affinity to make sure the client and Kafka are never scheduled on the same node.
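    A minimal sketch of what that looks like on the client Deployment, assuming the Kafka pods carry the label `app: kafka` (the Deployment name, labels, and image here are illustrative assumptions):

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: kafka-client
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: kafka-client
      template:
        metadata:
          labels:
            app: kafka-client
        spec:
          affinity:
            podAntiAffinity:
              # Hard rule: never schedule this pod on a node that already
              # runs a pod labeled app: kafka
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values: ["kafka"]
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: client
              image: busybox
              command: ["sleep", "infinity"]
    ```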

    Oh, and as indicated in this answer, you might need to make that Service headless.
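    A headless Service is simply one with `clusterIP: None`; its DNS name then resolves directly to the backing pod IPs instead of a virtual Service IP. A sketch, again assuming the Service is named `kafka` and the pods are labeled `app: kafka`:

    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: kafka
    spec:
      clusterIP: None   # headless: DNS returns pod IPs directly
      selector:
        app: kafka
      ports:
        - name: broker
          port: 9092
          targetPort: 9092
    ```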