
Kubernetes 1.27 cluster connectivity issues on RHEL 9 Minimal Build?


I'm using a Helm chart to deploy my Kubernetes 1.27 cluster across four RHEL 9 Minimal Build VMs (one controller, three workers). The cluster seems to deploy, but all the pods are in CrashLoopBackOff with connectivity issues, and Redis cannot initialize. The same cluster works fine on RHEL 8 and RHEL 7 VMs. Redis 6.2.12 errors:

Initializing config..
/readonly-config/init.sh: line 84: Could not resolve the announce ip for this pod: not found

Error from server (BadRequest): container "sentinel" in pod "xio-redis-ha-server-0" is waiting to start: PodInitializing

*** FATAL CONFIG FILE ERROR (Redis 6.2.12) ***
Reading the configuration file, at line 2
>>> 'sentinel down-after-milliseconds mymaster 10000'
No such master with specified name.

General connectivity errors from other pods:

Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: name-redis-ha.default.svc.cluster.local/10.42.0.22:6379
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
Caused by: java.net.ConnectException: Connection refused
Caused by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379

I've tried opening all ports, rebooting the servers, restarting the Docker service, and twenty other things I've found on various blogs and posts. Curling services from within pods works intermittently: restarting firewalld allows curling from within pods, but pods still cannot connect to one another. I've also tried setting the iptablesBackend option in firewalld.conf to each of the available values, in case the various firewall interfaces are conflicting with each other. The cluster's Canal pods report that they auto-detect the firewall backend.


Solution

  • I struggled with this for several days, so I ended up making a list of all the settings I had to change on RHEL 9 to run a Kubernetes cluster with Redis successfully. Here's the list! I was running this in an on-premises VMware environment:

    FIREWALLD

    firewalld is known to conflict with the cluster (known issue REF1). If pods are crashing post-install, restarting firewalld and Docker on all nodes and then deleting the crashing pods may resolve the issue:

        sudo service firewalld status
        sudo service firewalld restart
        sudo service docker restart
    

    If the issue persists, it may be necessary to stop firewalld on all nodes and restart the docker service:

        sudo service firewalld stop
        sudo service docker restart
    

    If other steps from this troubleshooting guide are implemented, these firewalld steps will need to be repeated afterwards.
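    If stopping firewalld outright is not acceptable in your environment, a middle ground is to open only the ports the cluster actually uses. The sketch below assumes an RKE2 server with the Canal CNI, with the port list taken from the RKE2 docs; trim it for your own setup (8472/udp is the VXLAN port that pod-to-pod traffic rides on, and is the usual culprit when pods cannot reach each other):

```shell
# Open the cluster's ports instead of stopping firewalld entirely.
# Assumption: RKE2 server node + Canal CNI; adjust the list for your setup.
PORTS="6443/tcp 9345/tcp 10250/tcp 2379-2380/tcp 8472/udp 30000-32767/tcp"
for p in $PORTS; do
  echo "opening $p"
  # Guarded so the loop is a harmless no-op on hosts without firewalld
  if command -v firewall-cmd >/dev/null 2>&1; then
    sudo firewall-cmd --permanent --add-port="$p"
  fi
done
if command -v firewall-cmd >/dev/null 2>&1; then
  sudo firewall-cmd --reload
fi
```

    Worker-only nodes do not need 6443/tcp or the etcd ports; they are included here because the answer's environment has a single controller.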

    SELINUX

    If SELinux is enabled, it may cause connectivity issues and crashing pods. It may be necessary to disable SELinux or set it to permissive. Be sure to overwrite the original SELINUX= value rather than adding a second line:

        sudo vi /etc/selinux/config
    
    SELINUX=permissive
    
        sudo reboot
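    The same edit can be made non-interactively. Below it is demonstrated on a scratch copy of the config so it can be dry-run safely (on a real node, target /etc/selinux/config; this assumes the stock file contains a `SELINUX=enforcing` line). `sudo setenforce 0` additionally switches the running system to permissive right away, without waiting for the reboot:

```shell
# Demonstrate the exact edit on a scratch copy of the config
# (assumption: the stock file contains a SELINUX=enforcing line).
cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' "$cfg"
grep '^SELINUX=' "$cfg"    # prints: SELINUX=permissive
# On a real node:
#   sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
#   sudo setenforce 0   # permissive immediately, without a reboot
```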
    

    SECUREBOOT

    The VMware Secure Boot feature may interfere with operation of the cluster. It may be necessary to disable Secure Boot in each VM's boot options and power the VM back on.

    NETWORK MANAGER

    NetworkManager is known to conflict with the RKE cluster (known issue REF1) and may cause connectivity issues and crashing pods. It may be necessary to create the /etc/NetworkManager/conf.d/canal.conf file with the following contents on each node, then reload NetworkManager and reboot each node:

        sudo systemctl status NetworkManager
        sudo vi /etc/NetworkManager/conf.d/canal.conf
    
    [keyfile]
    unmanaged-devices=interface-name:cali*;interface-name:flannel*
    
        sudo systemctl reload NetworkManager
        sudo reboot
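    After the reboot, you can confirm that NetworkManager is leaving Canal's interfaces alone. This check assumes `nmcli` is available and simply lists the CNI devices with their state, which should read `unmanaged`:

```shell
# List Canal/Calico interfaces and their NetworkManager state.
# Guarded so this is a harmless no-op where nmcli is not installed.
checked=no
if command -v nmcli >/dev/null 2>&1; then
  nmcli -t -f DEVICE,STATE device | grep -E '^(cali|flannel)' || true
fi
checked=yes
```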
    

    SYSCTL RP_FILTER

    It may be necessary to create the /etc/sysctl.d/90-override.conf file with the following contents on each node and then reboot each node (this overrides the breaking STIG setting in the /etc/sysctl.d/99-sysctl.conf file, stipulated by CCE-84008-2):

        sudo vi /etc/sysctl.d/90-override.conf
    
    net.ipv4.conf.all.rp_filter = 0
    net.ipv4.conf.default.rp_filter = 0
    net.ipv4.conf.eth0.rp_filter = 0
    net.ipv4.conf.lo.rp_filter = 0
    
        sudo reboot
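    A non-interactive version of the same step is sketched below against a scratch path so it can be dry-run (write to /etc/sysctl.d/90-override.conf on a real node). `sudo sysctl --system` then applies the override immediately instead of waiting for the reboot:

```shell
# Write the override and sanity-check that every rp_filter key is 0.
# CONF points at a scratch file here; use /etc/sysctl.d/90-override.conf
# on a real node.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
EOF
# Exit non-zero if any value is not 0
awk -F' *= *' '$2 != "0" {exit 1}' "$CONF" && echo "override OK"
# Then apply on a real node without a reboot: sudo sysctl --system
```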
    

    NM-CLOUD SERVICE AND TIMER

    The nm-cloud-setup service and its timer can also interfere with the cluster's network configuration (known issue REF1). Disable both on each node:

        sudo systemctl disable --now nm-cloud-setup.service nm-cloud-setup.timer
    

    REFERENCES:

    1. https://docs.rke2.io/known_issues