I'm using a Helm chart to deploy a Kubernetes 1.27 cluster across 4 RHEL 9 Minimal Install VMs (one controller, three workers). The cluster appears to deploy, but all the pods are stuck in CrashLoopBackOff with connectivity issues, and Redis cannot initialize. The same cluster works fine on RHEL 8 and RHEL 7 VMs. Redis 6.2.12 errors:
Initializing config..
/readonly-config/init.sh: line 84: Could not resolve the announce ip for this pod: not found
Error from server (BadRequest): container "sentinel" in pod "xio-redis-ha-server-0" is waiting to start: PodInitializing
*** FATAL CONFIG FILE ERROR (Redis 6.2.12) ***
Reading the configuration file, at line 2
>>> 'sentinel down-after-milliseconds mymaster 10000'
No such master with specified name.
General connectivity errors from other pods:
Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: name-redis-ha.default.svc.cluster.local/10.42.0.22:6379
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
Caused by: java.net.ConnectException: Connection refused
Caused by: org.redisson.client.RedisConnectionException: Unable to connect to Redis server: name-redis-ha.default.svc.cluster.local/10.42.2.27:6379
I've tried opening all ports, rebooting the servers, restarting the docker service, and twenty other things I've found on various blogs and posts. Curling services from within pods works intermittently; restarting firewalld allows curling from within pods again, but pods still cannot connect to one another. I've also tried each of the available values for the FirewallBackend setting in /etc/firewalld/firewalld.conf, in case the different firewall interfaces were conflicting with each other. The cluster's canal pods say they are set to auto-detect the firewall backend.
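For reference, this is roughly how I was checking the backend and testing pod-to-service connectivity (the service name is from my deployment, and nicolaka/netshoot is just one network-tools image; substitute your own):

# Check which backend firewalld is actually using
grep FirewallBackend /etc/firewalld/firewalld.conf
# TCP connectivity check from a throwaway debug pod
kubectl run nettest --rm -it --image=nicolaka/netshoot -- nc -vz name-redis-ha.default.svc.cluster.local 6379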
I struggled with this for several days, so I ended up making a list of all the settings I had to change on RHEL 9 to run a Kubernetes cluster with Redis successfully. Here's the list! I was running this in an on-premises VMware environment:
firewalld is known to conflict with the cluster (known issue REF1). If pods are crashing post-install, restarting firewalld and docker on all nodes and then deleting the crashing pods may resolve the issue:
sudo systemctl status firewalld
sudo systemctl restart firewalld
sudo systemctl restart docker
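For the pod deletion step, something like the following works (the app=redis-ha label selector matches my chart's pods; adjust for yours):

# Delete the crashing Redis pods so they are recreated cleanly
kubectl delete pod -n default -l app=redis-ha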
If the issue persists, it may be necessary to stop firewalld on all nodes and restart the docker service:
sudo systemctl stop firewalld
sudo systemctl restart docker
If other steps from this troubleshooting guide are implemented, these firewalld steps will need to be repeated afterwards.
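Note that stopping firewalld does not persist across a reboot. If stopping it is what fixes the cluster, disabling it keeps it off permanently:

# Stop firewalld now and prevent it from starting at boot
sudo systemctl disable --now firewalld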
If SELinux is enabled, it may cause connectivity issues and crashing pods. It may be necessary to disable SELinux or set it to permissive. Be sure to replace the existing SELINUX= value rather than adding a second line:
sudo vi /etc/selinux/config
SELINUX=permissive
sudo reboot
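The reboot makes the change permanent, but you can also switch to permissive mode immediately to test whether SELinux is the culprit:

sudo setenforce 0    # takes effect immediately; does not persist across reboots
getenforce           # should now report Permissive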
The VMware Secure Boot feature may interfere with operation of the cluster; it may be necessary to power off each VM and disable Secure Boot in its boot options.
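You can confirm whether Secure Boot is active from inside each node (assuming the mokutil package is installed):

mokutil --sb-state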
NetworkManager is known to conflict with the RKE cluster (known issue REF1) and may cause connectivity issues and crashing pods. It may be necessary to create the /etc/NetworkManager/conf.d/canal.conf file with the following contents on each node, then reload NetworkManager and reboot each node:
sudo systemctl status NetworkManager
sudo vi /etc/NetworkManager/conf.d/canal.conf
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*
sudo systemctl reload NetworkManager
sudo reboot
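After the reboot, the CNI interfaces should show as unmanaged:

nmcli device status    # cali* and flannel* devices should be listed as unmanaged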
It may be necessary to create the /etc/sysctl.d/90-override.conf file with the following contents on each node and then reboot each node (this overrides the breaking rp_filter STIG setting, stipulated by CCE-84008-2, in the /etc/sysctl.d/99-sysctl.conf file). Note that the eth0 line assumes eth0 is the node's primary interface; substitute your own interface name:
sudo vi /etc/sysctl.d/90-override.conf
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
sudo reboot
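The settings can also be applied without a reboot and then verified:

sudo sysctl --system                  # reapply all sysctl.d files in order
sysctl net.ipv4.conf.all.rp_filter    # should report 0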
On each node, disable the nm-cloud-setup service and timer, which can rewrite the node's routes and break pod networking:
sudo systemctl disable --now nm-cloud-setup.service nm-cloud-setup.timer
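Verify both units are now disabled (if they aren't installed at all, systemctl will report not-found, which is also fine):

systemctl is-enabled nm-cloud-setup.service nm-cloud-setup.timer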
REFERENCES: