docker, networking, docker-compose, containers, zabbix

Dockerized Zabbix: Server Can't Connect to the Agents by IP


Problem:

I'm trying to configure a fully containerized Zabbix 6.0 monitoring system on Ubuntu 20.04 LTS using Zabbix's Docker Compose repo found HERE.

The command I used to raise the Zabbix Server and a Zabbix Agent is:

docker-compose -f docker-compose_v3_ubuntu_pgsql_latest.yaml --profile all up -d

Although the Agent comes up in a broken state and shows a "red" status, once I change its IP address FROM 127.0.0.1 TO 172.16.239.6 (the default IP Docker Compose assigns to it) the Zabbix Server can successfully connect and monitoring is established. HOWEVER: the Zabbix Server cannot connect to any other Dockerized Zabbix Agents on REMOTE hosts, which are raised with the docker run command:

docker run --add-host=zabbix-server:172.16.238.3 -p 10050:10050 -d --privileged --name DockerHost3-zabbix-agent -e ZBX_SERVER_HOST="zabbix-server" -e ZBX_PASSIVE_ALLOW="true" zabbix/zabbix-agent:ubuntu-6.0-latest

NOTE: I looked at other Stack Exchange sites to post this question, but Stack Overflow appeared to be the go-to site for these Docker/Zabbix issues, having over 30 such questions.

Troubleshooting:

Comparative Analysis:

Agent Configuration:

Comparative analysis of the working ("green") Agent on the same host as the Zabbix Server against the Agents on different hosts showing "red" statuses (not contactable by the Zabbix Server), using the following commands, shows the configurations have parity.

docker exec -u root -it (ID of agent container returned from "docker ps") bash

And then execute:

grep -Ev '^(#|$)' /etc/zabbix/zabbix_agentd.conf
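
To speed up that comparison, you can capture each Agent's effective config to a file and diff them; a rough sketch with placeholder container IDs (copy one file between hosts with scp or similar):

# On the host running the "green" Agent:
docker exec -u root <green-agent-container-ID> grep -Ev '^(#|$)' /etc/zabbix/zabbix_agentd.conf > green.conf

# On a host running a "red" Agent:
docker exec -u root <red-agent-container-ID> grep -Ev '^(#|$)' /etc/zabbix/zabbix_agentd.conf > red.conf

# Compare the two:
diff green.conf red.conf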

Ports:

The same ports were showing as open on the "red" Agents as on the "green" Agent running on the same host as the Zabbix Server, per the output of the command:

ss -lntu

NOTE: This command was issued from the HOST, not the Docker container for the Agent.
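
As an additional check (not part of the original troubleshooting), you can also test that the Agent port is actually reachable from the Zabbix Server's host; nc is assumed to be installed:

# From the host running the Zabbix Server container, test TCP 10050 on a remote Agent host:
nc -zv <remote-agent-host-IP> 10050

# And on the Agent's host, confirm something is listening on 10050:
ss -lntu | grep 10050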

Firewalling:

Review of the iptables rules from the HOST (not container) using the following command didn't reveal anything of concern:

iptables -nvx -L --line-numbers

But to exclude firewalling, I nonetheless allowed everything in the iptables FORWARD chain on both the Zabbix Server host and a "red"-status Agent host used for testing.
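
For reference, "allowed everything" in the FORWARD chain amounts to something like the following; this is a sketch only, so adapt (and later revert) it for your own environment:

# Run as root/sudo on both the Zabbix Server host and a "red" Agent host.
# Either set the default policy to ACCEPT:
iptables -P FORWARD ACCEPT
# ...or insert an accept-all rule at the top of the chain:
iptables -I FORWARD 1 -j ACCEPT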

I also allowed everything on the MikroTik GW router connecting the Zabbix Server to the different physical hosts running the Zabbix Agents.

Routing:

The Zabbix server can ping remote Agent interfaces proving there's a route to the Agents.
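
A simple way to verify that (the zabbix-server image may not ship ping, in which case test from the host instead):

# From inside the Zabbix Server container, if ping is available in the image:
docker exec -it <zabbix-server-container-ID> ping -c 3 <remote-agent-host-IP>

# Otherwise, from the Zabbix Server's host:
ping -c 3 <remote-agent-host-IP>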

AppArmor:

I also stopped AppArmor to rule it out as a cause:

sudo systemctl stop apparmor
sudo systemctl status apparmor

Summary:

So everything is wide open, the Zabbix Server can route to the Agents, and the config of the "red" Agents has parity with the config of the "green" Agent living on the same host as the Zabbix Server itself.

I've set up non-containerized Zabbix installations in production environments successfully, so I'm otherwise familiar with Zabbix.

Why can't the containerized Zabbix Server connect to the containerized Zabbix Agents on different hosts?


Solution

  • Short Answer:

    There was NOTHING wrong with the Zabbix config; this was a Docker-induced problem.

    Reviewing docker logs for a remote Zabbix Agent container revealed that there appeared to be NAT'ing happening on the Zabbix SERVER's side, and indeed there was.

    Docker was modifying the iptables NAT table on the host running the Zabbix Server container, causing the source address of the Zabbix Server to present as the IP of the physical host itself rather than the Docker-Compose-assigned IP address of 172.16.238.3.

    Thus, the Agent was not expecting this address and refused the connection. My experience of Dockerized apps is that they are mostly good at modifying iptables to create the correct connectivity, but not in this particular case ;-).
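
    If you want to see the NAT'ing directly rather than inferring it from the Agent's log, a packet capture on the remote Agent's host shows exactly which source address the connections arrive from; the interface name below is a placeholder:

    # On the remote Agent's HOST, watch incoming connections to the Agent port:
    tcpdump -ni <host-interface> 'tcp port 10050'
    # If the source IP shown is the Zabbix Server HOST's address rather than
    # 172.16.238.3, the Server's traffic is being NAT'ed on the way out.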

    I then reviewed the NAT table by executing the following command on the HOST (not the container):

    iptables -t nat -nvx -L --line-numbers
    

    This revealed that Docker was being, erm, "helpful" and NAT'ing the Zabbix Server's traffic.

    I deleted the offending rules by their rule number:

    iptables -t nat -D <chain> <rule #>
    

    After which the Zabbix Server's IP address was presented correctly to the Agents, which then accepted the connections, and their statuses turned "green".

    The problem is reproducible if you execute:

    docker-compose -f docker-compose_v3_ubuntu_pgsql_latest.yaml down
    

    And then run the up command, raising the containers again, and you'll see the offending iptables rules restored to the NAT table of the host running the Zabbix Server's container, breaking connectivity with the Agents.

    Longer Answer:

    Below are the steps required to identify and resolve the problem of the Zabbix Server NAT'ing its traffic out of the host's IP:

    Identify If the HOST of the Zabbix Server container is NAT'ing:

    We need to see how the IP of the Zabbix Server's container presents to the Agents, so we have to get the container ID for a Zabbix AGENT to review its logs:

    docker ps
    CONTAINER ID   IMAGE                                   COMMAND                  CREATED       STATUS       PORTS                                           NAMES
    b2fcf38d601f   zabbix/zabbix-agent:ubuntu-6.0-latest   "/usr/bin/tini -- /u…"   5 hours ago   Up 5 hours   0.0.0.0:10050->10050/tcp, :::10050->10050/tcp   DockerHost3-zabbix-agent
    

    Next, supply the container ID of the Agent to the docker logs command:

    docker logs b2fcf38d601f
    

    Then review the rejected IP address in the log output to determine whether it is NOT the Zabbix Server's IP:

    81:20220328:000320.589 failed to accept an incoming connection: connection from "NAT'ed IP" rejected, allowed hosts: "zabbix-server"
    

    The fact that you can see this error proves that there are no routing or connectivity issues: the connection is going through, it's just being rejected by the application, NOT the firewall.

    If NAT'ing is proven, continue to the next step.

    On Zabbix SERVER's Host:

    The remediation happens on the Zabbix Server's host itself, not the Agents, which is good because we can fix the problem in one place versus many.

    Execute the below command on the host running the Zabbix Server's container:

    iptables -t nat -nvx -L --line-numbers
    

    Output of command:

    Chain POSTROUTING (policy ACCEPT 88551 packets, 6025269 bytes)
    num      pkts      bytes target     prot opt in     out     source               destination         
    1           0        0 MASQUERADE  all  --  *      !br-abeaa5aad213  192.168.24.128/28    0.0.0.0/0           
    2       73786  4427208 MASQUERADE  all  --  *      !br-05094e8a67c0  172.16.238.0/24      0.0.0.0/0  
    
    Chain DOCKER (2 references)
    num      pkts      bytes target     prot opt in     out     source               destination         
    1           0        0 RETURN     all  --  br-abeaa5aad213 *       0.0.0.0/0            0.0.0.0/0           
    2          95     5700 RETURN     all  --  br-05094e8a67c0 *       0.0.0.0/0            0.0.0.0/0
    

    We can see the counters are incrementing for rule #2 in both the "POSTROUTING" and "DOCKER" chains. These rules are clearly matching and having an effect.
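
    If you want to confirm this in real time, watching the counters while the Server polls the Agents makes it obvious (watch is assumed to be installed):

    # On the Zabbix Server's host, watch the NAT counters update as the Server polls Agents:
    watch -n 2 'iptables -t nat -nvx -L POSTROUTING --line-numbers'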

    Delete the offending rules on the HOST of the Zabbix Server container, which is NAT'ing its traffic to the Agents:

    sudo iptables -t nat -D POSTROUTING 2
    sudo iptables -t nat -D DOCKER 2
    

    Wait a few moments and the Agents should now go "green", assuming there are no other configuration or firewalling issues. If the Agents remain "red" after applying the fix, please work through the troubleshooting steps I documented in the Question section.
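
    You can also confirm the fix from the Agent side before the frontend catches up by tailing the Agent container's log and checking that the "connection ... rejected" messages have stopped:

    # On the remote Agent's host:
    docker logs -f <agent-container-ID>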

    Conclusion:

    I've tested this, and restarting the zabbix-server container does not recreate the deleted rules. But again, please note that a docker-compose down followed by a docker-compose up WILL recreate the deleted rules and break Agent connectivity.
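
    If you'd rather not re-delete the rules by hand after every docker-compose up, a small wrapper script can automate the same manual fix; this is only a sketch, with the subnet taken from the output above, so adjust it to whatever your own NAT table shows:

    #!/bin/bash
    # Hypothetical helper: raise the stack, then remove the MASQUERADE rule Docker
    # adds for the Zabbix Server's compose network (172.16.238.0/24 in my case).
    docker-compose -f docker-compose_v3_ubuntu_pgsql_latest.yaml --profile all up -d

    # Find the offending POSTROUTING rule by number and delete it (run as root/sudo):
    RULE_NUM=$(iptables -t nat -L POSTROUTING --line-numbers -n | awk '/MASQUERADE/ && /172\.16\.238\.0\/24/ {print $1; exit}')
    if [ -n "$RULE_NUM" ]; then
        iptables -t nat -D POSTROUTING "$RULE_NUM"
    fi
    # The matching RETURN rule in the DOCKER chain can be removed the same way if you
    # found you needed to delete it as well.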

    Hope this saves other folks some wasted cycles. I'm both a Linux and a network engineer and this hurt my head, so it would be near impossible to resolve if you're not a dab hand with networking.