
Docker Swarm mesh routing doesn't work for independent subnets


I have a manager node and a worker node. The manager is in the cloud and the worker is my personal computer, so they are on different subnets. Both are listed as ACTIVE.

My main problem is that creating a service and scaling it works as intended (both the manager and the worker start a container, etc.), but the routing mesh doesn't work. The container runs a simple ping-pong type of server. If the scale is 1 and only the manager has the container running, then I should be able to cURL my worker and get the response from the manager through the worker node, right?

Load balancing works as expected if there is only one worker running many containers, but if there are 3 workers with 3 containers distributed among them, load balancing does not work.

I made sure that the needed ports are open:

    IP Address    Start Port  End Port  Start Port  End Port  Protocol  Description  Enabled
    192.168.0.20  8080        8080      8080        8080      Both      test-port    Yes
    192.168.0.20  7946        7946      7946        7946      Both                   Yes
    192.168.0.20  4789        4789      4789        4789      UDP                    Yes
    192.168.0.20  1234        1234      1234        1234      Both                   Yes
    192.168.0.20  2377        2377      2377        2377      TCP                    Yes
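A forwarding rule on the router does not by itself show that a port is reachable from the other node. As a minimal sketch (not part of the original question), the Python below attempts a TCP connection to the two TCP ports Swarm uses, 2377 and 7946. The names `tcp_port_open`, `probe`, and `SWARM_TCP_PORTS` are hypothetical helpers. UDP 7946 and UDP 4789 cannot be verified this way because UDP has no handshake; those need a packet capture on the receiving side.

```python
import socket

# TCP ports used by Swarm mode: 2377 (cluster management), 7946 (node communication).
# UDP 7946 and UDP 4789 (overlay/VXLAN traffic) are not covered by this check.
SWARM_TCP_PORTS = {2377: "cluster management", 7946: "node communication"}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host: str) -> dict:
    """Map each Swarm TCP port to whether it accepted a connection."""
    return {port: tcp_port_open(host, port) for port in SWARM_TCP_PORTS}
```

Run `probe(...)` from each node against the address the other node is expected to be reachable at. Port 2377 is normally only expected to answer on a manager, so a `False` for 2377 on a worker is not by itself a problem.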

When I inspect the ingress network, both the worker and the manager appear in the Peers attribute, but the worker is listed with its local IP address.


    "Peers": [
        {
            "Name": "1fc94f7e314e",
            "IP": "95.***.***.***"
        },
        {
            "Name": "85d4a1a1b3f2",
            "IP": "192.168.0.20"
        }
    ]
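The second peer is the one to look at: 192.168.0.20 falls in an RFC 1918 private range, so a manager in the cloud has no route to it. A minimal sketch of that check over a `Peers` list follows. `unroutable_peers` is a hypothetical helper, and 203.0.113.10 is a documentation placeholder standing in for the masked public address.

```python
import ipaddress

# RFC 1918 private ranges; addresses in these are not routable over the internet.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def unroutable_peers(peers):
    """Return names of peers whose advertised IP is an RFC 1918 private address."""
    flagged = []
    for peer in peers:
        addr = ipaddress.ip_address(peer["IP"])
        if any(addr in net for net in RFC1918):
            flagged.append(peer["Name"])
    return flagged


peers = [
    {"Name": "1fc94f7e314e", "IP": "203.0.113.10"},  # placeholder for the masked public IP
    {"Name": "85d4a1a1b3f2", "IP": "192.168.0.20"},
]
print(unroutable_peers(peers))  # ['85d4a1a1b3f2']
```

The address a node advertises to the rest of the swarm is controlled by the `--advertise-addr` flag of `docker swarm join`. Whether advertising a public address is enough here depends on the NAT between the two nodes, so treat that as something to test rather than a guaranteed fix.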

Edit: added a tcpdump capture of port 7946. Port 4789 was silent.


        94.***.***.***.35388 > 95.***.***.***.7946: Flags [P.], cksum 0xea01 (correct), seq 1:302, ack 1, win 502, options [nop,nop,TS val 311036725 ecr 954017851], length 301
    14:21:04.266975 IP (tos 0x0, ttl 64, id 54940, offset 0, flags [DF], proto TCP (6), length 52)
        95.***.***.***.7946 > 94.***.***.***.35388: Flags [.], cksum 0x15df (incorrect -> 0x6e42), ack 302, win 507, options [nop,nop,TS val 954017925 ecr 311036725], length 0
    14:21:04.267014 IP (tos 0x0, ttl 47, id 49773, offset 0, flags [DF], proto TCP (6), length 52)
        94.***.***.***.35388 > 95.***.***.***.7946: Flags [.], cksum 0x6fbf (correct), ack 1, win 502, options [nop,nop,TS val 311036724 ecr 954017851], length 0
    14:21:04.267028 IP (tos 0x0, ttl 64, id 54941, offset 0, flags [DF], proto TCP (6), length 52)
        95.***.***.***.7946 > 94.***.***.***.35388: Flags [.], cksum 0x15df (incorrect -> 0x6e42), ack 302, win 507, options [nop,nop,TS val 954017925 ecr 311036725], length 0
    14:21:04.267720 IP (tos 0x0, ttl 64, id 54942, offset 0, flags [DF], proto TCP (6), length 328)
        95.***.***.***.7946 > 94.***.***.***.35388: Flags [P.], cksum 0x16f3 (incorrect -> 0xb31d), seq 1:277, ack 302, win 507, options [nop,nop,TS val 954017925 ecr 311036725], length 276
    14:21:04.267815 IP (tos 0x0, ttl 64, id 54943, offset 0, flags [DF], proto TCP (6), length 52)
        95.***.***.***.7946 > 94.***.***.***.35388: Flags [F.], cksum 0x15df (incorrect -> 0x6d2d), seq 277, ack 302, win 507, options [nop,nop,TS val 954017925 ecr 311036725], length 0
    14:21:04.341436 IP (tos 0x0, ttl 47, id 49775, offset 0, flags [DF], proto TCP (6), length 52)
        94.***.***.***.35388 > 95.***.***.***.7946: Flags [.], cksum 0x6cea (correct), ack 277, win 501, options [nop,nop,TS val 311036799 ecr 954017925], length 0
    14:21:04.341516 IP (tos 0x0, ttl 47, id 49776, offset 0, flags [DF], proto TCP (6), length 52)
        94.***.***.***.35388 > 95.***.***.***.7946: Flags [F.], cksum 0x6ce8 (correct), seq 302, ack 278, win 501, options [nop,nop,TS val 311036799 ecr 954017925], length 0
    14:21:04.341554 IP (tos 0x0, ttl 64, id 54944, offset 0, flags [DF], proto TCP (6), length 52)
        95.***.***.***.7946 > 94.***.***.***.35388: Flags [.], cksum 0x15df (incorrect -> 0x6c98), ack 303, win 507, options [nop,nop,TS val 954017999 ecr 311036799], length 0
    14:21:04.572411 IP (tos 0x0, ttl 64, id 31955, offset 0, flags [DF], proto UDP (17), length 115)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87
    14:21:04.772361 IP (tos 0x0, ttl 64, id 31958, offset 0, flags [DF], proto UDP (17), length 115)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87
    14:21:04.972568 IP (tos 0x0, ttl 64, id 31990, offset 0, flags [DF], proto UDP (17), length 115)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87
    14:21:05.172449 IP (tos 0x0, ttl 64, id 32014, offset 0, flags [DF], proto UDP (17), length 115)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87
    14:21:05.372687 IP (tos 0x0, ttl 64, id 32045, offset 0, flags [DF], proto UDP (17), length 150)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 122
    14:21:05.416490 IP (tos 0x0, ttl 47, id 64487, offset 0, flags [DF], proto UDP (17), length 86)
        94.***.***.***.7946 > 95.***.***.***.7946: UDP, length 58
    14:21:05.416902 IP (tos 0x0, ttl 64, id 16979, offset 0, flags [DF], proto UDP (17), length 77)
        95.***.***.***.7946 > 94.***.***.***.7946: UDP, length 49
    14:21:05.873535 IP (tos 0x0, ttl 64, id 24571, offset 0, flags [DF], proto TCP (6), length 60)
        95.***.***.***.52398 > 192.168.0.20.7946: Flags [S], cksum 0x272d (incorrect -> 0x98a2), seq 1269859057, win 64240, options [mss 1460,sackOK,TS val 2080921355 ecr 0,nop,wscale 7], length 0
    14:21:06.875553 IP (tos 0x0, ttl 64, id 24572, offset 0, flags [DF], proto TCP (6), length 60)
        95.***.***.***.52398 > 192.168.0.20.7946: Flags [S], cksum 0x272d (incorrect -> 0x94b8), seq 1269859057, win 64240, options [mss 1460,sackOK,TS val 2080922357 ecr 0,nop,wscale 7], length 0
    14:21:07.067513 IP (tos 0x0, ttl 64, id 35091, offset 0, flags [DF], proto TCP (6), length 60)
        95.***.***.***.52396 > 192.168.0.20.7946: Flags [S], cksum 0x272d (incorrect -> 0x85cc), seq 2586264232, win 64240, options [mss 1460,sackOK,TS val 2080922549 ecr 0,nop,wscale 7], length 0
    14:21:07.372802 IP (tos 0x0, ttl 64, id 25130, offset 0, flags [DF], proto TCP (6), length 60)
        95.***.***.***.52400 > 192.168.0.20.7946: Flags [S], cksum 0x272d (incorrect -> 0xf2ef), seq 603998839, win 64240, options [mss 1460,sackOK,TS val 2080922854 ecr 0,nop,wscale 7], length 0
    14:21:07.416945 IP (tos 0x0, ttl 47, id 64591, offset 0, flags [DF], proto UDP (17), length 86)
        94.***.***.***.7946 > 95.***.***.***.7946: UDP, length 58
    14:21:07.417352 IP (tos 0x0, ttl 64, id 17071, offset 0, flags [DF], proto UDP (17), length 77)
        95.***.***.***.7946 > 94.***.***.***.7946: UDP, length 49
    14:21:08.379558 IP (tos 0x0, ttl 64, id 25131, offset 0, flags [DF], proto TCP (6), length 60)
        95.***.***.***.52400 > 192.168.0.20.7946: Flags [S], cksum 0x272d (incorrect -> 0xef00), seq 603998839, win 64240, options [mss 1460,sackOK,TS val 2080923861 ecr 0,nop,wscale 7], length 0
    14:21:08.572575 IP (tos 0x0, ttl 64, id 32740, offset 0, flags [DF], proto UDP (17), length 115)
        95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87
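One pattern in this capture is easy to miss: the SYNs the manager sends to 192.168.0.20:7946 are retransmitted and no SYN-ACK ever comes back. A rough sketch of pulling that out of `tcpdump -v` text output is shown here. The regex and the `unanswered_syns` helper are assumptions tailored to the line format in this capture, not a general tcpdump parser.

```python
import re
from collections import Counter

# Matches the detail line of a TCP packet, e.g.
# "    a.b.c.d.52398 > 192.168.0.20.7946: Flags [S], cksum ..."
PACKET = re.compile(r"^\s*(\S+) > (\S+): Flags \[([^\]]+)\]")


def unanswered_syns(lines):
    """Count SYNs per (src, dst) flow that never received a SYN-ACK in the capture."""
    syns = Counter()
    answered = set()
    for line in lines:
        m = PACKET.match(line)
        if not m:
            continue  # timestamp/header lines and UDP packets are skipped
        src, dst, flags = m.groups()
        if flags == "S":
            syns[(src, dst)] += 1
        elif flags == "S.":
            answered.add((dst, src))  # the reply travels in the reverse direction
    return {flow: n for flow, n in syns.items() if flow not in answered}


sample = [
    "    95.***.***.***.52398 > 192.168.0.20.7946: Flags [S], cksum 0x272d, length 0",
    "    95.***.***.***.52398 > 192.168.0.20.7946: Flags [S], cksum 0x272d, length 0",
    "    95.***.***.***.7946 > 192.168.0.20.7946: UDP, length 87",
]
print(unanswered_syns(sample))
```

For the sample lines, the only flow reported is the manager's connection attempt to 192.168.0.20.7946, with a count of 2.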

To sum up: load balancing through the routing mesh works on the local network but doesn't work for remote workers/containers.


Solution

  • I made sure that needed ports are open;

    but...

    The tcpdump was taken on the manager side; the worker's tcpdump was completely empty for both ports.

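    This is a sign that the network is blocking the packets. Connections can be blocked at more than one place along the path, and opening the ports on the hosts alone is often not enough. Note also that the manager's capture shows it sending to 192.168.0.20, a private address that is not routable from outside the worker's local network, which would likely explain why nothing arrives on the worker. You'll need to find where the packets are being dropped by checking each hop on the network with the owners of those devices.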