concourse

Concourse Worker on another server loses connection to Concourse Web


We have a Concourse Web container and a Concourse Worker container running on Server A (212.77.7.255; the real IP is concealed). We use the latest Concourse version, 7.8.1.

As we ran out of Worker resources, we added another Concourse Worker container running on Server B. The Worker on Server B ran fine for about five days, but all of a sudden it can no longer connect to Concourse Web on Server A.

The logs of the Worker on Server B say:

{
    "timestamp": "2022-07-12T11:15:59.542985762Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543044656Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4.2"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543060804Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.543068953Z",
    "level": "error",
    "source": "worker",
    "message": "worker.container-sweeper.tick.failed-to-get-containers-to-destroy",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "6.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554118751Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554164844Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4.3"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554172593Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:15:59.554179789Z",
    "level": "error",
    "source": "worker",
    "message": "worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "7.4"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580220012Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.failed-to-connect-to-tsa",
    "data": {
        "error": "dial tcp 212.77.7.255:2222: i/o timeout",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580284659Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1.10"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580335377Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.failed-to-dial",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580359868Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.exited-with-error",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580372552Z",
    "level": "debug",
    "source": "worker",
    "message": "worker.beacon-runner.beacon.done",
    "data": {
        "session": "4.1"
    }
}{
    "timestamp": "2022-07-12T11:16:04.580394879Z",
    "level": "error",
    "source": "worker",
    "message": "worker.beacon-runner.failed",
    "data": {
        "error": "all worker SSH gateways unreachable",
        "session": "4"
    }
}

The logs on Concourse Web on Server A show no entries of the Worker on Server B trying to connect. On Server B I'm able to connect to Concourse Web on Server A:

$ nc 212.77.7.255 2222
SSH-2.0-Go
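
A check like the one above, when run on the host, only exercises the host's network stack, while the Worker dials the TSA from inside its container. A minimal sketch for repeating the same TCP check with a timeout, assuming `bash` and coreutils `timeout` are available wherever it runs (`tcp_check` is a hypothetical helper name, not part of Concourse):

```shell
# Hypothetical helper: attempt a plain TCP connect to host:port.
# Usage: tcp_check HOST PORT [TIMEOUT_SECONDS]   (timeout defaults to 5)
tcp_check() {
    host="$1"; port="$2"; secs="${3:-5}"
    # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
    # timeout aborts the attempt if the connect hangs (an "i/o timeout").
    if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "reachable: $host:$port"
    else
        echo "unreachable: $host:$port"
        return 1
    fi
}
```

On the host this would be called as `tcp_check 212.77.7.255 2222`. To compare against the container, the same connect can be attempted through `docker exec` with the Worker container's name, provided the image ships `bash` and `timeout` (an assumption). If the host reports reachable and the container does not, the problem sits in the container's network rather than on Server A.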

We had this problem before, but we solved it by upgrading Concourse to the latest version 7.8.1. Now I'm running out of ideas for how to debug this. What I've tried:

Nothing helped. What can I do to debug this further and make the Worker on Server B connect again?


Solution

  • We couldn't find out why the Docker network did not allow connections to Server A. As connections from the host machine were going through, we told Docker to use the host network:

    services:
      concourse-worker:
        ...
        network_mode: host
        ...
    

    This solved the issue. Not a pretty workaround, as the Docker container should have its own separate network, but as there is nothing else running on this server it's fine.
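
    After recreating the container, one way to confirm the setting was picked up is to ask Docker which network mode it recorded. A small sketch, assuming the container is named `concourse-worker` (the actual name depends on the Compose project; `worker_network_mode` is a hypothetical helper):

```shell
# Print the network mode Docker recorded for a container.
# "$1" is the container name or ID (assumed here: concourse-worker).
worker_network_mode() {
    docker inspect --format '{{.HostConfig.NetworkMode}}' "$1"
}
```

    Calling `worker_network_mode concourse-worker` should print `host` once the container runs on the host network; any other value means the container is still attached to a Docker-managed network.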