When using GitLab Auto DevOps to build and deploy applications from my repositories to microk8s, the build jobs often take a long time to run and eventually time out. The failure occurs in roughly 99% of builds, though some do run through, and the build often stops at a different point in the build script each time.
The projects do not contain a `.gitlab-ci.yml` file and fully rely on the Auto DevOps feature to do its magic.
For Spring Boot/Java projects, the build often fails while downloading Gradle via the Gradle wrapper; other times it fails while downloading the dependencies themselves. The error message is very vague and not helpful at all:
```
Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in e9ec110c0dfe
-----> Gradle app detected
-----> Spring Boot detected
The command '/bin/sh -c /bin/herokuish buildpack build' returned a non-zero code: 35
```
Sometimes, if you get lucky, the error is different:
```
Step 5/11 : RUN /bin/herokuish buildpack build
 ---> Running in fe284971a79c
-----> Gradle app detected
-----> Spring Boot detected
-----> Installing JDK 11... done
-----> Building Gradle app...
-----> executing ./gradlew build -x check
       Downloading https://services.gradle.org/distributions/gradle-7.0-bin.zip
       ..........10%...........20%...........30%..........40%...........50%...........60%...........70%..........80%...........90%...........100%
       To honour the JVM settings for this build a single-use Daemon process will be forked. See https://docs.gradle.org/7.0/userguide/gradle_daemon.html#sec:disabling_the_daemon.
       Daemon will be stopped at the end of the build
       > Task :compileJava
       > Task :compileJava FAILED

       FAILURE: Build failed with an exception.

       * What went wrong:
       Execution failed for task ':compileJava'.
       > Could not download netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar (io.netty:netty-resolver-dns-native-macos:4.1.65.Final)
          > Could not get resource 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
             > Could not GET 'https://repo.maven.apache.org/maven2/io/netty/netty-resolver-dns-native-macos/4.1.65.Final/netty-resolver-dns-native-macos-4.1.65.Final-osx-x86_64.jar'.
                > Read timed out
```
For React/TypeScript projects, the symptoms are similar but the error itself manifests in a different way:
```
[INFO] Using npm v8.1.0 from package.json
/cnb/buildpacks/heroku_nodejs-npm/0.4.4/lib/build.sh: line 179: /layers/heroku_nodejs-engine/toolbox/bin/yj: Permission denied
ERROR: failed to build: exit status 126
ERROR: failed to build: executing lifecycle: failed with status code: 145
```
The problem seems to occur mostly when the GitLab runners themselves are deployed in Kubernetes. microk8s uses Project Calico to implement virtual networks.
What gives? Why are the error messages so unhelpful? Is there a way to turn up verbose build logs or debug the build steps?
This seems to be a networking problem caused by incompatible MTU settings between the Calico network layer and Docker's network configuration (and an inability to auto-configure the MTU correctly?). When the MTU values don't match, network packets get fragmented and the Docker runners fail to complete TLS handshakes. As far as I understand, this only affects DIND (Docker-in-Docker) runners.
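One way to confirm fragmentation trouble is to probe the path MTU from inside the affected container. This is a hedged sketch: it assumes Linux iputils `ping` with its `-M do` (don't-fragment) flag, and the payload sizes are just illustrative values (payload + 28 bytes of ICMP/IP headers = packet size):

```shell
#!/bin/sh
# Probe the path MTU towards a host the build needs to reach.
# With -M do the kernel refuses to fragment, so an oversized packet
# fails instead of being silently split.
HOST=repo.maven.apache.org

probe() {
  for size in 1472 1412 1212; do
    if ping -c 1 -W 2 -M do -s "$size" "$1" >/dev/null 2>&1; then
      echo "payload $size fits (path MTU >= $((size + 28)))"
    else
      echo "payload $size blocked: fragmentation needed or no reply"
    fi
  done
}

probe "$HOST"
```

If the largest payload that succeeds implies a path MTU below what the DIND container's interface is configured with, you have found the mismatch.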
Even finding this out requires jumping through a few hoops. You have to:

1. `kubectl exec` into the current/active GitLab runner pod.
2. Find the `DOCKER_HOST` environment variable (e.g. by grepping through `/proc/$pid/environ`). Very likely, this will be `tcp://localhost:2375`.
3. Set the variable for the `docker` client: `export DOCKER_HOST=tcp://localhost:2375`
4. Run `docker ps` and then `docker exec` into the actual CI job container.
5. Execute `microk8s kubectl get -n kube-system cm calico-config -o yaml` and look for the `veth_mtu` value, which will very likely be set to `1440`. DIND uses the same MTU and thus fails to send or receive certain network packets (each virtual network needs to add its own header to the network packet, which adds a few bytes at every layer).
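Once you are inside a container (or on the host), you can compare the MTUs of the layers directly. A minimal sketch, run in each network namespace you care about (runner pod, DIND service container, host):

```shell
#!/bin/sh
# Print the MTU of every network interface in the current namespace.
# Run inside the DIND job container: if its eth0 MTU is >= the Calico
# veth MTU one layer below, packets from the container cannot pass
# un-fragmented, which is exactly the failure mode described above.
for iface in /sys/class/net/*; do
  printf '%-12s mtu %s\n' "$(basename "$iface")" "$(cat "$iface/mtu")"
done
```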
The naïve fix would be to change the Calico setting to a higher or lower value, but somehow this did not really work, even after restarting the Calico deployment. Furthermore, the value seems to be reset to its original value from time to time, probably by automatic updates to microk8s (which ships as a snap).
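For reference, the naïve edit amounts to changing a single key in the ConfigMap. The excerpt below is illustrative, not a complete `calico-config`; only the `veth_mtu` key matters here:

```yaml
# Illustrative excerpt of the calico-config ConfigMap after the naïve edit.
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-config
  namespace: kube-system
data:
  veth_mtu: "1240"  # microk8s snap updates tend to reset this value
```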
So what is a solution that actually works and is permanent? It is possible to override the DIND settings for Auto DevOps by writing a custom `.gitlab-ci.yml` file and simply including the Auto DevOps template:
```yaml
build:
  services:
    - name: docker:20.10.6-dind # make sure to update version
      command: ['--tls=false', '--host=tcp://0.0.0.0:2375', '--mtu=1240']

include:
  - template: Auto-DevOps.gitlab-ci.yml
```
The `build.services` definition is copied from the `Jobs/Build.gitlab-ci.yml` template and extended with an additional `--mtu` option.
I've had good experience so far by setting the DIND MTU to 1240, which is 200 bytes lower than Calico's MTU. As an added bonus, it doesn't affect any other pods' network settings. And for CI builds I can live with non-optimal network settings.
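The MTU budget behind these numbers can be sketched as simple arithmetic. The overhead figures are assumptions for illustration: Calico's default of 1440 leaves room for up to 60 bytes of encapsulation below a 1500-byte host MTU, and the extra 200-byte margin for DIND is simply a value that worked in practice, not a computed minimum:

```shell
#!/bin/sh
# MTU budget sketch (assumed values, see lead-in above).
HOST_MTU=1500
ENCAP_OVERHEAD=60                        # encapsulation headroom Calico budgets for
CALICO_MTU=$((HOST_MTU - ENCAP_OVERHEAD))
MARGIN=200                               # generous safety margin for DIND's own layer
DIND_MTU=$((CALICO_MTU - MARGIN))

echo "calico veth_mtu: $CALICO_MTU"
echo "dind --mtu:      $DIND_MTU"
```

Each virtual network layer must use a strictly smaller MTU than the layer beneath it, so any DIND value below the Calico MTU would do; 1240 just leaves comfortable headroom.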
**Update:** The symptoms started showing again: slow pipelines with a high failure rate (>80%), sometimes with network timeouts and sometimes with seemingly random errors.
If you run microk8s 1.24+ with Calico 3.21+, make sure to set `veth_mtu` in the Calico config map to `"0"`. If you have upgraded from an earlier version, chances are high that the configmap still sets it to a non-zero value such as `"1440"`.
Check your current values with:

```shell
kubectl -n kube-system get -oyaml daemonset.apps/calico-node | grep 'image:'
kubectl -n kube-system get -oyaml configmap/calico-config | grep 'veth_mtu:'
```
Setting it to `0` seems to properly auto-detect the correct MTU value. The workaround of manually specifying a lower MTU in the `.gitlab-ci.yml` file is no longer required, and the manual `services:` override can be removed.
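Applying the change on microk8s can be sketched as two commands. This is a hedged sketch: it assumes the stock `calico-config`/`calico-node` resource names shown above and must be run on the cluster node, so it guards on `microk8s` being present:

```shell
#!/bin/sh
# Set veth_mtu to "0" (auto-detect) and restart the calico-node pods
# so the new value takes effect on each node's veth interfaces.
apply_fix() {
  if command -v microk8s >/dev/null 2>&1; then
    microk8s kubectl -n kube-system patch configmap/calico-config \
      --type merge -p '{"data":{"veth_mtu":"0"}}'
    microk8s kubectl -n kube-system rollout restart daemonset/calico-node
  else
    echo "microk8s not found: run this on the cluster node"
  fi
}

apply_fix
```

Since snap updates have been observed to reset the configmap, it is worth re-checking `veth_mtu` after a microk8s upgrade.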