We make use of Github Self-Hosted action runners running on EC2 machines (m5.xlarge). We use these as part of our CI/CD pipeline to support docker image builds and automated testing. This solution has worked fine for the last year or so, but all of a sudden yesterday, the builds started to fail with the following error message :
time="2023-02-03T12:00:13Z" level=error msg="error waiting for container: unexpected EOF"
My understanding of this is that it is typically due to docker containers running out of resources (CPU / Memory Limit) being hit but given that these are m5.xlarges (4 vCPU and 16GB Memory) I'm a little surprised. Our builds make use of NPM which I understand can be quite resource hungry but monitoring a container during its execution showed that it was nowhere near the limits of the node:
I've tried to cycle the nodes but there is no difference in behaviour. The following user-data script is used with these nodes which connects it to our Github account and makes it available for jobs. I've also tried using the latest actions-runneer package, but again, no change in behaviour. What other reasons could this error be thrown for as i'm a bit stumped by this.
#!/bin/sh
set -e
curl https://get.docker.com | bash
apt install -y python3-pip jq
pip3 install awscli
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.286.0/actions-runner-linux-x64-2.286.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.286.0.tar.gz
chown -R ubuntu:ubuntu .
instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
url="https://api.github.com/orgs/<REMOVED>/actions/runners/registration-token"
token=$(curl -s -u "<REMOVED>:<REMOVED>" -X POST "$url" | jq -r .token)
sudo -u ubuntu ./config.sh \
--name "products-stage-ec2-runner-$instance_id" \
--token "$token" \
--url "https://github.com/<REMOVED>" \
--labels "<REMOVED>" \
--unattended
sudo ./svc.sh install
sudo ./svc.sh start
See details on my comment of how I resolved this.