After some change to the Docker build scripts used in Jenkins pipelines, and/or a Docker upgrade, I started experiencing serious image data leaks. They manifest as an unstable, constantly growing size of the image data directory `/var/lib/docker/overlay2`, which is always larger than the total size of the images reachable and cleanly removable through the Docker CLI (as reported by `docker system df`). On a busy build server this means roughly 1 TB of image data leaking per day, which cannot be pruned with clean Docker CLI methods (such as `docker system prune`) and has to be removed manually (by deleting the entire `/var/lib/docker`).
How can I debug and find the cause of such a data leak in a scalable way that is applicable in general, not just in this particular case [1]?
I'm talking about situations such as this (immediately after `docker system prune -af`):
$ sudo du -sch /var/lib/docker/overlay2; echo ""; docker system df
68G /var/lib/docker/overlay2
68G total
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
Nearly all of these unreachable leaked leftovers sit in "diff" subfolders, but that alone is not specific enough to distinguish them from useful layers, which are stored in such subfolders as well (a rough cross-referencing sketch follows below):
$ find | grep diff | wc -l   # run from inside /var/lib/docker/overlay2
989599
# vs.
$ find | grep -v diff | wc -l
792
The contents of these leaked folders also look quite unsuspicious (files of all kinds, both compiled from source and shipped as pre-compiled libraries):
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/_weight_boosting.py
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/_gradient_boosting.cpython-311-x86_64-linux-gnu.so
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/tests/test_gradient_boosting.py
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/tests/__pycache__/test_gradient_boosting.cpython-311.pyc
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/tests/__pycache__/test_weight_boosting.cpython-311.pyc
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/sklearn/ensemble/tests/test_weight_boosting.py
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/scipy/special/tests/data/boost.npz
./268713830f68824c1f6f5b21aad89055c18e8c0e6f1751654ec82ca4caf5bba9/diff/opt/conda/lib/python3.11/site-packages/scipy/special/tests/test_boost_ufuncs.py
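One rough way to tell the orphaned directories apart (a sketch only; it assumes the default data root `/var/lib/docker` and the `overlay2` storage driver, and it will also flag BuildKit cache snapshots, which live under `overlay2` but are only listed by `docker buildx du`) is to cross-reference what is on disk against the layer paths Docker itself still references:
OVERLAY=/var/lib/docker/overlay2
# layer directories Docker still references, taken from image and container metadata
referenced=$( { docker image ls -aq; docker ps -aq; } \
  | xargs -r docker inspect --format '{{ json .GraphDriver.Data }}' 2>/dev/null \
  | grep -oE "$OVERLAY/[A-Za-z0-9_-]+" | sort -u || true )
# everything on disk (except the 'l' symlink directory) minus the referenced set = leak candidates
comm -23 \
  <(sudo find "$OVERLAY" -mindepth 1 -maxdepth 1 -type d ! -name l | sort) \
  <(printf '%s\n' "$referenced" | sort)
This at least lists candidates for deletion, but it does not explain why they were orphaned in the first place, which is what I am really after.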
I also upgraded from `docker build` to `docker buildx` (the Moby BuildKit builder toolkit), and at least after a few hours of test builds of one large container the situation looked more promising: `/var/lib/docker/overlay2` stayed stable and never grew beyond the size needed for the images, as indicated by `docker system df`. However, because the storage could not be reclaimed after the containers were deleted (and `docker system prune` was run), over time the builds will still eat all available space; it is only a question of when, not if:
$ sudo du -sch /var/lib/docker/overlay2; echo ; docker system df
build-srv-3-fi: Sun Jul 28 14:43:42 2024
117G /var/lib/docker/overlay2
117G total
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
(this was under `docker buildx` version v0.15.1 1c1dbb2)
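For the BuildKit side specifically, the builder keeps its own accounting that `docker system df` only summarizes, so comparing its view with the on-disk size gives another data point:
$ docker buildx du --verbose               # what BuildKit thinks it stores, record by record
$ docker builder prune -af                 # cache-only prune
$ sudo du -sh /var/lib/docker/overlay2     # vs. what is actually left on disk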
This issue does not seem to depend on the method used for cleaning up the data (e.g. removing images first and then the build cache works no better than the all-in-one `docker system prune`).
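By "removing images first and then the build cache" I mean roughly this sequence (the exact flags do not matter; the leftovers in `/var/lib/docker/overlay2` are the same either way):
$ docker image prune -af
$ docker builder prune -af
$ sudo du -sh /var/lib/docker/overlay2; docker system df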
[1] i.e. not restricted to my rather complex setup: a dockerized Jenkins running in Docker Compose, building dozens of pipelines in parallel with partially overlapping image layers and/or interconnected (multi-stage) containers, and requiring 4 additional service containers (such as Jenkins) to build each "business" container.
Edit: the `--one-file-system` switch of `du`, which some recommended, made no difference, and it shows the disk usage problem is real, not just an artifact of double-counting caused by the layering file system:
$ sudo du -sch /var/lib/docker/overlay2; echo ""; sudo du -sch --one-file-system /var/lib/docker/overlay2; echo ""; docker system df
452G /var/lib/docker/overlay2
452G total
452G /var/lib/docker/overlay2
452G total
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
Edit: a more extreme storage usage scenario, where `du` no longer finishes in reasonable time on `/var/lib/docker/overlay2`, presumably due to the sheer number of files and directories:
$ df -haT
Filesystem Type Size Used Avail Use% Mounted on
[..]
/dev/md2 ext4 3.4T 3.2T 5.3G 100% /
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 333 0 2.873TB 2.873TB (100%)
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 2532 0 55.14GB 55.14GB
$ docker system prune -af
Total reclaimed space: 812.8GB
$ docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 0 0 0B 0B
Build Cache 0 0 0B 0B
...vs. the reality: the space that `docker system prune` claimed to have reclaimed was overstated by a factor of two (only about 0.4 TB was actually freed, not the reported 0.8 TB), and several times more leaked image data, unreachable to Docker, still sits in `/var/lib/docker/overlay2`. Notice also that the reclaimable-space estimate from `docker system df` turned out to be off by some 2.5 TB when confronted with reality. Only the build cache limit from the `docker buildx` settings (set at 50 GB) was more or less respected; it was exceeded by just about 10% (see the configuration sketch below).
$ df -haT
Filesystem Type Size Used Avail Use% Mounted on
[..]
/dev/md2 ext4 3.4T 2.8T 400G 88% /
# restarting the docker daemon and socket does not help either:
$ sudo systemctl restart docker docker.socket
$ df -haT
Filesystem Type Size Used Avail Use% Mounted on
[..]
/dev/md2 ext4 3.4T 2.8T 400G 88% /
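For reference, the 50 GB build cache limit mentioned above is just the standard BuildKit garbage-collection setting for the default builder; it is configured roughly like this in `/etc/docker/daemon.json` (a sketch; the value is the only site-specific part) and needs a daemon restart to take effect:
$ cat /etc/docker/daemon.json
{
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "50GB"
    }
  }
}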
The issue of `docker system prune -af` not being able to clean up all container image data seems to have been resolved by Docker Engine v27.2.1. I looked through the release notes but could not identify any relevant entries (related to pruning) up to this version.
# before:
$ df -haT
Filesystem Type Size Used Avail Use% Mounted on
[..]
/dev/md2 ext4 3.4T 3.2T 20G 100% /
[..]
# the cleanup (taking just as long as `rm -rf /var/lib/docker` would)
$ docker system prune -af
[..]
# results:
$ sudo du -sch /var/lib/docker/overlay2; echo ""; sudo du -sch --one-file-system /var/lib/docker/overlay2; echo ""; docker system df
1.1M /var/lib/docker/overlay2
1.1M total
1.1M /var/lib/docker/overlay2
1.1M total
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 0 0 0B 0B
Containers 0 0 0B 0B
Local Volumes 1 0 914.2kB 914.2kB (100%)
Build Cache 0 0 0B 0B
# after:
$ df -haT
Filesystem Type Size Used Avail Use% Mounted on
[..]
/dev/md2 ext4 3.4T 158G 3.0T 5% /
[..]
# versions:
$ docker version
Client: Docker Engine - Community
Version: 27.2.1
API version: 1.47
Go version: go1.22.7
Git commit: 9e34c9b
Built: Fri Sep 6 12:08:10 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.2.1
API version: 1.47 (minimum version 1.24)
Go version: go1.22.7
Git commit: 8b539b8
Built: Fri Sep 6 12:08:10 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.22
GitCommit: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
runc:
Version: 1.1.14
GitCommit: v1.1.14-0-g2c9f560
docker-init:
Version: 0.19.0
GitCommit: de40ad0