docker docker-layer

Can I obtain the Docker layer history on non-final stage Docker builds?


I'm working out a way to do Docker layer caching in CircleCI, and I've got a working solution. However, I am trying to improve it. The problem in any form of CI is that the image history is wiped for every build, so one needs to work out what files to restore, using the CI system's caching directives, and then what to load back into Docker.

First I tried this, inspired by this approach on Travis. To restore:

if [ -f /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz ]; then gunzip -c /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz | docker load; docker images; fi

And to create:

docker save $(docker history -q ${CIRCLE_PROJECT_REPONAME}:latest | grep -v '<missing>') | gzip > /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz

This seemed to work OK, but my Dockerfile uses a two-stage build, and as soon as I COPYed files from the first stage to the final one, the build stopped using the cache. I assume this is because (a) docker history only covers the final stage, and (b) the non-cached changes in the first build stage have a new mtime, and so when they are copied to the final stage, they are regarded as new.

To get around this problem, I decided to try saving all images to the cache:

docker save $(docker images -a -q) | gzip > /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz

This worked! However, it has a new problem: when I modify my Dockerfile, the old image cache will be loaded, new images will be added, and then everything will be stored in the cache. This will accumulate dead layers I will never need again, presumably until the CI provider's cache size limits are hit.

I think this can be fixed by caching all the stages of the build, but I am not sure how to reference the first stage. Is there a command I can run, similar to docker history -q -a, that will give me the hashes either for all non-last stages (since I can do the last one already) or for all stages including the last stage?

I was hoping docker build -q might do that, but it only prints the final hash, not all intermediate hashes.

Update

I have an inelegant solution, which does work, but there is surely a better way than this! I search the output of docker build for --->, which is Docker's way of announcing layer hashes and cache information. I strip out cache messages and arrows, leaving just the complete build layer hash list for all build stages:

docker build -t imagename . | grep '\-\-\->' | grep -v 'Using cache' | sed -e 's/[ >-]//g'

(I actually do the build twice - once for the build CI step proper, and a second time to gather the hashes. I could do it just once, but it feels nice to have the actual build in a separate step. The second build will always be cached, and will only take a few seconds to run).
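As a sketch of a slightly stricter filter: the sed expression below keeps only lines where the arrow is followed by a bare 12-character hex ID, so the `---> Using cache` and `---> Running in ...` lines drop out without a separate grep. It assumes the output format of the classic (non-BuildKit) builder. `extract_layer_ids` and `sample_log` are hypothetical helper names, and the sample log lines are illustrative rather than captured from a real build.

```shell
#!/bin/sh
# Hypothetical helper: read classic-builder `docker build` output on stdin
# and print one layer ID per line.
extract_layer_ids() {
    # -n suppresses default output; only lines of the exact form
    # " ---> <12 hex chars>" are rewritten to the bare ID and printed.
    sed -n 's/^ *---> \([0-9a-f]\{12\}\)$/\1/p'
}

# Illustrative input only (made-up IDs in the classic builder's layout).
sample_log() {
    printf '%s\n' \
        'Step 1/2 : FROM alpine' \
        ' ---> a24bb4013296' \
        'Step 2/2 : RUN true' \
        ' ---> Using cache' \
        ' ---> 5f0e1c3d9b7a' \
        'Successfully built 5f0e1c3d9b7a'
}

sample_log | extract_layer_ids
```

In the CI step the helper would sit in the same place as the grep/sed chain, for example `docker build -t imagename . | extract_layer_ids`, with the resulting IDs passed to `docker save` as before.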

Can this be improved upon, perhaps using Docker commands?


Solution

  • This is a summary of a conversation in the comments.

    One option is to push all build stages to a remote. If there are two build stages, with the first one being named build and the second one unnamed, then one can do this:

    docker build --target build --tag image-name-build .
    docker build --tag image-name .
    

    One can then push image-name (the final build artifact) and image-name-build (the first stage, which is normally thrown away) to a remote registry.

    When rebuilding images, one can pull both of these onto the fresh CI build machine, and then do:

    docker build --cache-from image-name-build --target build --tag image-name-build .
    docker build --cache-from image-name --tag image-name .
    

    As BMitch says, the --cache-from flag tells Docker that the named images can be trusted as a source for the local layer cache.

    Comparison

    The temporary solution in the question is good if you have a CI-native cache system to store files in, and you would rather not clutter up your registry with intermediate build stage images that are normally thrown away.

    The --cache-from solution is nice because it is tidier, and uses Docker-native features rather than having to grep build output. It will also be very useful if your CI solution does not provide a file caching system, since it uses a remote registry instead.
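The --cache-from workflow above can be sketched as a single CI script. This is a rough sketch under stated assumptions, not a tested pipeline: `registry.example.com/me` is a placeholder registry prefix, `run`, `ci_build` and `DRY_RUN` are hypothetical helpers for previewing the commands, and the `|| true` on each pull assumes the very first build has nothing to pull yet. Passing both images as cache sources to the final build is also an assumption on my part (the flag may be repeated), made so that the first stage can hit the cache in that build too. The sketch targets the classic builder; BuildKit, as far as I know, only reuses pulled images as cache when they were built with inline cache metadata.

```shell
#!/bin/sh
# Placeholder names; substitute a real registry and image.
REGISTRY="${REGISTRY:-registry.example.com/me}"
IMAGE="${IMAGE:-image-name}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ci_build() {
    final="${REGISTRY}/${IMAGE}"
    stage="${REGISTRY}/${IMAGE}-build"

    # Pull the previously pushed images; tolerate failure on the first build.
    run docker pull "$stage" || true
    run docker pull "$final" || true

    # Rebuild the first stage, then the final image, each trusting the
    # pulled images as a layer cache.
    run docker build --cache-from "$stage" --target build --tag "$stage" . || return 1
    run docker build --cache-from "$stage" --cache-from "$final" --tag "$final" . || return 1

    # Push both so the next CI run can pull them again.
    run docker push "$stage" || return 1
    run docker push "$final" || return 1
}
```

Running `DRY_RUN=1 ci_build` prints the six docker commands without executing them, which is a cheap way to check the image names before pointing the script at a real registry.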