How can I always pull my latest docker image but still deterministically record its composition for future reproducibility?

I'm doing analytical work inside a "Lab" docker environment which I manage. I use Travis to build, tag and publish the lab image to a docker container registry (AWS ECR), and I always pull the latest image when I start the container to do my analytical work. This ensures I'm always working inside the latest version of the Lab environment. Note: each time Travis publishes a new image, it tags it in ECR with both the git commit ID of the build and latest.

For reproducibility of my analytical results, I would like my python code running inside the container to be able to record in its outputs an identifier that indicates the exact docker image being used. This would enable me to re-download that particular docker image many months/years later from ECR and/or find the git commit from which the docker image was built, run the code again, and (hopefully!) get the same results.

What is the most standard way of achieving this? Can I perhaps store the image digest as an environment variable inside the container?

Solution

There are probably a couple of options, depending on how the image is built.

Assuming the source code is cloned in CI and the image is built from that checkout (so you're not cloning the source code in the Dockerfile), you can use a build-arg to "bake" the commit into the image as an environment variable.

In your Dockerfile, define a build-arg (ARG), and assign its value to an environment variable (ENV). Assigning it to an ENV is necessary because build-args are (by design) only available during the build; they are not set as environment variables in containers started from the image.

For example:

FROM busybox:latest
ARG GIT_COMMIT=HEAD
ENV GIT_COMMIT=${GIT_COMMIT}

I'm setting a default value so that the variable contains something "useful" if the Dockerfile is built without passing a build-arg.

Then, when building the image, pass the git commit as a build-arg:

git clone https://github.com/me/my-repo.git && cd my-repo

export GIT_COMMIT=$(git rev-parse --short --verify HEAD)

docker build -t lab:${GIT_COMMIT} --build-arg GIT_COMMIT=${GIT_COMMIT} .

When running the image, GIT_COMMIT is available as an environment variable.
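
Inside the container, the Python code can then read that variable and attach it to whatever it writes out. A minimal sketch, assuming the variable name GIT_COMMIT from the Dockerfile above; the function name get_lab_commit and the "unknown" fallback are illustrative choices, not part of the original setup:

```python
import os


def get_lab_commit(environ=os.environ):
    """Return the git commit baked into the image, or "unknown" if unset."""
    # GIT_COMMIT is expected to be set by the ENV instruction in the Dockerfile.
    # The "unknown" fallback is a hypothetical choice for runs outside the image.
    return environ.get("GIT_COMMIT", "unknown")


if __name__ == "__main__":
    print(f"lab image built from commit: {get_lab_commit()}")
```

Passing the environment in as a parameter keeps the function easy to test outside the container.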

If you'd rather pass a reference at runtime (when starting the container), you can set it as an environment variable on docker run; for example, to pass the digest of the image that you're running:

docker pull lab:latest

export IMAGE_DIGEST=$(docker inspect --format '{{ (index .RepoDigests 0) }}' lab:latest)

docker run -it --rm -e IMAGE_DIGEST=${IMAGE_DIGEST} lab:latest
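
To get the identifier into the analysis outputs, one option is to write a small provenance record next to the results. The following is only a sketch under assumptions: it expects the environment variables IMAGE_DIGEST and GIT_COMMIT used above, and the function names, the JSON layout and the idea of a separate provenance file are all hypothetical choices rather than an established convention:

```python
import json
import os
from datetime import datetime, timezone


def build_provenance(environ=os.environ):
    """Collect the image identifiers that were passed into the container."""
    return {
        # Set at runtime via `docker run -e IMAGE_DIGEST=...`; None if absent.
        "image_digest": environ.get("IMAGE_DIGEST"),
        # Set at build time via the ENV instruction, if that approach is used.
        "git_commit": environ.get("GIT_COMMIT"),
        # UTC timestamp of when this record was created.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_provenance(path, environ=os.environ):
    """Write the provenance record as JSON to `path` and return it."""
    record = build_provenance(environ)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
    return record


if __name__ == "__main__":
    # Print the record instead of writing a file; the output location is up to you.
    print(json.dumps(build_provenance(), indent=2))
```

The recorded digest has the form repository@sha256:..., and that string can be passed to docker pull later to fetch exactly the same image, regardless of where the latest tag points by then.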