We have a code generator that takes a random seed as input. If no seed is specified, it picks one at random, which means the output is not deterministic:
# generated_code1.h and generated_code2.h are almost always different
my-code-gen -o generated_code1.h
my-code-gen -o generated_code2.h
On the other hand,
# generated_code3.h and generated_code4.h are always the same
my-code-gen --seed 1234 -o generated_code3.h
my-code-gen --seed 1234 -o generated_code4.h
Our first attempt to create a target for the generated code was:
genrule(
    name = "generated_code",
    srcs = [],
    outs = ["generated_code.h"],
    cmd = "my-code-gen -o $@",  # Note that no seed is specified
)
However, we think this breaks the hermeticity of targets depending on :generated_code.
So we ended up implementing a custom rule that uses a build_setting (i.e. configuration) to configure the seed for the invocation of my-code-gen.
This makes it possible to specify the seed from the CLI for any target that depends on the generated code, e.g.
bazel build :generated_code --//:code-gen-seed=1234
bazel build :binary --//:code-gen-seed=1234
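For illustration, here is a minimal sketch of what such a rule might look like. The file name defs.bzl, the provider CodeGenSeedInfo, and the rule names code_gen_seed and generated_code are hypothetical, and the sketch assumes my-code-gen is available on the action's PATH, as the genrule above also does.

```starlark
# defs.bzl (hypothetical file name)

# Carries the seed from the build setting to rules that consume it.
CodeGenSeedInfo = provider(fields = ["seed"])

def _code_gen_seed_impl(ctx):
    # ctx.build_setting_value holds the flag value, or the default if unset.
    return CodeGenSeedInfo(seed = ctx.build_setting_value)

code_gen_seed = rule(
    implementation = _code_gen_seed_impl,
    build_setting = config.int(flag = True),
)

def _generated_code_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".h")
    seed = ctx.attr._seed[CodeGenSeedInfo].seed
    ctx.actions.run_shell(
        outputs = [out],
        # Assumption: my-code-gen is resolvable on the action's PATH.
        command = "my-code-gen --seed {} -o {}".format(seed, out.path),
    )
    return [DefaultInfo(files = depset([out]))]

generated_code = rule(
    implementation = _generated_code_impl,
    attrs = {
        # Implicit dependency on the build setting target.
        "_seed": attr.label(default = "//:code-gen-seed"),
    },
)

# BUILD (in the workspace root, so the setting's label is //:code-gen-seed)
# load(":defs.bzl", "code_gen_seed", "generated_code")
# code_gen_seed(name = "code-gen-seed", build_setting_default = 1234)
# generated_code(name = "generated_code")
```

Because the seed reaches the action through its command line, it is part of the action's inputs as far as Bazel is concerned, so the same seed should map to the same cached output.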
My questions are:
1. In the genrule definition above, my-code-gen is called without --seed, which results in non-deterministic output. Does that mean it is non-hermetic? What is the cost of breaking hermeticity (e.g. what trouble would it cause in the future)?
2. There is also --action_env as an alternative to build_setting, which also allows us to pass a seed value from the CLI to my-code-gen. Compared to build_setting, which is the preferred approach in our case?

Yes, it's non-hermetic. To be more precise, this is non-determinism, which is a symptom of a non-hermetic build, because the PRNG isn't seeded with a value that is statically known to the build system. Another common cause of non-determinism is embedding timestamps in build outputs.
Bazel defines hermeticity as:
When given the same input source code and product configuration, a hermetic build system always returns the same output by isolating the build from changes to the host system.
In order to isolate the build, hermetic builds are insensitive to libraries and other software installed on the local or remote host machine. They depend on specific versions of build tools, such as compilers, and dependencies, such as libraries. This makes the build process self-contained as it doesn't rely on services external to the build environment.
The biggest problem is breaking cacheability of everything that depends on the genrule, because you can no longer trust/guarantee that given a cache key (i.e. hashes of the genrule's inputs, command, environment), the output will be identical and reproducible across build invocations.
This has costs ranging from
The //:code-gen-seed build setting only affects targets that depend on it, whereas --action_env affects every action. A change to the build setting therefore invalidates only the minimal set of targets and causes minimal re-analysis, cache lookups, and rebuilds, so it is the preferred approach. You can experiment with this by comparing incremental build speeds as you add more targets that don't depend on //:code-gen-seed.
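For comparison, a sketch of the --action_env variant might look like the following. The variable name CODE_GEN_SEED and the fallback value are hypothetical, and the sketch assumes the genrule's shell sees variables set through --action_env.

```starlark
genrule(
    name = "generated_code",
    srcs = [],
    outs = ["generated_code.h"],
    # "$$" escapes to a literal "$" for the shell. CODE_GEN_SEED is a
    # hypothetical variable name; 1234 is used if it is unset.
    cmd = "my-code-gen --seed $${CODE_GEN_SEED:-1234} -o $@",
)

# Invocation:
#   bazel build :generated_code --action_env=CODE_GEN_SEED=1234
```

Here the seed lives in the action environment rather than in a target's configuration, so changing it touches the environment of every action in the build, not just the ones that run my-code-gen.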