linux linker elf

ELF, Build-ID, is there a utility to recompute it?


I came across a useful feature of ELF binaries: the Build ID. "It ... is (normally) the SHA1 hash over all code sections in the ELF image." One can read it with the GNU readelf utility:

$ readelf -n /bin/bash
...
Displaying notes found at file offset 0x00000274 with length 0x00000024:
  Owner                 Data size   Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 54967822da027467f21e65a1eac7576dec7dd821
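
For anyone who wants to read the same note without readelf, here is a minimal Python sketch. It only reads the stored value; it does not recompute anything. The names parse_notes and read_build_id are made up for this example and do not come from any library. The sketch assumes the note lives in a PT_NOTE segment (so it will not work on files without program headers, such as .o files) and that note fields are padded to 4 bytes.

```python
import struct

PT_NOTE = 4            # program header type for note segments
NT_GNU_BUILD_ID = 3    # note type used for the GNU build ID


def parse_notes(blob, endian="<"):
    """Yield (name, type, desc) for each note record in a note segment."""
    pos = 0
    while pos + 12 <= len(blob):
        namesz, descsz, ntype = struct.unpack_from(endian + "III", blob, pos)
        pos += 12
        name = blob[pos:pos + namesz].rstrip(b"\0")
        pos += (namesz + 3) & ~3      # assumption: name padded to 4 bytes
        desc = blob[pos:pos + descsz]
        pos += (descsz + 3) & ~3      # assumption: descriptor padded to 4 bytes
        yield name, ntype, desc


def read_build_id(data):
    """Return the build ID stored in an ELF image as a hex string, or None."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"    # EI_DATA: 1 = little-endian
    if is64:
        (phoff,) = struct.unpack_from(endian + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (phoff,) = struct.unpack_from(endian + "I", data, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 42)
    for i in range(phnum):
        base = phoff + i * phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, base)
        if p_type != PT_NOTE:
            continue
        if is64:
            (offset,) = struct.unpack_from(endian + "Q", data, base + 8)
            (filesz,) = struct.unpack_from(endian + "Q", data, base + 32)
        else:
            (offset,) = struct.unpack_from(endian + "I", data, base + 4)
            (filesz,) = struct.unpack_from(endian + "I", data, base + 16)
        segment = data[offset:offset + filesz]
        for name, ntype, desc in parse_notes(segment, endian):
            if name == b"GNU" and ntype == NT_GNU_BUILD_ID:
                return desc.hex()
    return None
```

Calling read_build_id(open("/bin/bash", "rb").read()) should give the same hex string that readelf -n prints, provided the assumptions above hold for that binary.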

Is there an easy way to recompute the Build ID yourself, for example to check that the file isn't corrupted?


Solution

  • I got an answer from Mark, and since it is up-to-date information, I am posting it here. Basically, you are all right: there is no tool for recomputing the Build-ID. The Build-ID is not meant to (1) identify the file contents, nor even to (2) identify the executable (code) part of the file. It is meant to (3) capture the "semantic meaning" of a build, which is the part that is hard to formalize. (The numbers are for reference.)

    Quote from the email:

    -- "Is there a user tool recomputing the build-id from the file itself, to check if it's not corrupted/compromised somehow etc?" If you have time, maybe you could post an answer there?

    Sorry, I don't have a stackoverflow account. But the answer is: No, there is no such tool because the precise way a build-id is calculated isn't specified. It just has to be universally unique. Even the precise length of the build-id isn't specified. There are various ways using different hashing algorithms a build-id could be calculated to get a universally unique value. And not all data might (still be) in the ELF file to recalculate it even if you knew how it was created originally.

    Apparently, the intentions of Build-ID changed since the Fedora Feature page was written about it. And people's opinions diverge on what it is now. Maybe in your answer you could include status of Build-ID and what it is now as well?

    I think things weren't very precisely formulated. If a tool changes the build that creates the ELF file so that it isn't a "semantically identical" binary anymore then it should get a new (recalculated) build-id. But if a tool changes something about the file that still results in a "semantically identical" binary then the build-id stays the same.

    What isn't precisely defined is what "semantically identical binary" means. The intention is that it captures everything that a build was made from. So if the source files used to generate a binary are different then you expect different build-ids, even if the binary code produced might happen to be the same.

    This is why when calculating the build-id of a file through a hash algorithm you use not just the (allocated) code sections, but also the debuginfo sections (which will contain references to the source file names).

    But if you then for example strip the debuginfo out (and put it into a separate file) then that doesn't change the build-id (the file was still created from the same build).

    This is also why, even if you knew the precise hashing algorithm used to calculate the build-id, you might not be able to recalculate the build-id. Because you might be missing some of the original data used in the hashing algorithm to calculate the build-id.

    Feel free to share this answer with others.

    Cheers,

    Mark

    Also, for people interested in debuginfo (Linux performance and tracing, anyone?), he mentioned a couple of projects for managing it on Fedora: