
Is there a popular Linux/Unix format for binary diffs?


I'm going to be producing binary deltas of multi-gigabyte files.

Naively, I'm intending to use the following format:

#include <stdint.h>

struct chunk {
    uint64_t offset;
    uint64_t length;
    uint8_t data[];              // length bytes of data
};

struct delta {
    // The three checksums are calculated while the gzipped chunks
    // are being written, starting at the 96 octet offset.
    uint8_t file_a_checksum[32];
    uint8_t file_b_checksum[32];
    uint8_t chunks_checksum[32];
    uint8_t gzipped_chunks[];
};

I only need to apply these deltas to the original file_a that was used to generate a delta.

Is there anything I'm missing here?

Is there an existing binary delta format which has the features I'm looking for, yet isn't too much more complex?


Solution

  • For arbitrary binaries, of course it makes sense to use a general-purpose tool:

    (Yes, git diff works on files that aren't under version control: git diff --binary --no-index dir1/file.bin dir2/file.bin)

    I would usually recommend a generic tool before writing your own, even if there is a little overhead. While none of the tools in the above list produce binary diffs in a format quite as ubiquitous as the "unified diff" format, they are all "close to" standard tools.
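
    As a rough sketch of the git diff route (the directory and file names are made up for illustration), the following creates two small binary files and writes a binary patch for them. Note that git diff --no-index exits with status 1 when the files differ, so a non-zero status is expected here rather than an error:

    ```shell
    # Illustrative only: the temporary directory and file names are made up.
    work=$(mktemp -d)
    mkdir "$work/dir1" "$work/dir2"

    # Two small binary files that differ in two bytes.
    printf 'head\000\001\002tail' > "$work/dir1/file.bin"
    printf 'head\000\273\314tail' > "$work/dir2/file.bin"

    # --no-index compares two paths outside any repository; --binary emits
    # a "GIT binary patch" section rather than just reporting a difference.
    # Exit status 1 means "files differ", so capture it instead of failing.
    status=0
    ( cd "$work" && git diff --binary --no-index dir1/file.bin dir2/file.bin > delta.patch ) || status=$?

    echo "git diff exit status: $status"
    grep 'GIT binary patch' "$work/delta.patch"
    ```

    The resulting delta.patch is plain text, so it can be inspected or compressed like any other patch file.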

    There is one other fairly standardised format that might be relevant for you: the humble hexdump. The xxd tool dumps binaries into a fairly standard text format by default:

    0000050: 2020 2020 5858 4428 3129 0a0a 0a0a 4e08      XXD(1)....N.
    

    That is, offset followed by a series of byte values. The exact format is flexible and configurable with command-line switches.

    More usefully here, xxd also has a reverse mode (-r) that writes those bytes into a file instead of dumping them.

    So if you have a file called patch.hexdump:

    00000aa: bbccdd
    

    Then running xxd -r patch.hexdump my.binary will overwrite three bytes of my.binary, starting at offset 0xaa, with bb cc dd.
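
    To try this out, here is a minimal sketch that assumes xxd and od are installed; the 256-byte file of zeros is just a stand-in for a real binary:

    ```shell
    # Illustrative only: the temporary directory and test file are made up.
    work=$(mktemp -d)

    # A 256-byte file of zeros stands in for the binary to be patched.
    dd if=/dev/zero of="$work/my.binary" bs=256 count=1 2>/dev/null

    # The patch: a hexadecimal offset, a colon, then the replacement bytes.
    printf '00000aa: bbccdd\n' > "$work/patch.hexdump"

    # Reverse mode seeks to 0xaa and writes the bytes in place; xxd does not
    # truncate an existing output file in this mode.
    xxd -r "$work/patch.hexdump" "$work/my.binary"

    # Dump the three patched bytes (0xaa is 170 decimal) and the file size.
    od -An -tx1 -j170 -N3 "$work/my.binary"
    wc -c < "$work/my.binary"
    ```

    The dump should show the bytes bb cc dd, and the file should still be 256 bytes long.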

    Finally, I should also mention that dd can seek into a binary file and read/write a given number of bytes, so I guess you could use "shell script with dd commands" as your patch format.
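
    For what it's worth, one such dd "patch line" might look like the sketch below. The test file is made up, and the bytes are written as octal escapes because not every printf understands \x escapes:

    ```shell
    # Illustrative only: the temporary directory and test file are made up.
    work=$(mktemp -d)
    dd if=/dev/zero of="$work/my.binary" bs=256 count=1 2>/dev/null

    # Write bytes bb cc dd (octal 273 314 335) at offset 170 (0xaa).
    # bs=1 makes seek= count in single bytes; conv=notrunc stops dd from
    # truncating the output file, so the bytes after the patch are kept.
    printf '\273\314\335' | dd of="$work/my.binary" bs=1 seek=170 conv=notrunc 2>/dev/null

    # Dump the three patched bytes and the file size.
    od -An -tx1 -j170 -N3 "$work/my.binary"
    wc -c < "$work/my.binary"
    ```

    Forgetting conv=notrunc is the easy mistake here, since dd truncates the output file by default. With bs=1 this is also slow for large chunks, so it is better suited to small patches.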