I'm going to be producing binary deltas of multi-gigabyte files.
Naively, I'm intending to use the following format:
struct chunk {
uint64_t offset;
uint64_t length;
uint8_t data[];
};
struct delta {
uint8_t file_a_checksum[32]; // These are calculated while the
uint8_t file_b_checksum[32]; // gzipped chunks are being written
uint8_t chunks_checksum[32]; // at the 96 octet offset.
uint8_t gzipped_chunks[];
};
I only need to apply these deltas to the original file_a
that was used to generate a delta.
Is there anything I'm missing here?
Is there an existing binary delta format which has the features I'm looking for, yet isn't too much more complex?
For arbitrary binaries, of course it makes sense to use a general purpose tool:
(Yes, git diff
works on files that aren't under version control. git diff --binary --no-index dir1/file.bin dir2/file.bin
)
I would usually recommend a generic tool before writing your own, even if there is a little overhead. While none of the tools in the above list produce binary diffs in a format quite as ubiquitous as the "unified diff" format, they are all "close to" standard tools.
There is one other fairly standardised format that might be relevant for you: the humble hexdump. The xxd
tool dumps binaries into a fairly standard text format by default:
0000050: 2020 2020 5858 4428 3129 0a0a 0a0a 4e08 XXD(1)....N.
That is, offset followed by a series of byte values. The exact format is flexible and configurable with command-line switches.
However, xxd
can also be used in reverse mode to write those bytes instead of dumping them.
So if you have a file called patch.hexdump
:
00000aa: bbccdd
Then running xxd -r patch.hexdump my.binary
will modify the file my.binary
to modify three bytes at offset 0xaa
.
Finally, I should also mention that dd
can seek into a binary file and read/write a given number of bytes, so I guess you could use "shell script with dd
commands" as your patch format.