algorithmnetwork-programmingdiffbinary-diff

Binary diff algorithm for small binary blobs


There is a lot of information about binary diff algorithms for pretty big files (1MB+). However, my use case is different. This is why it is not a duplicate.

I have a collection of many objects, each in 5-100 byte range. I want to send updates on those objects over the network. I want to compile updates into individual TCP packets (with a reasonable MTU of ~1400). I want to try and fit as much updates in each packet as possible: first add their IDs, and then put the binary diff of all the binary objects, combined.

What is the best binary diff algorithm to be used for such a purpose?


Solution

  • With such small objects, you could use the classic longest-common-subsequence algorithm to create a 'diff'.

    That is not the best way to approach your problem, however. The LCS algorithm is constrained by the requirement to use each original byte only once when matching to the target sequence. You probably don't really need that constraint to encode your compressed packets, so it will result in sub-optimal solution in addition to being kinda slow.

    Your goal is really to use the example of the original objects to compress the new objects, and you should think about it in those terms.

    There are many, many ways, but you probably have some idea about how you want to encode those new objects. Probably you are thinking of replacing sections of the new objects with references to sections of the original objects.

    In that case, it would be practical to make a suffix array for each original object (https://en.wikipedia.org/wiki/Suffix_array). Then when you're encoding the corresponding new object, at each byte you can use the suffix array to find the longest matching segment of the old object. If it's long enough to result in a savings, then you can replace the corresponding bytes with a reference.