ziphuffman-codedeflate

Can initial similarity between 2 files create similarity in the start of their Deflate compressed output?


Assuming we have 2 files:

FILE 1

FILE 2

The 2 files share 20 initial bytes in uncompressed data. What are the chances that the compressed data after the zip headers share 3 initial bytes.

I have came across 5 files with the above characteristics and 4 of them had:

14 9A C7

and the other:

1C 9A C7

And i am confused of this.

If the above case is true then given the characteristics above one can easily guess the first bytes of the compressed file up 5 initial bytes.


Solution

  • It appears that all your compression is managing to do is to mostly undo the expansion of the Base64 encoding. Instead of compression, you should simply unencode the Base64 back to the original binary. Then you could try to compress that, though it appears the original binary may not be very compressible.

    The header of a dynamic block depends on the entire uncompressed content of that block, not just the first 20 bytes or whatever of that content. You can think of the header as a summary of the statistics of that block's uncompressed data and string matches. Furthermore, those initial three bytes are a statistical summary of a statistical summary, as they represent the coding of the header, which is itself compressed. Those first three bytes are quite far removed from the information contained in the first 20 bytes of the uncompressed data.

    There are constraints on those first three bytes that are independent of the uncompressed content. You can look at RFC 1951 for what the first 17 bits of a dynamic block represent. If you can depend on the fact that your input data is a Base64 encoding, then there will be probability distribution over a limited set of possible values of those three bytes, since the range of content bytes is limited, and especially since, other than the encoding, your data appears to be incompressible and so somewhat evenly distributed over those values.

    For kicks, I ran one million trials of Base-64 encoding of random binary data, to see the distribution of the first three bytes. I saw about 200 different such sequences, and I'm sure I would have seen more if I ran more trials. Here are the most common sequences, preceded by their count out of a million, down to a 0.1% probability. Your two sequences are in here, but are far from the most common.

    177522 14 9b c5
    174308 14 9a c5
    80347 14 9a 45
    69894 14 9b 45
    65311 14 9a b5
    60991 14 9b b5
    29937 14 99 c5
    28529 1c 9b c5
    25319 14 9a 35
    25185 1c 9a c5
    22425 14 9b 35
    19393 14 9b c7
    17294 14 99 45
    16236 14 9b 47
    13488 14 9a c7
    13138 14 99 b5
    12809 14 9b b7
    11916 14 9a 47
    9645 14 98 c5
    8543 14 9a b7
    8425 1c 9a b5
    8321 1c 9b b5
    7692 1c 9a 45
    7650 1c 9b 45
    6841 14 98 45
    6663 14 99 35
    5617 1c 9b c7
    5176 14 9b 37
    5013 14 98 b5
    4336 14 9a 37
    3670 1c 99 c5
    3561 1c 9a c7
    3169 14 98 35
    3021 1c 9b b7
    2524 14 97 c5
    2458 14 97 45
    2368 14 9c c5
    1980 1c 9a b7
    1875 14 97 b5
    1831 1c 9b 47
    1553 14 99 47
    1500 14 99 c7
    1487 1c 9a 35
    1443 1c 99 45
    1416 1c 99 b5
    1401 1c 9a 47
    1373 14 97 35
    1316 1c 9b 35
    1012 14 99 b7
    1002 1c 98 c5
    ...