Can initial similarity between 2 files create similarity in the start of their Deflate compressed output?

Assuming we have 2 files:

FILE 1

Uncompressed size: 793765 bytes
Compressed size: 604911 bytes
Data type: 99% Base64
Compression method : deflate dynamic

FILE 2

Uncompressed size: 793765 bytes
Compressed size: 604911 bytes
Data type: 99% Base64
Compression method : deflate dynamic

The 2 files share 20 initial bytes in uncompressed data. What are the chances that the compressed data after the zip headers share 3 initial bytes.

I have came across 5 files with the above characteristics and 4 of them had:

14 9A C7

and the other:

1C 9A C7

And i am confused of this.

If the above case is true then given the characteristics above one can easily guess the first bytes of the compressed file up 5 initial bytes.

Solution

It appears that all your compression is managing to do is to mostly undo the expansion of the Base64 encoding. Instead of compression, you should simply unencode the Base64 back to the original binary. Then you could try to compress that, though it appears the original binary may not be very compressible.

The header of a dynamic block depends on the entire uncompressed content of that block, not just the first 20 bytes or whatever of that content. You can think of the header as a summary of the statistics of that block's uncompressed data and string matches. Furthermore, those initial three bytes are a statistical summary of a statistical summary, as they represent the coding of the header, which is itself compressed. Those first three bytes are quite far removed from the information contained in the first 20 bytes of the uncompressed data.

There are constraints on those first three bytes that are independent of the uncompressed content. You can look at RFC 1951 for what the first 17 bits of a dynamic block represent. If you can depend on the fact that your input data is a Base64 encoding, then there will be probability distribution over a limited set of possible values of those three bytes, since the range of content bytes is limited, and especially since, other than the encoding, your data appears to be incompressible and so somewhat evenly distributed over those values.

For kicks, I ran one million trials of Base-64 encoding of random binary data, to see the distribution of the first three bytes. I saw about 200 different such sequences, and I'm sure I would have seen more if I ran more trials. Here are the most common sequences, preceded by their count out of a million, down to a 0.1% probability. Your two sequences are in here, but are far from the most common.

177522 14 9b c5
174308 14 9a c5
80347 14 9a 45
69894 14 9b 45
65311 14 9a b5
60991 14 9b b5
29937 14 99 c5
28529 1c 9b c5
25319 14 9a 35
25185 1c 9a c5
22425 14 9b 35
19393 14 9b c7
17294 14 99 45
16236 14 9b 47
13488 14 9a c7
13138 14 99 b5
12809 14 9b b7
11916 14 9a 47
9645 14 98 c5
8543 14 9a b7
8425 1c 9a b5
8321 1c 9b b5
7692 1c 9a 45
7650 1c 9b 45
6841 14 98 45
6663 14 99 35
5617 1c 9b c7
5176 14 9b 37
5013 14 98 b5
4336 14 9a 37
3670 1c 99 c5
3561 1c 9a c7
3169 14 98 35
3021 1c 9b b7
2524 14 97 c5
2458 14 97 45
2368 14 9c c5
1980 1c 9a b7
1875 14 97 b5
1831 1c 9b 47
1553 14 99 47
1500 14 99 c7
1487 1c 9a 35
1443 1c 99 45
1416 1c 99 b5
1401 1c 9a 47
1373 14 97 35
1316 1c 9b 35
1012 14 99 b7
1002 1c 98 c5
...