huffman-codedeflatelz77huffman-tree

Is it possible match initial bytes of Deflate compressed data using knowledge of first portion of uncompressed data?


So my obsession with deflate compression is leading me no where but i feel like there is something i can do the process better.

Here is what i understood so far:

def lz77(uncompressed):
    #find repeated characters and replace it with references
    #output will be like this m0,0 a0,0 l0,0 3,1
    return lz77 encoded data
def huffmanencode(lz_data):
    #encode data
    #put block type identifier and the beginning of every block
    #Put huffman encoded data
    #put end of block or new block identifier

Now lets say the original uncompressed input is:

I am considered a loser in this society, entrapped evilness inside of me now you gimmick you are not what i try to be. Drug deals in back alleys like a demented game of hide and seek Toast an average rape on a bad day, i haven't even begun to ride my peek But moving into adult hood my future is bleak.

And then we deflate it with normal or default parameters of 7zip or Winrar.

Then instead of knowing the whole uncompressed input i know:

I am considered a loser in this society, entrapped evilness inside of me

But i need to obtain few bytes of original compressed data using the part i know.

Here is my assumption regardless of the actual facts:

Since the portion of uncompressed data is at the very beginning then i have a shot against the first step of deflate compression which LZ77. So the output LZ77(original uncompressed) and LZ77(known portion) will share most significant bits. The hardest part if not impossible is Huffman encoding output which is always different in test cases but what if i knew the approximate length of the Huffman tree or the Huffman code used reconstruct the tree upon decompression, Then i can brute force it, if it's less than 4 bytes conclusion: Since the output of LZ77 is identical at the beginning then gaining knowledge of Huffman tree would definitely allow me to output the identical most significant bits.

I have been looking at Pyflate trying to understand extraction of information Huffman encoding that was using when compressing but it's unclear or i misunderstood the code.

Is it possible to get identical most significant bits using only first portion of original input data?

My goal is to utilize this:

MIME-Version: 1.0
Date: Sun, 18 Feb 2024 15:02:54 +0200
From:

And get most significant bits of this:

MIME-Version: 1.0
Date: Sun, 18 Feb 2024 15:02:54 +0200
From: Bluebird <servicing@app.bluebird.com>
Subject: Your monthly statement is available
Thread-Topic: Your monthly statement is available
Message-ID: <20240218130253.16d4687e27ab5f61@app.bluebird.com>
To: "tfanklin0@gmail.com" <tfanklin0@gmail.com

If not possible the i will need to know the size of the Huffman code or tree for the above.


Solution

  • No.

    The two different headers will represent two completely different codes. Even knowing the lengths of the headers and knowing the literal and length/distance LZ77 representation of the data, the bits for those two after the headers will be different because the codes are different.