zipdeflatelossless-compressionhuffman-tree

How does a decompressor know the huffman tree that was used by the compressor?


From what i understood Deflate is LZ77 and Huffman encoding but i have been looking at the structure of .zip file but i haven't seen where to extract information about Huffman tree that was used in compression. I suspect it might be stored in the compressed data but where exactly in it, its not specified anywhere.

My goal is to see if can do Deflate compression and decompression manually to better understand it before i start my python programming.

And frankly my knowledge is short on block sizes and Huffman tree, so that's where i want to try to understand.


Solution

  • Yes, the Huffman codes are stored at the start of every dynamic block in deflate compressed data. There may be many dynamic blocks, so many Huffman trees in one deflate compressed stream. Each dynamic block has three Huffman codes, one for literal/length codes, one for distance codes, and third that is used to compress the descriptions of the first two codes. Each Huffman code is represented not as a tree, but rather as the number of bits in the prefix code for each coded symbol.

    It is most definitely "specified anywhere". See RFC 1951 for a detailed description, and infgen for a deflate stream disassembler which will allow you to view the detailed contents of a deflate stream.