gzip huffman-code lz77

How to extract the encoding dictionary from gzip archives


I am looking for a way to extract the encoding dictionary built by the DEFLATE algorithm from a gzip archive.

I need the LZ77 pointers (back-references) for the whole archive, which refer to earlier patterns in the file, as well as the Huffman tree used together with those pointers.

Is there a solution in Python?

Does anyone know whether infgen (https://github.com/madler/infgen/blob/master/infgen.c) might provide the dictionary?


Solution

  • The "dictionary" used for compression at any point in the input is nothing more than the 32 KiB (at most) of uncompressed data that precede that point.

    Yes, infgen will disassemble a deflate stream, showing all of the LZ77 references and the derived Huffman codes in a readable form. You could run infgen from Python and interpret the output in Python.

    infgen also has a -b option for a non-human-readable binary format that might be faster to process for what you want to do.
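
As a rough illustration of the approach above, here is a minimal Python sketch that runs infgen as a subprocess, collects the LZ77 references from its readable output, and shows how one reference points back into the preceding uncompressed data. It rests on several assumptions that should be checked against your installed infgen: that an `infgen` executable is on your PATH, that it reads the gzip stream from standard input, and that each reference appears on a line of the form `match <length> <distance>`. The function names `run_infgen`, `parse_matches`, and `resolve_match` are hypothetical helpers, not part of infgen.

```python
import subprocess

WINDOW_SIZE = 32 * 1024  # deflate's maximum match distance (32 KiB)


def run_infgen(gzip_path, infgen_cmd="infgen"):
    """Feed a gzip file to infgen on stdin and return its text output.

    Assumes an infgen executable is installed and reads from stdin.
    """
    with open(gzip_path, "rb") as f:
        result = subprocess.run(
            [infgen_cmd], stdin=f, stdout=subprocess.PIPE, check=True
        )
    # Decode permissively; the readable output is expected to be ASCII.
    return result.stdout.decode("latin-1")


def parse_matches(infgen_text):
    """Return (length, distance) pairs from infgen's readable output.

    Assumes each LZ77 reference is a line "match <length> <distance>".
    """
    matches = []
    for line in infgen_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "match":
            matches.append((int(fields[1]), int(fields[2])))
    return matches


def resolve_match(history, length, distance):
    """Return the bytes that a (length, distance) reference expands to.

    `history` is the uncompressed data preceding the reference, i.e. the
    "dictionary". The copy is done byte by byte because a match may
    overlap the bytes it is producing (length > distance).
    """
    if not 1 <= distance <= min(len(history), WINDOW_SIZE):
        raise ValueError("distance reaches outside the available window")
    out = bytearray(history)
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])
    return bytes(out[len(history):])
```

With infgen installed, something like `parse_matches(run_infgen("example.gz"))` (where `example.gz` is a placeholder path) would give the list of references. The Huffman code descriptions that infgen prints for dynamic blocks could be collected the same way, by matching on the keywords at the start of those lines.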