compression, huffman-code, game-boy-advance, lz77

Why combine Huffman and LZ77?


I'm reverse engineering a Game Boy Advance game, and I noticed that the original developers wrote code that makes two system calls to decompress a level, using Huffman and then LZ77 (in that order).

But why use Huffman + LZ77? What's the advantage of this approach?


Solution

  • Using available libraries

    It's possible that the developers are using DEFLATE (or some similar algorithm), simply to be able to re-use tested and debugged software rather than writing something new from scratch (and taking who-knows-how-long to test and fix all the quirky edge cases).
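
    As a rough sketch of that "reuse a library" option (assuming a DEFLATE-style library is acceptable; the GBA BIOS decompression calls the game actually uses are a different, simpler format), Python's zlib gives LZ77 matching plus Huffman coding in a single call:

        import zlib

        level_data = bytes(range(16)) * 64            # stand-in for a level map
        packed = zlib.compress(level_data, 9)         # DEFLATE = LZ77 matching + Huffman coding
        assert zlib.decompress(packed) == level_data  # round-trips losslessly
        print(len(level_data), "->", len(packed), "bytes")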

    Why both Huffman and LZ77?

    But why do DEFLATE, Zstandard, LZHAM, LZHUF, LZH, etc. use both Huffman and LZ77?

    Because these 2 algorithms detect and remove 2 different kinds of redundancy common to many data files (video game levels, English and other natural-language text, etc.), and they can be combined to get better net compression than either one alone. (Unfortunately, most data compression algorithms cannot be combined like this.)

    Details

    In English, the 2 most common letters are (usually) 'e' and then 't'. So what is the most common pair? You might guess "ee", "et", or "te" -- nope, it's "th".

    LZ77 is good at detecting and compressing these kinds of common words and syllables that occur far more often than you might guess from the letter frequencies alone.

    Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables).
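
    One quick way to see the difference is to count single-letter and letter-pair frequencies side by side. The sketch below uses a made-up sentence (my own toy example, not data from the game):

        from collections import Counter

        text = "the quick brown fox jumps over the lazy dog while the cat sleeps"
        letters = Counter(c for c in text if c.isalpha())
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1)
                        if text[i].isalpha() and text[i + 1].isalpha())

        print(letters.most_common(3))   # single-letter statistics: all that Huffman sees
        print(pairs.most_common(3))     # pair statistics: the correlations Huffman ignores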

    LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items". Then Huffman further compresses that intermediate sequence. Often those "copy items" are already much shorter than the original substring would have been if we had skipped the LZ77 step and simply Huffman compressed the original file. And Huffman does just as well compressing the literal letters in the intermediate sequence as it would have done compressing those same letters in the original file.

    So this 2-step process already creates smaller files than either algorithm alone. As a bonus, the copy items themselves (their lengths and distances) are typically also Huffman compressed, saving even more storage space.
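
    To make the 2-step pipeline concrete, here is a deliberately tiny sketch (my own toy code: a greedy LZ77 matcher and a textbook Huffman code, not the GBA BIOS format or DEFLATE). It turns the input into a stream of literals and copy items, then builds one Huffman code word per whole token; real formats code literals, lengths, and distances more cleverly than this:

        import heapq
        from collections import Counter

        def lz77_tokens(data, window=255, min_match=3):
            """Greedy toy LZ77: emit ('lit', byte) or ('copy', distance, length) items."""
            tokens, i = [], 0
            while i < len(data):
                best_len, best_dist = 0, 0
                for j in range(max(0, i - window), i):
                    length = 0
                    while (i + length < len(data) and length < 255
                           and data[j + length] == data[i + length]):
                        length += 1
                    if length > best_len:
                        best_len, best_dist = length, i - j
                if best_len >= min_match:
                    tokens.append(('copy', best_dist, best_len))
                    i += best_len
                else:
                    tokens.append(('lit', data[i]))
                    i += 1
            return tokens

        def huffman_code(symbols):
            """Build a prefix code {symbol: bit string} from symbol frequencies."""
            freq = Counter(symbols)
            if len(freq) == 1:                          # degenerate one-symbol input
                return {next(iter(freq)): '0'}
            heap = [(f, n, {sym: ''}) for n, (sym, f) in enumerate(freq.items())]
            heapq.heapify(heap)
            next_id = len(heap)
            while len(heap) > 1:
                f1, _, c1 = heapq.heappop(heap)         # merge the two least-frequent subtrees
                f2, _, c2 = heapq.heappop(heap)
                merged = {s: '0' + b for s, b in c1.items()}
                merged.update({s: '1' + b for s, b in c2.items()})
                heapq.heappush(heap, (f1 + f2, next_id, merged))
                next_id += 1
            return heap[0][2]

        data = b"the theme of the thesis is the theory of these theorems"
        tokens = lz77_tokens(data)                      # intermediate literals + copy items
        code = huffman_code(tokens)                     # one code word per whole token
        payload_bits = sum(len(code[t]) for t in tokens)
        print(len(data) * 8, "bits raw ->", payload_bits, "bits of coded payload",
              "(ignoring the cost of storing the code table)")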

    In general, most data compression software is made up of these 2 parts. First they run the original data through a "transformation" or multiple transformations, also called "decorrelators", typically highly tuned to the particular kind of redundancy in the particular kind of data being compressed (JPEG's DCT transform, MPEG's motion-compensation, etc.) or tuned to the limitations of human perception (MP3's auditory masking, etc.). Next they run the intermediate data through a single "entropy coder" (arithmetic coding, or Huffman coding, or asymmetric numeral system coding) that's pretty much the same for every kind of data.