zipgzipdeflatelz77pkzip

Is the DEFLATE compression algorithm in ZIP and GZip formats based on LZ77 or LZSS?


Wikipedia states that the DEFLATE algorithm (used among others by the ZIP and the GZip compression formats, but also by the PNG image format) is based on the LZ77 algorithm :

Deflate is a lossless data compression file format that uses a combination of LZ77 and Huffman coding.

LZSS' (a derivative of LZ77) Wikipedia page on the other hand states that it's the main algorithm for various tools, including PKZip (which uses the ZIP format, which uses the DEFLATE algorithm) :

Many popular archivers like PKZip, ARJ, RAR, ZOO, LHarc use LZSS rather than LZ77 as the primary compression algorithm.

Similarly, this GitHub page states that :

LZSS algorithm can be used in many applications, such as gzip, PKZIP etc.

Like ZIP, the GZip file format also uses the DEFLATE algorithm.

And here they state that :

DEFLATE (...) combines an LZ77 or LZSS preprocessor with Huffman coding on the backend to achieve moderately compressed results in a short time.

So, if I understand correctly, DEFLATE can be based on either LZ77 or LZSS. How is DEFLATE implemented today in the ZIP file format (used by PKZip) and the GZip file format ? Is it based on LZ77 or on LZSS ?


Solution

  • LZ77 + Huffman. It's always been LZ77 in the generic sense. Wikipedia is wrong.

    Deflate doesn't use LZ77 exactly as presented in the original paper, in that LZ77 always follows a match with a literal, whereas Deflate can follow a match with another match or a literal. This is still referred to as LZ77.

    LZSS is a specific encoding of that variant on LZ77 that uses a one-bit prefix to decide on a literal vs. match. Deflate does not do that, instead using a single Huffman code for literals and lengths, where that is followed by a Huffman code for a distance if the previous code was a length.