algorithmcompressionhuffman-codelossless-compressionlossless

Is there mathematical proof that Huffman coding is the most efficient lossless compression algorithm?


My friend told me it existed but I could never find it, not sure if he was lying but I'm very interested as to how the proof works. (Yes, I'm one of those people who found out about Huffman coding from the Silicon Valley TV show, sorry)


Solution

  • The answer is it is, it isn't, and the question is ill-posed. :-)

    Here is a high level view. Lossless compression algorithms provide a reversible mapping of possible documents to be compressed, to compressed documents. Documents can be viewed as strings of bits. There are 2^n possible documents with n bits. There are 2^n possible compressed documents with n bits. Therefore the pidgin-hole principle says that for every document that is stored more efficiently, some other possible document must be stored less efficiently.

    So how is compression possible? It is possible because while all documents are possible, they are not equally likely. So a good compression algorithm will store likely documents very efficiently, and unlikely ones inefficiently. But then the question is what documents are efficient. The answer to that is, "It depends." And the answer to how good a compression algorithm is will also depend.

    Suppose that you take the set of random documents made out of a set of symbols that independently appear with different probabilities. Huffman coding produces the most efficient possible compression algorithm.

    Now suppose you take the set of random sentences that are likely to be written in English? Huffman coding is limited to looking at raw letter frequencies. It makes no use of the fact that certain combinations of letters appear very frequently. Other encodings that can use that will now work better.

    Now suppose you take the set of documents that could be produced by your camera. This looks nothing like text, and different encoding methods will work better.

    So there are cases where Huffman is best. Cases where it isn't. And the question is ill-posed since it depends on, "What documents are likely?"