huffman-codehuffman-tree

Huffman coding rules and optimization


All sources i've read quote the below process to get Huffman codes:

Let's try this for the following Frequency Map:

A -> 1 B -> 2 C -> 3 D -> 4

Making the Tree

        (ABCD, 10)
       /          \
   (ABC, 6)         D:4
   /     \
 (AB, 3)  C:3
 /    \
A:1   B:2

Following this tree to get the codes:

A -> 000 B -> 001 C -> 01 D -> 1

Obviously, the Max depth and the max code length does not follow the logical rule of being log2(nElement), which in this case should be two.

Optimally the codes would be:

A -> 00 B -> 01 C -> 10 D -> 11

So what's goin on here ? what am i doing wrong ?


Solution

  • Imagine encoding String "ABBCCCDDDD" - sample string where each symbol occurs with the given frequency.

    Huffman encoding you came up with encodes it as 0000010010101011111 - 19 characters.

    Your encoding where you optimized for max code length gives 00010110101011111111 - 20 characters.

    Your mistake is thinking that max depth is what's being optimized (or what needs to be optimized) - but Huffman encoding optimizes for encoded text length.

    And as a side note, as I mentioned in a comment, Huffman produces "an" optimal solution, not "the" optimal solution, it's possible I think for a lower-depth solution to be just as good, but it won't be better.