Tags: encoding, huffman-code, information-theory

How can I find the average codeword length of a Huffman code if there are N (10 or more) symbols?


I'm practicing for an exam and found a problem that asks for the average length of the codewords in a Huffman code.

This usually wouldn't be hard, but in this problem we have to encode 100 symbols, all with the same probability (1/100).

Since there is obviously no point in trying to encode 100 symbols by hand, I was wondering whether there is a method to find the average length without actually going through the encoding process.

I'm guessing this is possible since all the probabilities are equal; however, I couldn't find anything online.

Any help is appreciated!


Solution

  • For 100 symbols with equal probability, some will be encoded with six bits and some with seven. A Huffman code is a complete prefix code: "complete" means that every possible bit pattern is used, i.e. every internal node of the code tree has two children.

    Let's say that i codes are six bits long and j codes are seven bits long. We know that i + j = 100. There are 64 possible six-bit codes, so after i of them are used up, there are 64 - i left. Extending each of those by one bit to make it seven bits long doubles the number of possible codes, so we can have up to 2(64 - i) seven-bit codes.

    For the code to be complete, all of those codes must be used, so j = 2(64 - i). We now have two equations in two unknowns: substituting j = 128 - 2i into i + j = 100 gives 128 - i = 100, so i = 28 and j = 72.

    Since all symbols are equally probable, the average number of bits per symbol is (28×6 + 72×7) / 100 = 6.72. Not too bad, considering the entropy of each symbol is log2(100) ≈ 6.64 bits. The two sketches below generalize and verify this calculation.
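To generalize the counting argument to any number n of equiprobable symbols, here is a minimal Python sketch (the function names are mine, not anything standard): the codeword lengths are k and k + 1, where 2^k ≤ n < 2^(k+1), and the same two equations pin down how many of each.

```python
def equiprobable_huffman_lengths(n):
    """(count, length) pairs for a Huffman code over n equiprobable
    symbols, via the complete-prefix-code argument: lengths are k and
    k + 1, where 2**k <= n < 2**(k+1)."""
    k = n.bit_length() - 1          # floor(log2(n)): the shorter length
    if n == 2 ** k:                 # exact power of two: all codes are k bits
        return [(n, k)]
    j = 2 * (n - 2 ** k)            # symbols that get (k + 1)-bit codes
    i = n - j                       # symbols that get k-bit codes
    return [(i, k), (j, k + 1)]

def average_length(n):
    return sum(c * l for c, l in equiprobable_huffman_lengths(n)) / n

print(equiprobable_huffman_lengths(100))   # [(28, 6), (72, 7)]
print(average_length(100))                 # 6.72
```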
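And as a sanity check, a sketch that actually builds the Huffman code with a min-heap and tallies the codeword lengths. The (probability, tiebreaker, leaves) heap entries are just one way to break ties between equal probabilities without comparing the payload:

```python
import heapq
from collections import Counter

def huffman_lengths(probs):
    """Standard heap-based Huffman construction; returns each symbol's
    codeword length. Each heap entry holds a subtree's total probability,
    a tiebreaker, and its leaves as (symbol, depth) pairs."""
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf one level deeper.
        heapq.heappush(heap, (p1 + p2, tiebreak, [(s, d + 1) for s, d in a + b]))
        tiebreak += 1
    return dict(heap[0][2])

lengths = huffman_lengths([1 / 100] * 100)
print(Counter(lengths.values()))             # Counter({7: 72, 6: 28})
print(sum(lengths.values()) / len(lengths))  # 6.72
```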