compressionprefixhuffman-codeprefix-tree

Compressing text with good spelling, and canonical huffman code


I want to compress text by using words as symbols instead of characters, I don't really know if it's a good idea, but I just want to test it (for science).

The problem is, I can't really store all the words of the english language, so I have gathered a list of very common words (around 1600 words) which I plan to alter just like spellcheckers store derived forms of words. (example: kill, kill-ing, kill-er, kill-s depending if it's a verb, adjective etc)

http://en.wikipedia.org/wiki/Canonical_Huffman_code

I'd like to know if this special version of huffman coding suits my need, since the 'dictionary' won't change often and can be distributed with the decompressing tool. It also seems I would have to dictate frequencies of the words when creating my original huffman tree before turning it into a canonical huffman tree.

Can you correct me if I'm missing a point here, or if it is a good or bad idea ?


Solution

  • The point to note here is that this special variant has the advantage of only smaller codebook and not the compressed data. Thus use it wherever you need to include the huffman codebook along with your data provided the pieces you replace are sequential. Since words naturally can be sequenced in an order - you can, and thus you should - use Canonical Huffman Code.