algorithmcompressioncomplexity-theoryhuffman-code

How to find the set of binary symbols that can most efficiently compress using Huffman coding?


In my current implementation of file compression using Huffman coding, I take the frequency of each byte and build the tree from there.

I'm thinking that there is a possibility for further compression if I don't limit the program to only counting the frequencies of bytes but of binary symbols of any length.

For example, in a text file, if there was a "q" and the next byte was always a "u". Instead of using "q" and "u", it would be better to have "qu" and "u".

But instead of just concatenating bytes, any sort of binary symbol of any length.

I have thought to generate all possible messages, then somehow choose a subset of symbols that don't overlap using some sort of Mealy machine implementation and generate frequencies, but I'm at a loss.


Solution

  • This is a problem researched by Kolmogorov, Solomonoff, Chaitin, and others in 1960 - 1970. Wikipedia gives a good overview, you can find the original papers from there. Kolmogorov complexity deals with the idea of finding the shortest program length that can reproduce a given arbitrary input string. The term "arbitrary" means any input, including infinitely many and random input.

    One key point in the study of Kolmogorov complexity is that it's fundamentally impossible to write a single, generic program that can compress arbitrary input strings to a size smaller than the input string itself plus some bytes. This limitation comes from the very nature of arbitrary - random noise data. Random data cannot be compressed. However, if you have specific knowledge about the internal structure or patterns within the data, you may be able to design a more efficient algorithm than the generic lossless compression or provide better arguments to the existing algorithm.

    As a practical approach, you can experiment with different general-purpose lossless compression algorithms and check their performance on your data. Try varying compression levels and dictionary sizes. Better compression would usually impact the time it takes to compress data.

    In some cases, if you have many short strings with repetitive data, you could try training compressor and use an already pre-computed dictionary to make the compression run faster with smaller output.

    From comments I noticed this may be related to DNA sequence compression - this is a separate active area of research, and there are many papers available on the topic. I would start from there.