What features in a data set would lead to Huffman coding outperforming Deflate compression?
I am looking at lossless data compression for numerical raster data sets such as terrestrial elevation data. I am seeing a non-trivial number of cases where Huffman coding produces smaller outputs than Deflate compression. I am a bit surprised by this result because, for most data I've looked at in the past, Deflate is better than Huffman by a comfortable margin.
I am using the standard Java API for Deflate (level 6) and a home-baked version for Huffman. The techniques I use are described at https://gwlucastrig.github.io/GridfourDocs/notes/GridfourDataCompressionAlgorithms.html. The implementation compresses blocks of 11000 bytes and selects either Huffman or Deflate based on results. Huffman was selected in about 7700 cases (36%) and Deflate in about 14000 (64%). The only other clue I have is that the average first-order entropy in the Huffman-selected blocks was 5.83 bits/symbol, and the average in the Deflate-selected blocks was 6.16 bits/symbol.
A bit more detail on the compression test is available at https://gwlucastrig.github.io/GridfourDocs/notes/EntropyMetricForDataCompressionCaseStudies.html#method_selection
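For reference, the entropy figures above come from the standard Shannon formula applied to the byte frequencies within each block. A simplified sketch of that calculation (not the exact Gridfour code; the method name is mine) is:

```java
// Sketch: first-order entropy of a block in bits/symbol, computed from byte frequencies.
static double firstOrderEntropy(byte[] block) {
    int[] counts = new int[256];
    for (byte b : block) {
        counts[b & 0xFF]++;
    }
    double entropy = 0;
    for (int c : counts) {
        if (c > 0) {
            double p = (double) c / block.length;
            entropy -= p * Math.log(p) / Math.log(2); // log base 2
        }
    }
    return entropy; // e.g., about 5.83 bits/symbol for the Huffman-selected blocks
}
```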
The deflate compressed data format cannot be outperformed by Huffman, at least not significantly*, because compressing using only Huffman coding is an option for that format.
Your question is about a specific implementation of deflate. Java uses zlib, which has a strategy for how to make use of the Huffman coding and LZ77 string matching capabilities of the deflate format. For the cases where Huffman coding was better, there were likely a few string matches, but describing the extra codes for those matches cost more bits than the matches saved. The deflate implementation you are using does not try Huffman-only coding to see if that would be smaller. (Though it could. I might add that.)
You can try Huffman-only using deflate to compare with your home-baked version. Search for HUFFMAN_ONLY here.
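Since you are already using the Java API, a minimal sketch of a Huffman-only deflate pass (java.util.zip.Deflater with the HUFFMAN_ONLY strategy; nowrap=true drops the six-byte zlib header and trailer) might look like this:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch: compress one block with deflate using Huffman coding only, no string matching.
static byte[] huffmanOnlyDeflate(byte[] block) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = raw deflate
    deflater.setStrategy(Deflater.HUFFMAN_ONLY);
    deflater.setInput(block);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    while (!deflater.finished()) {
        int n = deflater.deflate(buffer);
        out.write(buffer, 0, n);
    }
    deflater.end();
    return out.toByteArray();
}
```

Comparing the length of that output against your home-baked Huffman output on the same blocks should tell you how much of the difference is the code description rather than the coding itself.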
*"Huffman coding" is also not a single thing, but encompasses a family of possible implementations. The key variations are a) how to send a representation of the Huffman code before the Huffman-coded data, b) when and how often to make a new Huffman code for some number of bytes of subsequent data, and c) how to know when each set of Huffman-coded data ends. This can result in different performance for different implementations on various data.
For a) there are many different ways to represent a Huffman code, some more compact than others. Using a canonical Huffman code can result in a smaller description, since only the code lengths need to be sent, with no change in the number of bits in the coded data. A sketch of canonical code assignment is given after point c) below.
For b) some data might be homogeneous in its statistics, so new codes might only hurt, whereas other data may come in chunks of varying statistics, in which case the compression would benefit from frequent new codes.
For c) the addition of an "end" symbol is efficient, but in some cases can reduce the compression by taking up code space. Prepending a count would avoid that, but then the dedicated count bits might make things worse for small blocks.
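To illustrate a): with a canonical code, the decoder can reconstruct the exact code values from the code lengths alone, so only the lengths ever need to be transmitted. A sketch of that reconstruction (essentially the procedure described in RFC 1951; the method name is mine):

```java
// Sketch: assign canonical Huffman codes given the code length of each symbol
// (as produced by the usual Huffman tree construction). Codes are assigned in
// order of increasing length, and within a length in increasing symbol order,
// so the decoder can rebuild the identical codes from the lengths alone.
static int[] canonicalCodes(int[] lengths) {
    int maxLen = 0;
    for (int len : lengths) {
        maxLen = Math.max(maxLen, len);
    }
    // Count how many symbols have each code length (length 0 means the symbol is unused).
    int[] lengthCount = new int[maxLen + 1];
    for (int len : lengths) {
        if (len > 0) {
            lengthCount[len]++;
        }
    }
    // Compute the first code value for each length.
    int[] nextCode = new int[maxLen + 1];
    int code = 0;
    for (int len = 1; len <= maxLen; len++) {
        code = (code + lengthCount[len - 1]) << 1;
        nextCode[len] = code;
    }
    // Hand out codes to symbols in increasing symbol order.
    int[] codes = new int[lengths.length];
    for (int sym = 0; sym < lengths.length; sym++) {
        if (lengths[sym] > 0) {
            codes[sym] = nextCode[lengths[sym]]++;
        }
    }
    return codes;
}
```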
I am assuming that by "Huffman coding" here, you mean Huffman coding of the bytes of the data, so there are 256 possible symbols. You could also try coding 65536 symbols, where the previous byte predicts the next byte. Now we're getting into higher-order models of the data.
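As a rough sketch of that last idea (the helper below is only an illustration, not something from your code): the 65536-symbol alphabet pairs each byte with the byte that preceded it, which is equivalent to keeping 256 separate sets of byte frequencies, one per previous-byte context:

```java
// Sketch: gather symbol counts for an order-1 model, where the symbol is
// (previous byte << 8) | current byte, giving 65536 possible symbols.
static int[] order1Counts(byte[] data) {
    int[] counts = new int[256 * 256];
    int prev = 0; // assume a context of zero before the first byte
    for (byte b : data) {
        int cur = b & 0xFF;
        counts[(prev << 8) | cur]++;
        prev = cur;
    }
    return counts;
}
```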
Update:
Example data was provided. Indeed I get 5868 bytes for raw deflate at the default compression level of 6, 5870 bytes using compression level 9 (!), and 5720 bytes using deflate with Huffman only. The latter is less than the homegrown Huffman compression of 5793 bytes, due to using canonical codes and a more efficient description of the Huffman code, which is itself compressed with run-length coding and another Huffman code.
(Note: my 5868 is less than your 5874 due to leaving off the six bytes of zlib header and trailer, for a fair comparison to the Huffman encodings also with no header or trailer.)
In this case there are a lot of matches, but they are all very short, with most of them being a length of only three. Those in particular can result in a net loss, depending on the cost of describing the codes for all those different distances and the effect the added match lengths have on the size of the literal/length codes. That is the case here.