compressionbinary-datahuffman-codelzwlossless

Lossless compression theory, is compression ratio based on size of pattern and times repeated?


I was wondering which of the following scenarios will achieve the highest ratio with lossless algorithms applied to binary data with repeated data.

Am I correct to assume the compression ratio depends on patterns?

  1. Size
  2. Times repeated

For example the binary data:

10 10 10 10 10 10 10 10 pattern (10) size 2, pattern (10) repeated 8

1001 1001 1001 1001 pattern (1001) size 4, pattern (1001) repeated 4

0000000 11111111 pattern (0) size 1, pattern (0) repeated 8; pattern (1) size 1, pattern (1) repeated 8; Or 0000000 11111111 pattern (0000000) size 8, pattern (0000000) repeated 8; pattern (11111111) size 8, pattern (11111111) repeated 1;

Which of the above achieves the highest and lowest compression ratios?

Thanks in advance.


Solution

  • Those are all sequences that would be very unlikely to be seen in the wild. What is the point of the question?

    Run-of-the-mill compressors are byte-oriented. As such, any pattern that results in simply the same byte repeated will give the highest compression ratio. E.g. 1032:1 in the limit for deflate. Other simple repetitions of short patterns will get very high compression ratios. E.g. again 1032:1 for deflate for patterns of two or three repeating bytes.

    The limit on compression in these absurdly extreme cases is a function of the compression format, not of the data.