algorithmcompressiontext-compression

What is the name of this text compression scheme?


A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.

It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.

Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.

I recall that it was invented at Google, but it was not Snappy.


Solution

  • I think what you're on about is BOCU, Binary-Ordered Compression for Unicode or one of its predecessors/successors. In particular,

    The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.