I have been studying the LZSS compression scheme. I believe that I have just about got the hang of it, but I have some problems when it comes to actually encoding the data into actual bits and bytes. Specifically, I know that literal bytes are preceded by a 1 bit, while (offset, length) pairs are preceded by a 0 bit (and yes, I know that this can also be the other way around). However, encoding the data with these bit flags can create a situation where the compressed data ends up with a number of bits that is not a multiple of 8. How does the algorithm compensate for this? Does the algorithm simply append an appropriate number of 0 bits to the end of the data to make sure that the number of bits is a proper multiple of 8? Even if you did that, this would have deceptive effects on a would-be decompressor. Suppose that you had to add three 0's to the end of the data when you were compressing. In that case, the decompressor would interpret the first of those three 0's as a flag bit (perhaps indicating that the following two bytes are an offset length pair, or if not, then a literal byte). The decompressor would then get very confused to find that only two bits remain in the data stream, and that there are no more full bytes to interpret as a literal or as a pair. How do you deal with this?
I have seen the idea of chaining together 8 bits into a full byte (let's call this the flag byte) to act as a set of 8 flags for the next 8 entries. However, this idea pretty much runs into the same problem, doesn't it? What happens when there are less than 8 entries left to encode? (By 'entries', I am referring to literal bytes and/or offset length pairs). If there are only 5 entries left for example, you can't chain together 5 bits and call that a byte. What would you do in this case?
Another idea that I have seen in a few sources is the idea of keeping the (offset, length) pairs at 16 bits by encoding the offset in 12 bits and the length in 4 bits. However, that would limit the length to 15 since that is the largest number that can be encoded via 4 binary bits. How would this work then, with data that has encoded patterns that are longer than 15 characters? Additionally, that still does not take care of the problem of the flag bits.
Can someone please help me understand just how lzss compressed data is encoded? I get the theory behind lzss, but the actual encoding into bits and bytes seems strange to me.
To summarize, here is my current understanding of encoding the data in lzss:
If you come across a literal byte, then you just prepend a 1 (or 0) bit flag and then just output the byte as is.
If you come across an offset length pair, then you prepend a 0 bit (or a 1), and then output two bytes (first the offset, then the length). Example: If I come across the offset length pair (4, 7), then my encoded data here should look like this: 0(bit) 0x04 0x07 In bits, this translates to the following 17 bits: 00000010000000111
Is my understanding correct, or am I missing something?
If a literal byte is preceded by a 1 bit, then you can simply fill the remaining bits in the last byte with 1 bits. There can't be eight bits following it, so you know you're done. It cannot be misinterpreted as either an offset/length or a literal, even if you have an encoding that represents offset/lengths in a small number of bits.
If you would like for the compressed data to be self-terminating, because you will be following it by something else, then you can define a special offset/length combination that cannot be used to mean that offset and length, but instead signals the end of the compressed data.