In my efforts to write some compressor/decompressor code for some .lzs files for a mobile game, I came across the documentation for how Final Fantasy 7 (FF7) handles LZSS compression and decompression. Here is the documentation:
https://github.com/niemasd/PyFF7/wiki/LZSS-Format
Now, I mostly understand this. However, that calculation for the offset is really throwing me off. This manner of LZSS compression uses a 12 bit offset for length/offset pairs that is calculated as follows:
real_offset = tail - ((tail - 18 - raw_offset) mod 4096)
where the tail is your current position in the decompressed output (assuming you are decompressing), and raw_offset is the 12 bit string composed of the first 4 bits of the second byte in the pair followed by all 8 bits of the first byte of the pair.
The problem that I am having is: I am trying to write some compressor code, but I don't know what that raw_offset is supposed to represent. Here's what I mean by that:
Normally, in LZSS compression, the offset is the number of characters you have to go backwards from your current position in the decompressed output to arrive at the first character of the pattern that you are repeating. For example, if I have the string:
ABCDEFABCGHIJ
This would be compressed like:
ABCDEF(6,3)GHIJ
Of course, the 6 would be the offset because the first instance of 'ABC' occurs 6 characters prior to the second instance of 'ABC'.
However, with this FF7 implementation of LZSS, I don't even understand at a high level what this raw_offset is supposed to be (let alone the real_offset). Whereas the normal LZSS offset (e.g. the 6 in the pair above) is just the distance between your current position and the first character of the most recent iteration of the pattern that you are copying, I don't know what this raw_offset is.
Could someone please explain to me what this raw_offset is in the FF7 implementation of LZSS compression? I can decompress, but I can't write compressor code without understanding what that raw offset physically represents.
To illustrate just how much of a problem this is for me: I actually wrote some decompressor code for this implementation, and I tried to decompress a certain file, and I initially ended up getting an IndexOutOfBounds error because the code had only written 12 bytes to my decompressed output thus far, but the real_offset ended up being something like 3602. Remember: A 12 bit offset with any 1 bits in the higher nibble makes for a very large number.
Additionally, does anyone know where I could access a FF7 file that is compressed in this manner so that I can test my decompression code? I have some .lzs files, but they are not from FF7 and I am not entirely sure if the game that my files are from actually use this FF7 implementation.
Lastly, what do you do in the event that there are less than the necessary number of bytes remaining to accommodate all 8 bits of the final control byte in the file? For example, let's say you get to the last control byte and it is 0x00. This would imply that 8 length/offset pairs (or 16 bytes) are to follow. However, what if there are only two bytes remaining after this control byte, thus leaving 7 bits of the control byte unaccounted for. Would you just print the one last pair as normal and then print 14 null (0x00) bytes? Alternatively, would you just print the one last pair as normal and then only print 7 null (0x00) bytes? Perhaps you would just print the pair and just leave it at that (not printing any null bytes)?
There is no "physical meaning", as everything being discussed here is data in software. Until, I suppose, you get down to the electrons, wires, and transistors.
The linked documentation explains this all superbly well with explicit examples, but ok, here's another shot.
TL;DR: The raw offset refers to a circular buffer. The true offset refers to a linear buffer.
The "raw offset" is simply the offset in a circular 4096-byte buffer. You initialize that buffer with zeros. You write uncompressed data to that buffer starting 18 bytes back from the end. When you run into the end of the buffer either writing or copying a reference, you wrap around to the beginning. When you get a reference, you copy from the buffer starting at that offset.
If instead you use a linear buffer that is large enough to hold all of the resulting uncompressed data, then you would use the provided formula to get the "real" offset into that buffer to start copying from. They did not have to provide that formula — you can derive it from the description of the raw offset. For a linear buffer you would want to precede it with 4096 bytes of zeros in order to support references before the compressed data. Or you have code to detect such a reference and write the requisite number of zeros instead of copying from the buffer.
I initially ended up getting an IndexOutOfBounds error because the code had only written 12 bytes to my decompressed output thus far, but the real_offset ended up being something like 3602.
That, or any offset that precedes the compressed data, is not an error. The document specifies that any such reference should copy zeros.
The first four bytes provides the length of the compressed data. Once you have processed all of the compressed data, you're done. It is permitted to have missing literal bytes. How else would arbitrary-length input, e.g. one byte, be encodable? If the compressed data ends in the middle of a reference however, i.e. you only have one of the two bytes, then that should be reported as an error.