logging, data-structures, compression, buffer-overflow, text-compression

log module with pre-allocated memory


I am writing a logging mechanism that will be used by the rest of the code to log alphanumeric data to a file. Every other module in the system will be continuously sending alphanumeric sentences (a couple of words at most) to be written to file. The catch is that I have only been given a small amount of pre-allocated memory to use for my data structures and in-memory storage of these log messages. If the inflow is more than what can be written to disk, the log messages will be discarded.

I want to put in a compression mechanism between the client and in-memory storage in my log module, so that I can save as many messages as possible.

My current design so far:

CLIENT ------> LOG MODULE ----> compress and store in in-memory buffer 1

Writer thread: When it's time to write, switch buffer 1 with buffer 2 and write buffer 1 to file. The client will be writing to buffer 2 during this time (a rough sketch of this swap is shown below).

Script outside: Decompress and show log messages
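
To make the design concrete, here is a rough sketch of just the buffer-swap step, assuming pthreads; the compression step between the client and the buffer is omitted, and all names and sizes are placeholders, not part of the actual design:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define LOG_BUF_SIZE 4096

    static char buf_a[LOG_BUF_SIZE], buf_b[LOG_BUF_SIZE];
    static char *active = buf_a;                 /* clients append here */
    static size_t used;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Client side: append a message, or drop it if the buffer is full. */
    void log_append(const char *msg, size_t len)
    {
        pthread_mutex_lock(&lock);
        if (used + len <= LOG_BUF_SIZE) {
            memcpy(active + used, msg, len);
            used += len;
        }                                        /* else: message is discarded */
        pthread_mutex_unlock(&lock);
    }

    /* Writer thread: swap buffers under the lock, write the old one outside it. */
    void log_flush(FILE *file)
    {
        pthread_mutex_lock(&lock);
        char *full = active;
        size_t n = used;
        active = (active == buf_a) ? buf_b : buf_a;
        used = 0;
        pthread_mutex_unlock(&lock);

        fwrite(full, 1, n, file);
        fflush(file);
    }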

Question: What is a good alphanumeric compression algorithm I can use or a good data structure I can use to capture as much data as possible (during the compression stage above)?

If possible, I would like an algorithm that doesn't keep the compression codes only in an intermediate in-memory data structure - i.e., if the system crashes, I want to be able to decompress whatever has been written to file so far.

Attempt so far: assign a code to every character we will be using. It doesn't seem very flexible.
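
For illustration, that idea amounts to something like the 6-bit packing below (62 alphanumerics plus space fit in 6 bits), which tops out at roughly a 25% size reduction and has no good escape for unexpected characters; the code is only a sketch of the rejected approach:

    #include <stdint.h>
    #include <string.h>

    /* Fixed 6-bit code per character (illustrative only). */
    static uint8_t code6(char c)
    {
        if (c >= 'a' && c <= 'z') return (uint8_t)(c - 'a');       /*  0..25 */
        if (c >= 'A' && c <= 'Z') return (uint8_t)(c - 'A' + 26);  /* 26..51 */
        if (c >= '0' && c <= '9') return (uint8_t)(c - '0' + 52);  /* 52..61 */
        if (c == ' ')             return 62;
        return 63;                                       /* everything else */
    }

    /* Pack msg into out at 6 bits per character; returns bytes written. */
    static size_t pack6(const char *msg, uint8_t *out)
    {
        size_t bits = 0;
        memset(out, 0, (strlen(msg) * 6 + 7) / 8);
        for (const char *p = msg; *p; ++p, bits += 6) {
            uint8_t c = code6(*p);
            out[bits / 8] |= (uint8_t)(c << (bits % 8));
            if (bits % 8 > 2)                 /* code straddles a byte boundary */
                out[bits / 8 + 1] |= (uint8_t)(c >> (8 - bits % 8));
        }
        return (bits + 7) / 8;
    }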

Most of the log messages are simple text sentences.


Solution

  • Question: What is a good alphanumeric compression algorithm I can use or a good data structure I can use to capture as much data as possible (during the compression stage above)?

    In general the slower and more memory-hungry the algorithm the better the compression ratio. Different codecs make different trade-offs, and even within some codecs you can tweak different parameters to yield different trade-offs.

    Codecs also tend to perform very differently with different data. There are several benchmarks floating around, but that will only give you a general idea of performance; to really choose the best one you'll need to try them out with your data and take your own measurements.

    As for avoiding data loss when your process crashes, with your current design what you want is a streaming codec which supports flushing. Every time you finish logging a message, you'll want to tell the codec to flush. The API for this will depend on the codec, but usually you'll end up with something like

        foo_compress(stream, output, input);
        foo_flush(stream);
        fwrite(stream->output, 1, stream->output_size, file);
        fflush(file);

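    As a concrete instance of that pattern, here is a minimal sketch using zstd's streaming API (ZSTD_compressStream2 with ZSTD_e_flush); the buffer size and error handling are illustrative, and the caller is assumed to create one ZSTD_CCtx with ZSTD_createCCtx() and reuse it for every message:

        /* Sketch: compress one message and flush, so everything written to
           the file so far stays decompressible even if the process dies
           right afterwards. Assumes libzstd >= 1.4. */
        #include <stdio.h>
        #include <zstd.h>

        static void log_one(ZSTD_CCtx *cctx, FILE *file,
                            const char *msg, size_t msg_len)
        {
            char outbuf[256];             /* scratch from the pre-allocated pool */
            ZSTD_inBuffer in = { msg, msg_len, 0 };
            size_t remaining;

            do {
                ZSTD_outBuffer out = { outbuf, sizeof outbuf, 0 };
                remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_flush);
                if (ZSTD_isError(remaining))
                    break;                              /* error handling elided */
                fwrite(outbuf, 1, out.pos, file);
            } while (remaining != 0 || in.pos < in.size);   /* 0 == fully flushed */

            fflush(file);
        }
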
    A few libraries provide an API for reading/writing to disk (allowing you to skip the fwrite/fflush). Squash, gzip, and lzham come to mind, but there are probably others. For the most part, though, the library just compresses to a buffer and you're responsible for writing the buffer to a file.
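
    For example, zlib's gz* file interface takes care of the file handling; a minimal sketch (Z_SYNC_FLUSH keeps the on-disk file decompressible up to the last flushed message, and the file name is made up):

        #include <zlib.h>

        /* Append one message and flush so the file on disk stays readable
           even if the process crashes right after this call. */
        static void gz_log(gzFile gf, const char *msg, unsigned len)
        {
            gzwrite(gf, msg, len);
            gzflush(gf, Z_SYNC_FLUSH);
        }

        /* Usage: gzFile gf = gzopen("app.log.gz", "wb"); ... gzclose(gf); */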

    Your main obstacle here is going to be that a lot of algorithms don't support flushing. Off the top of my head, gzip, lzham, brotli, bzip2, lzma, zstd, and I think lz4f support flushing. bzip2 probably won't perform very well if you do a lot of flushing, and if this is a new system there probably isn't much reason to use gzip or lzma (zstd outperforms gzip, and brotli and lzham outperform lzma in pretty much every way).

    That said, if you're just trying to avoid data loss due to crashes in your code (i.e., you want to preserve data when your program crashes but you're not too worried about the OS crashing), you might want to consider splitting compression and I/O code out to a separate process. At that point, you end up with something similar to syslog, or newer structured logging APIs like journald, ASL, or the astonishingly unpleasant Windows Event Log API.
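
    If you go that route, the client side can stay very small; here is a minimal sketch using the POSIX syslog(3) interface (the identifier and message text are just examples):

        #include <syslog.h>

        int main(void)
        {
            /* Hand each message to syslogd; buffering, compression and disk
               I/O then live in a separate process from the one that crashes. */
            openlog("mylogger", LOG_PID | LOG_NDELAY, LOG_USER);
            syslog(LOG_INFO, "%s", "pump 3: pressure nominal");
            closelog();
            return 0;
        }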