In Martin Fowler's write-up of the LMAX Disruptor architecture, he says:
The journaler's job is to store all the events in a durable form, so that they can be replayed should anything go wrong. LMAX does not use a database for this, just the file system. They stream the events onto the disk.
I'm curious what the implementation of the file system based event log looks like in practice. The following answer says that it is written to a "raw file", but I'm interested in the actual details that one might implement for a production system. Is it literally a raw text file containing a structured log that is continuously appended to? Or is it some sort of binary format? Are there any critical design decisions that go into this component of the system?
The journaller, as suggested, needs to hold two pieces of information: the event itself, exactly as received, and an identifier (for example, a sequence number) that tracks your position in the journal, so that during replay you can choose the record to start from.
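As an illustration of those two pieces of information, here is a minimal sketch of one possible binary record layout. The layout is a hypothetical assumption, not LMAX's actual format: a fixed header holding an 8-byte sequence number, a 4-byte payload length and a 4-byte CRC32, followed by the event bytes untouched. `encode_record` and `decode_record` are illustrative names.

```python
import struct
import zlib

# Hypothetical header layout (big-endian): 8-byte sequence number,
# 4-byte payload length, 4-byte CRC32 of the payload. 16 bytes total.
HEADER = struct.Struct(">QII")


def encode_record(sequence: int, payload: bytes) -> bytes:
    """Frame one event: the header followed by the payload exactly as received."""
    return HEADER.pack(sequence, len(payload), zlib.crc32(payload)) + payload


def decode_record(buf: bytes, offset: int = 0):
    """Return (sequence, payload, next_offset) for the record starting at offset.

    Raises ValueError if the payload is truncated or fails its checksum,
    which is how a torn write at the end of the journal would show up.
    """
    sequence, length, crc = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    payload = bytes(buf[start:start + length])
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError(f"corrupt or truncated record at offset {offset}")
    return sequence, payload, start + length
```

A binary length prefix means the reader never has to parse or escape the event itself; the payload comes back byte for byte. The checksum is an optional extra, included here on the assumption that you want to detect a partially written final record after a crash.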
The storage format is ultimately your decision; however, the following considerations apply:
Replays may need to be triggered not just by system crashes but also by bugs in your own code. The less you manipulate the input message byte stream, the better. Any manipulation of the byte stream introduces a chance of bugs and makes your replay logic very different from simply "dropping the bytes back into the input buffer." To me, this is probably the biggest decision.
Replays should be quick and should not contain business logic. A file format that lets your storage device write sequentially, without the back-and-forth seeking that a database with indexes would require, will perform better. The more layers there are between your ring buffer input and your storage layer, the slower things will be.
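To make the "sequential, no business logic" and "drop bytes back into the input buffer" points concrete, the following is a minimal sketch of an append-only journal with a front-to-back replay. The framing (8-byte sequence number plus 4-byte length) and the `append`/`replay` helpers are hypothetical; `handle` stands in for whatever function consumes live input in your system, and replay simply feeds it the stored bytes unchanged.

```python
import struct
from typing import BinaryIO, Callable

# Hypothetical framing: 8-byte sequence number + 4-byte payload length (big-endian).
HEADER = struct.Struct(">QI")


def append(journal: BinaryIO, sequence: int, payload: bytes) -> None:
    """Write one event at the end of the journal; existing bytes are never rewritten."""
    journal.write(HEADER.pack(sequence, len(payload)))
    journal.write(payload)


def replay(journal: BinaryIO, from_sequence: int,
           handle: Callable[[bytes], None]) -> int:
    """Scan the journal from the start, in one sequential pass with no index,
    and pass every payload at or after from_sequence to `handle` byte for byte.
    Returns the number of events replayed."""
    journal.seek(0)
    count = 0
    while True:
        header = journal.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of journal, or a header torn by a crash
        sequence, length = HEADER.unpack(header)
        payload = journal.read(length)
        if len(payload) < length:
            break  # torn final record: stop rather than replay partial bytes
        if sequence >= from_sequence:
            handle(payload)  # same bytes the live input path received
            count += 1
    return count
```

With a real file you would open it in append mode and call `flush()` followed by `os.fsync()` before treating an event as durable. If scanning from the start ever becomes too slow, one option (again an assumption, not something the original describes) is to record the byte offset reached at each checkpoint and `seek` there instead of to 0.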
Pre-allocated storage on the disk (you could even use a raw partition) lets you write the bytes from beginning to end without updating directory metadata and the file system's free-space tracking areas. This should simplify the design and improve performance. As long as the pre-allocated space is large enough to hold all data between checkpoints, you will be fine. This becomes less of a concern over time as storage devices improve.
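The pre-allocation idea can be sketched with an ordinary file rather than a raw partition. This is a minimal sketch under stated assumptions: the `PreallocatedJournal` class, its capacity, and the 12-byte header are hypothetical. It writes real zero bytes up front (rather than creating a sparse file) so the blocks are allocated at creation time, memory-maps the file, and then fills it strictly from beginning to end, so appends never grow the file.

```python
import mmap
import os
import struct

# Hypothetical framing: 8-byte sequence number + 4-byte payload length.
HEADER = struct.Struct(">QI")


class PreallocatedJournal:
    """Sketch: reserve the journal's full size once, then append front to back."""

    def __init__(self, path: str, capacity: int):
        # Fill with zeros instead of truncating, so blocks are allocated now
        # rather than lazily on first write.
        with open(path, "wb") as f:
            chunk = b"\0" * (1 << 20)
            remaining = capacity
            while remaining > 0:
                n = min(remaining, len(chunk))
                f.write(chunk[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), capacity)
        self._position = 0  # next free byte; only ever moves forward

    @property
    def position(self) -> int:
        return self._position

    def append(self, sequence: int, payload: bytes) -> None:
        end = self._position + HEADER.size + len(payload)
        if end > len(self._map):
            # Sized to hold everything between checkpoints; running out is an error.
            raise OverflowError("journal full before next checkpoint")
        HEADER.pack_into(self._map, self._position, sequence, len(payload))
        self._map[self._position + HEADER.size:end] = payload
        self._position = end

    def sync(self) -> None:
        self._map.flush()  # write dirty pages back to the file

    def close(self) -> None:
        self.sync()
        self._map.close()
        self._file.close()
```

One design consequence of zero-filling: a reader needs a way to tell where the data stops. A simple convention, assumed here rather than taken from the original, is to start sequence numbers at 1 so that an all-zero header marks the end of the written region.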