database time-series questdb

Table suspended when doing rsync


I am ingesting data into QuestDB and I need fast ingestion times and fast query times over the recent data, but also frequent queries over the historical dataset with more relaxed performance requirements.

I have set up two instances: one with better hardware and an NVMe drive, and one with an HDD for the historical queries. I ingest the data directly on the fast instance and then rsync it to the other. This mostly works, but once in a while I get problems on the slower machine, with tables being suspended and errors in the log such as:

segment /var/lib/questdb/db/channels10~33/wal609/8/_event.i does not have txn with id 319, offset=21381, indexFileSize=2560, maxTxn=318, size=21381
segment /var/lib/questdb/db/channels25~37/wal131/3/_event.i does not have txn with id 326, offset=16308, indexFileSize=2616, maxTxn=325, size=16308

Any clues what I am doing wrong?


Solution

  • During normal operations, QuestDB writes to many different files (metadata, WAL files, column data, optional column indexes...) using direct memory mapping, and flushes them to disk periodically.

    At any given moment, there might be files that are consistent in memory but not yet on disk. When we run rsync, we copy the files as they are at that moment, and some of them might not be flushed yet, which leaves the copy inconsistent.

    QuestDB Enterprise offers replication out of the box, but with QuestDB OSS there are a couple of ways to deal with this:

    a) Double writing. You can send messages to a Kafka broker and have both instances consume from the same topic. This should work fine and, with deduplication enabled, it should be very reliable: after an error, an instance can catch up by re-reading the topic without creating duplicate rows. In this case only data gets replicated, not metadata. If data is deleted from one instance, if a schema is changed, or if any other metadata changes, the other instance will not know about it.
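For option (a), deduplication has to be enabled on the same table on both instances. A minimal sketch of doing that through the REST `/exec` endpoint follows. The table name, column names, and host URLs are hypothetical, and it assumes the table is a WAL table and that the designated timestamp is part of the upsert keys.

```python
import urllib.parse
import urllib.request


def dedup_statement(table, upsert_keys):
    """Build the SQL that turns on deduplication for a WAL table."""
    keys = ", ".join(upsert_keys)
    return f"ALTER TABLE {table} DEDUP ENABLE UPSERT KEYS({keys})"


def exec_url(base_url, sql):
    """Build the URL for QuestDB's REST /exec endpoint."""
    query = urllib.parse.urlencode({"query": sql})
    return base_url.rstrip("/") + "/exec?" + query


def enable_dedup_everywhere(base_urls, table, upsert_keys,
                            open_url=urllib.request.urlopen):
    """Run the same statement on every instance so replays are idempotent."""
    sql = dedup_statement(table, upsert_keys)
    for base in base_urls:
        # Hypothetical hosts; each instance needs its HTTP port reachable.
        with open_url(exec_url(base, sql), timeout=30) as resp:
            resp.read()
```

It could be called as `enable_dedup_everywhere(["http://fast-host:9000", "http://slow-host:9000"], "channels10", ["ts", "channel"])`, where both host names and both column names are placeholders.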

    b) Treating the replication of data as a backup and restore. In this case, the replicated database will be exactly the same as the original database. To start a backup procedure, issue the CHECKPOINT CREATE command on the source instance. That flushes everything and makes sure you can copy the data consistently. At this point you can rsync to the replica and, when rsync is done, you have to issue CHECKPOINT RELEASE on the main instance. This is important, as otherwise you will be using a large amount of disk space. After the data has been rsynced, you can initiate the restore procedure on the other instance as described in the QuestDB docs.
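The sequence in option (b) could be scripted roughly as below. This is a sketch, not a tested backup tool: it assumes the source instance exposes the REST `/exec` endpoint on port 9000 and accepts the CHECKPOINT statements there, and the rsync flags are only illustrative. The `try`/`finally` is the important part, because it releases the checkpoint even if rsync fails.

```python
import subprocess
import urllib.parse
import urllib.request


def run_sql(sql, base_url="http://localhost:9000"):
    """Send one SQL statement to the source instance via REST /exec."""
    url = base_url + "/exec?" + urllib.parse.urlencode({"query": sql})
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read().decode("utf-8")


def run_rsync(src, dest):
    """Copy the database root; raises CalledProcessError if rsync fails."""
    # Flags are an assumption; adjust them to your own setup.
    subprocess.run(["rsync", "-a", src, dest], check=True)


def checkpoint_and_copy(src, dest, sql=run_sql, copy=run_rsync):
    """CHECKPOINT CREATE, copy the files, then always CHECKPOINT RELEASE."""
    sql("CHECKPOINT CREATE")
    try:
        copy(src, dest)
    finally:
        # Without the release, the source keeps using extra disk space.
        sql("CHECKPOINT RELEASE")
```

For example, `checkpoint_and_copy("/var/lib/questdb/db/", "replica-host:/var/lib/questdb/db/")`, where `replica-host` is a placeholder. The restore step on the replica is not covered here and still has to follow the QuestDB docs.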