database time-series questdb exactly-once

Getting "phantom" rows that are duplicates but with different timestamps


I am inserting data into QuestDB using the Go client. I have enabled auto-flushing and I am using the HTTP transport, so retries are automatic in case of network errors or timeouts, which is convenient because the network between my clients and my server is flaky.

I have several sensors, each sending data about every 10 seconds. Most of the time all is good, but occasionally I get rows that look like duplicates except for the timestamp, which is a few milliseconds apart. I can confirm my meters don't send individual readings more often than about once every 10 seconds, so it is impossible that I am sending the duplicate rows myself. I've seen this on only a few rows out of over a billion. It is not a big issue, but it is still annoying.

The relevant code looks something like this for each row. I use the default auto-flush configuration, and the client sends data for several meters at about 200 events per second.

    err = client.Table("smart_meters").
        Symbol("device_id", "BAA-142").
        Symbol("mark_model", "E23M-V86").
        Symbol("status", "OK").
        Float64Column("energy_consumption", 415).
        Float64Column("power_factor", 0.89).
        AtNow(ctx)
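
For reference, the sender is created more or less like this (simplified; the address is just a placeholder, and I am on the v3 Go client, github.com/questdb/go-questdb-client/v3, imported as qdb):

    ctx := context.Background()

    // HTTP transport with the default auto-flushing settings; the client
    // retries automatically on recoverable errors.
    client, err := qdb.LineSenderFromConf(ctx, "http::addr=localhost:9000;")
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close(ctx)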

Any idea why I might be getting phantom rows?


Solution

  • This is probably happening because of the use of AtNow(). In the InfluxDB Line Protocol we can either send a timestamp with every event or leave it blank for the server to fill in with its own timestamp. In the QuestDB clients, the AtNow (or equivalent) API leaves the timestamp blank, so the server ingestion timestamp is used.

    This is completely fine in general, but it means the inserts are not idempotent: when the client retries sending a batch of data, the server assigns a different timestamp.

    The batches we send using the HTTP transport are per-table transactional, so if there is an error inserting any of the rows, no data will be ingested into the affected table. But if the rows have no errors, the server ingests the data; if a network error then prevents the client from receiving the response, the client retries sending the same data, again with a blank timestamp, and the rows end up inserted twice with slightly different timestamps.
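
    For completeness, those retries are driven by the client itself; if I remember the configuration string options correctly, the retry window for the HTTP transport is bounded by the retry_timeout setting (in milliseconds), along these lines:

    // Same configuration string as in the question, but making the retry
    // window explicit: retry_timeout is in milliseconds (10 s here).
    conf := "http::addr=localhost:9000;retry_timeout=10000;"
    client, err := qdb.LineSenderFromConf(ctx, conf)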

    There are two things we can do to avoid this scenario. The first is to set the timestamp ourselves on the client. If the events don't carry a native timestamp, we can use the current processing time, as in

    At(ctx, time.Now())
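
    Adapted to the snippet from the question, that would look roughly like this (just a sketch against the v3 Go client, reusing the same example values):

    // Capture the timestamp once on the client, so that if the batch is
    // retried the row carries exactly the same timestamp.
    ts := time.Now()

    err = client.Table("smart_meters").
        Symbol("device_id", "BAA-142").
        Symbol("mark_model", "E23M-V86").
        Symbol("status", "OK").
        Float64Column("energy_consumption", 415).
        Float64Column("power_factor", 0.89).
        At(ctx, ts)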

    If we do that and a batch of data is automatically re-sent by the client, the retried rows keep the same timestamp, but we would still get a duplicate row, unless we enable DEDUP on the table.

    ALTER TABLE smart_meters DEDUP ENABLE UPSERT KEYS(timestamp, device_id)

    This assumes the designated timestamp column is named timestamp and that device_id is the only additional deduplication key; the designated timestamp is always part of the UPSERT keys.

    If we add DEDUP to the table, then when two rows arrive with identical timestamp and device_id, only one row is stored (the most recently received one, in case the other columns have changed).

    This would be even more robust if we could assign the timestamp directly from the information sent by the device, since otherwise we are relying on the client-side clock; see the sketch below. But this should be enough to avoid the original problem.
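
    If the device payload already carries its own timestamp, using it could look like this (the meterEvent struct and its fields are hypothetical, just to illustrate the idea):

    // Hypothetical event as decoded from a meter's payload.
    type meterEvent struct {
        DeviceID    string
        ReportedAt  time.Time // timestamp assigned by the meter itself
        Energy      float64
        PowerFactor float64
    }

    func insertEvent(ctx context.Context, sender qdb.LineSender, ev meterEvent) error {
        // Using the device-reported timestamp keeps retries idempotent, and
        // with DEDUP enabled any retried row collapses into a single one.
        return sender.Table("smart_meters").
            Symbol("device_id", ev.DeviceID).
            Float64Column("energy_consumption", ev.Energy).
            Float64Column("power_factor", ev.PowerFactor).
            At(ctx, ev.ReportedAt)
    }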