Tags: cassandra, datastax, cql, partition, bucketing

CQL: retrieve time-series data by time range


I have sensors at different locations, each measuring multiple parameters. There will be around 2 million measurements per day per sensor. I need to query by location and time range, but the range specified will never be larger than a couple of days. An example query would be:

SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND measurementTime >= '2023-10-15T23:00:00Z' AND measurementTime <= '2023-10-16T01:00:00Z';

To control partition size I can split partitions (I've read somewhere that this is called bucketing) based on a time window - e.g. by the number of days since the epoch:

CREATE TABLE meas (
    latitude double,
    longitude double,
    dayOfMeasurement int,
    measurementTime timestamp,
    param1 double,
    param2 double,
    param3 double,
    -- etc...
    PRIMARY KEY ((latitude, longitude, dayOfMeasurement), measurementTime)
) WITH CLUSTERING ORDER BY (measurementTime DESC);
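
For reference, here's a minimal Java sketch of how the dayOfMeasurement bucket could be derived on the client (the class and helper names are mine, not from any driver API). Days since the epoch for 2023-10-16 works out to 19646, which matches the queries below.

import java.time.Instant;
import java.time.ZoneOffset;

public final class DayBuckets {
    // Day bucket = whole days since the Unix epoch, computed in UTC.
    // For example, 2023-10-16T00:30:00Z falls into bucket 19646.
    public static int dayBucket(Instant measurementTime) {
        return (int) measurementTime.atZone(ZoneOffset.UTC).toLocalDate().toEpochDay();
    }
}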

Now I have two options:

  1. Use IN in the WHERE clause - e.g. (assuming there were 19646 days from the epoch to 16 Oct 2023):
SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND dayOfMeasurement IN (19646,19645) AND measurementTime >= '2023-10-15T23:00:00Z' AND measurementTime <= '2023-10-16T01:00:00Z';
  2. Or use multiple (two in the above example) separate queries and combine the results on the client side (I am using the DataStax Cassandra driver).
SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND dayOfMeasurement=19645 AND measurementTime >= '2023-10-15T23:00:00Z' AND measurementTime < '2023-10-16T00:00:00Z';
SELECT * FROM meas WHERE latitude=46.1 AND longitude=15.1 AND dayOfMeasurement=19646 AND measurementTime >= '2023-10-16T00:00:00Z' AND measurementTime <= '2023-10-16T01:00:00Z';

Which of these two options is better? If option 2 is better, is there a standard way of combining the results of multiple queries in the DataStax Java driver? Or should I use some other data model?


Solution

  • There will be around 2 million measurements per day per sensor

    I'd be curious as to how big the partitions are. TBH, I would recommend bucketing by hour in addition to day, as that seems like too many rows per partition. But if it already works fine, then it might be ok.
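
    (To put rough numbers on that - my arithmetic, not the original poster's: 2,000,000 rows/day ÷ 24 ≈ 83,000 rows per hourly bucket, which lands just under the ~100,000 rows-per-partition guideline discussed below, while a daily bucket overshoots it by ~20x.)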

    use multiple (two in the above example) separate queries and combine the results on the client side

    This is the approach that I'd recommend (see the sketch below). The problem with the other approach (using IN) is that the order of the returned results will be inconsistent. The results will be grouped by day and sensor, and ordered by measurementTime within each group. But the order of the partitions themselves depends on their token (the hash of the partition key), which won't follow day order. So I'd say that some client-side work is necessary to ensure consistent behavior at the app level.
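
    As a minimal sketch of the client-side combining (assuming DataStax Java driver 4.x; the session setup, coordinates, and class/method names are illustrative): query the day buckets from newest to oldest and concatenate. Each partition is already sorted measurementTime DESC, so the concatenation preserves the global ordering. Note that the same overall time bounds can be bound for every bucket, since each partition only holds that day's rows anyway:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import com.datastax.oss.driver.api.core.cql.Row;
    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.util.ArrayList;
    import java.util.List;

    public final class RangeQuery {
        public static List<Row> queryRange(CqlSession session, double lat, double lon,
                                           Instant from, Instant to) {
            PreparedStatement ps = session.prepare(
                "SELECT * FROM meas WHERE latitude=? AND longitude=? "
              + "AND dayOfMeasurement=? AND measurementTime>=? AND measurementTime<=?");

            long firstDay = from.atZone(ZoneOffset.UTC).toLocalDate().toEpochDay();
            long lastDay  = to.atZone(ZoneOffset.UTC).toLocalDate().toEpochDay();

            List<Row> combined = new ArrayList<>();
            // Newest bucket first: each partition is sorted measurementTime DESC,
            // so concatenating in descending day order keeps the global ordering.
            for (long day = lastDay; day >= firstDay; day--) {
                session.execute(ps.bind(lat, lon, (int) day, from, to))
                       .forEach(combined::add);
            }
            return combined;
        }
    }

    In a real app you'd prepare the statement once and reuse it; it's re-prepared per call here only to keep the sketch self-contained.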

    Can you please elaborate a little further?

    So generally speaking, it's a good idea to keep partition sizes under 1 or 2 MB. It's also good practice to keep the number of rows per partition at or below 100,000. Operations tend to slow down once row counts or partition sizes exceed those values.
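
    (A back-of-the-envelope example, using the schema above and assuming roughly 8 bytes per double/timestamp cell before overhead: measurementTime plus three params is ~32 bytes of payload per row, so 2 million rows would put a daily partition at ~64 MB - well past the 1-2 MB guideline - while ~83,000 rows in an hourly bucket would be closer to ~2.7 MB.)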

    Now, those aren't hard, mathematical limits; just general guidelines. It's quite possible that a partition with 2 million really, really small rows could still work. I've even seen partitions sized at 10 MB or more be ok. It really comes down to payload sizes and access patterns. That's what I meant when I said above, "if it already works fine, then it might be ok."