cassandra, nosql

Cassandra: poor performance when aggregating 40 columns in a 500K-row partition


I have a Cassandra table similar to the one below:

CREATE TABLE IF NOT EXISTS timebased_dev.sample_table (
    operation_id uuid,
    job_id uuid,
    timestamp timestamp,
    depth double,
    depth_is_null int,
    c0 double,
    c0_is_null int,
    c1 double,
    c1_is_null int,
    c2 double,
    c2_is_null int,
    ...
    c499 double,
    c499_is_null int,
    PRIMARY KEY ((operation_id, job_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

I would like to group rows by timestamp using the floor function. Data arrives every 500 ms, so I will have around 500K rows every three days. I want to group by 5-minute intervals and compute the average of each column. Currently the query takes around 100 seconds. What am I doing wrong?

I tried using the built-in functions AVG and SUM; I need the average for some columns and the sum for others. I am not using ALLOW FILTERING. All 500K rows are in a single partition. Should I use multiple partitions? Would that improve my query?

Here is a sample query I intend to run:

SELECT floor(timestamp, 5m),
  AVG(c0),
  SUM(c0_is_null),
  AVG(c1),
  SUM(c1_is_null),
  AVG(c2),
  SUM(c2_is_null),
  AVG(c3),
  SUM(c3_is_null),
  AVG(c4),
  SUM(c4_is_null),
  AVG(c5),
  SUM(c5_is_null),
  AVG(c6),
  SUM(c6_is_null),
  AVG(c7),
  SUM(c7_is_null),
  AVG(c8),
  SUM(c8_is_null),
  AVG(c9),
  SUM(c9_is_null),
  AVG(c10),
  SUM(c10_is_null),
  AVG(c11),
  SUM(c11_is_null),
  AVG(c12),
  SUM(c12_is_null),
  AVG(c13),
  SUM(c13_is_null),
  AVG(c14),
  SUM(c14_is_null),
  AVG(c15),
  SUM(c15_is_null),
  AVG(c16),
  SUM(c16_is_null),
  AVG(c17),
  SUM(c17_is_null),
  AVG(c18),
  SUM(c18_is_null),
  AVG(c19),
  SUM(c19_is_null)
FROM timebased_dev.sample_table
WHERE operation_id = e2c5f52e-e3c1-40f8-aa96-ed8af74237b0
  AND job_id = 47913ea2-8c7b-44d3-a81f-46fbcd0c4bc9
  AND timestamp > '2024-01-15 22:34:51.9050+07:00'
  AND timestamp < '2024-01-18 22:34:51.9050+07:00'
GROUP BY floor(timestamp, 5m);

Solution

  • The behaviour you are seeing is expected. Your query is analytics rather than OLTP in nature, and it is very expensive considering the following factors:

    All ~500K rows live in a single partition, so one replica has to read the entire partition; the coordinator then computes 40 aggregates (20 AVGs and 20 SUMs) per 5-minute group on a single node, with no parallelism across the cluster.

    You should consider using Spark for analytics queries, since it can parallelise the scan and the aggregation across the cluster. Regardless, the query will still be costly to run given the dataset size. Cheers!
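Splitting the data across multiple partitions, as the question suggests, would also spread the read load. The sketch below is one possible approach, not a drop-in fix: the time_bucket column, its daily granularity, and the table name are my assumptions. With one partition per day, a three-day query touches three partitions of roughly 170K rows each instead of one 500K-row partition, and the per-day queries can be issued in parallel from the client:

CREATE TABLE IF NOT EXISTS timebased_dev.sample_table_bucketed (
    operation_id uuid,
    job_id uuid,
    time_bucket date,       -- hypothetical column: one partition per day
    timestamp timestamp,
    depth double,
    depth_is_null int,
    c0 double,
    c0_is_null int,
    -- ... remaining c1 .. c499 columns as in the original table ...
    PRIMARY KEY ((operation_id, job_id, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Run one query per day-bucket (in parallel from the client) and
-- concatenate the results:
SELECT floor(timestamp, 5m),
  AVG(c0),
  SUM(c0_is_null)
FROM timebased_dev.sample_table_bucketed
WHERE operation_id = e2c5f52e-e3c1-40f8-aa96-ed8af74237b0
  AND job_id = 47913ea2-8c7b-44d3-a81f-46fbcd0c4bc9
  AND time_bucket = '2024-01-16'
  AND timestamp > '2024-01-15 22:34:51.9050+07:00'
  AND timestamp < '2024-01-18 22:34:51.9050+07:00'
GROUP BY floor(timestamp, 5m);

Since 5-minute intervals align with midnight boundaries, a 5-minute group never spans two daily buckets, so the per-bucket result sets can simply be concatenated client-side without re-merging groups.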