I have a Cassandra table similar to the one below:
CREATE TABLE IF NOT EXISTS timebased_dev.sample_table (
    operation_id uuid,
    job_id uuid,
    timestamp timestamp,
    depth double,
    depth_is_null int,
    c0 double,
    c0_is_null int,
    c1 double,
    c1_is_null int,
    c2 double,
    c2_is_null int,
    ...
    c499 double,
    c499_is_null int,
    PRIMARY KEY ((operation_id, job_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
I would like to group by the timestamp using floor. I am expecting data every 500 ms, which works out to around 500k rows every three days. I would like to group the rows into 5-minute intervals and compute an average for each column. Currently the query takes around 100 seconds. What am I doing wrong?
I tried using the built-in aggregate functions AVG and SUM; I need the average for some columns and the sum for others. I am not using ALLOW FILTERING. I have a single partition for the 500k rows. Should I use multiple partitions? Will that improve my query?
Here is a sample query I intend to run:
SELECT floor(timestamp, 5m),
AVG(c0),
SUM(c0_is_null),
AVG(c1),
SUM(c1_is_null),
AVG(c2),
SUM(c2_is_null),
AVG(c3),
SUM(c3_is_null),
AVG(c4),
SUM(c4_is_null),
AVG(c5),
SUM(c5_is_null),
AVG(c6),
SUM(c6_is_null),
AVG(c7),
SUM(c7_is_null),
AVG(c8),
SUM(c8_is_null),
AVG(c9),
SUM(c9_is_null),
AVG(c10),
SUM(c10_is_null),
AVG(c11),
SUM(c11_is_null),
AVG(c12),
SUM(c12_is_null),
AVG(c13),
SUM(c13_is_null),
AVG(c14),
SUM(c14_is_null),
AVG(c15),
SUM(c15_is_null),
AVG(c16),
SUM(c16_is_null),
AVG(c17),
SUM(c17_is_null),
AVG(c18),
SUM(c18_is_null),
AVG(c19),
SUM(c19_is_null)
FROM timebased_dev.sample_table
WHERE operation_id = e2c5f52e-e3c1-40f8-aa96-ed8af74237b0
AND job_id = 47913ea2-8c7b-44d3-a81f-46fbcd0c4bc9
AND timestamp > '2024-01-15 22:34:51.9050+07:00'
AND timestamp < '2024-01-18 22:34:51.9050+07:00'
GROUP BY floor(timestamp, 5m);
The behaviour you are seeing is expected. Your query is not OLTP but analytical in nature, and it is very expensive given the following factors:

- All ~500k rows live in a single partition, so one replica serves the entire read and there is no parallelism across nodes.
- Aggregation is not pushed down to the storage layer: the coordinator has to pull every row in the three-day range and compute the GROUP BY plus all 40 aggregates (20 AVGs and 20 SUMs) itself.

So yes, splitting the data across multiple partitions (for example, by adding a time bucket to the partition key) would help: it spreads the rows across replicas and lets you query buckets in parallel, at the cost of merging the partial results client-side. For this kind of workload you should also consider using Spark, since it distributes the scan and aggregation. Regardless, aggregating 500k wide rows will never be cheap. Cheers!
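To illustrate the multi-partition approach: give each (operation_id, job_id, day) its own partition, query each day in parallel, then merge the partial results on the client. The one pitfall is averages — you cannot average per-partition averages when counts differ, so merge sums and counts instead. The helper names below (`day_bucket`, `merge_avg`) are hypothetical, just to sketch the idea:

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Hypothetical partition bucket: one partition per (operation_id, job_id, day)."""
    return ts.strftime("%Y-%m-%d")

def merge_avg(partials):
    """Combine per-partition (sum, count) pairs into one overall average.
    Averaging the per-partition averages would be wrong when counts differ."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# A 5-minute interval that spans two day-partitions:
# partition A contributed sum=10.0 over 4 rows, partition B sum=2.0 over 1 row.
print(day_bucket(datetime(2024, 1, 15, 22, 34, 51, tzinfo=timezone.utc)))  # 2024-01-15
print(merge_avg([(10.0, 4), (2.0, 1)]))  # 2.4
```

SUM columns merge trivially (just add the partial sums); only AVG needs the (sum, count) bookkeeping.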