cassandra, cassandra-jdbc

Cassandra query timeout


We are pulling data from around 20-25 industrial motor sensors, and the data is being stored in a Cassandra database. Cassandra is currently running on a single node.

Below is the table structure

CREATE TABLE cisonpremdemo.machine_data (
    id uuid PRIMARY KEY,
    data_temperature bigint,
    data_current bigint,
    data_timestamp timestamp,
    deviceid text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND default_time_to_live = 7884000
    AND gc_grace_seconds = 100;
	
CREATE INDEX deviceid_idx ON cisonpremdemo.machine_data (deviceid);
CREATE INDEX data_timestamp_idx ON cisonpremdemo.machine_data (data_timestamp);

Data has been collected in this table for a couple of months, at roughly one reading every 5 seconds, almost 24 hours a day, so the volume of data is quite large.

I am trying to execute a date-range query using Java and .NET, and in both cases I am getting timeout errors (Cassandra failure during read query at consistency LocalOne (0 replica(s) responded over 1 required)).

The query works fine if I set a LIMIT of 100, but fails for anything above that. Some of the things I have tried:

1) Increased the query timeout.
2) Reduced gc_grace_seconds to 100 (temporarily) to eliminate any tombstones.

Query used

SELECT data_temperature AS "DATA_TEMP", data_current AS "DATA_CURRENT"
FROM machine_data
WHERE data_timestamp >= 1517402474699
  AND data_timestamp <= 1517402774699
  AND deviceid = 'BP_100'
ALLOW FILTERING;

I am not sure whether the table structure (primary key) is the wrong choice. Should it be both deviceid and timestamp?


Solution

  • The secondary indexes will almost surely fail. Secondary indexes need "not too low, not too high" cardinality (what counts as right depends on the number of nodes in the ring). It is very hard to get right, and you should really just avoid them unless you have a strong need and the data fits (for example, when cross-table consistency cannot be maintained with a denormalized table).
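
    If you go that route, dropping the existing indexes is a one-liner each (a sketch, assuming they live in the cisonpremdemo keyspace alongside the table):

    DROP INDEX IF EXISTS cisonpremdemo.deviceid_idx;
    DROP INDEX IF EXISTS cisonpremdemo.data_timestamp_idx;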

    Another thing you should never use is ALLOW FILTERING. It exists pretty much just for debugging/development and for large Spark-style jobs that read the entire dataset. It is horribly expensive and will almost always result in timeouts in the long term.

    Instead, you should create new tables, and also break them up by time so the partitions do not get too large, e.g.:

    CREATE TABLE cisonpremdemo.machine_data_by_time (
        id uuid,
        data_temperature bigint,
        data_current bigint,
        data_timestamp timestamp,
        yymm text,
        deviceid text,
        PRIMARY KEY ((deviceid, yymm), data_timestamp)
    ) WITH CLUSTERING ORDER BY (data_timestamp DESC);
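
    With this layout, your range query becomes a single-partition read in clustering order and no longer needs ALLOW FILTERING. A sketch of the rewritten query (the yymm value, here '1801' for a January 2018 range, is an assumption about how you encode the month bucket):

    SELECT data_temperature AS "DATA_TEMP", data_current AS "DATA_CURRENT"
    FROM machine_data_by_time
    WHERE deviceid = 'BP_100'
      AND yymm = '1801'
      AND data_timestamp >= 1517402474699
      AND data_timestamp <= 1517402774699;

    If a requested range spans more than one month, issue one query per yymm bucket and merge the results on the client.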
    

    When you insert your data, write to both tables. You should essentially create a table for each kind of request you have, so the data is stored in the format you need it in. Do not model your tables around how the data looks. If you do not need direct lookups by uuid, do not create the machine_data table as you have it above at all, since that is not how you are querying it.
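
    As a minimal sketch of the dual write, a logged batch keeps the two tables consistent (the sample readings and the uuid are made up; in practice the application generates the id once and binds it to both statements):

    BEGIN BATCH
        -- same application-generated id in both rows, so they refer to one reading
        INSERT INTO machine_data (id, deviceid, data_timestamp, data_temperature, data_current)
        VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'BP_100', 1517402474699, 45, 12);
        INSERT INTO machine_data_by_time (id, deviceid, yymm, data_timestamp, data_temperature, data_current)
        VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'BP_100', '1801', 1517402474699, 45, 12);
    APPLY BATCH;

    If the extra coordination cost of a logged batch matters, two separate inserts from the driver work as well, at the price of a small window where only one table has the row.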