java cassandra hector

What is the way to count the number of columns in a row in Cassandra?


What is the best practice for counting the number of columns in a row? I looked for a simple way to do this but found no information about it. I could probably run a query for it, but that may be too slow. I also know about the counter-column approach, but I have no idea how it would work with columns that have a TTL. In summary, I need some method in Hector to check that a row has no more than 5 columns in one minute. Thanks for your advice, and sorry for my bad English and the noob question.


Solution

  • Don't really think counters would be useful for this.

    Are you trying to count the number of columns, or the number of "rows" inside a single "multi partition row"? Say your table looks like this:

    create table foo ( id text, colid text, somethingcol text, othercol text, primary key (id, colid));

    Are you trying to count the number of entries for id='something specific'?

    Assuming that you are, you can add a timestamp column as the first clustering key like this:

    create table foo ( id text, ts timestamp, colid text, somethingcol text, othercol text, primary key (id, ts, colid));

    And run a query like: select count(*) from foo where id='theId' and [ts condition here];

    If you're looking to aggregate to see if there is ANY case where there are more than five entries in any one minute, then a different structure might be appropriate (e.g. you have the minute as a clustering key and entries within it inside a SET column). Whether you can use that for your original purpose is something you'll need to check.

    If you don't have too many entries in a partition, it might be possible to simply query the timestamps for a partition and do the grouping and the check client side. Spark is another possibility (maybe with Shark) if you want more sophisticated aggregation. Of course, that means having access to a bit more infrastructure.

    Does that help?

    UPDATE: Cassandra maintains a timestamp for each cell automatically. You can access that in a query:

    SELECT id, bar, baz, writetime(bar) from foo where ...;

    This returns the last-updated timestamp of the bar column for each entry. The timestamp is assigned by the coordinator unless the client specifies it during the write (on insert, you can add USING TIMESTAMP n if you wish to specify it). It's worth noting this is the behaviour when using CQL, not Thrift.
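
The client-side grouping mentioned in the answer could look roughly like the sketch below. It is only an illustration, not Hector or driver code: it assumes the timestamps of one partition have already been fetched into a list of epoch milliseconds, and the class and method names (MinuteRateCheck, countPerMinute, exceedsLimit, writetimeToMillis) are made up for this example. Values returned by writetime() are in microseconds, so they would need converting first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: none of these names come from Hector or Cassandra.
class MinuteRateCheck {
    static final long MINUTE_MS = 60_000L;

    // Groups epoch-millisecond timestamps into calendar-minute buckets
    // and counts how many entries fall into each bucket.
    public static Map<Long, Integer> countPerMinute(List<Long> timestampsMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long bucket = Math.floorDiv(ts, MINUTE_MS);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    // Returns true if any single minute bucket holds more than `limit` entries.
    public static boolean exceedsLimit(List<Long> timestampsMs, int limit) {
        for (int count : countPerMinute(timestampsMs).values()) {
            if (count > limit) {
                return true;
            }
        }
        return false;
    }

    // writetime() values are microseconds since the epoch; convert to milliseconds.
    public static long writetimeToMillis(long writetimeMicros) {
        return Math.floorDiv(writetimeMicros, 1000L);
    }

    public static void main(String[] args) {
        // Six entries in the first minute, one in the second.
        List<Long> ts = Arrays.asList(0L, 10_000L, 20_000L, 30_000L, 40_000L, 50_000L, 61_000L);
        System.out.println(exceedsLimit(ts, 5)); // prints true
    }
}
```

Note that this uses fixed calendar-minute buckets. If "five per minute" is meant as a sliding 60-second window, entries that straddle a minute boundary would need a different check, for example sorting the timestamps and comparing each one with the entry five positions later.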