phpcassandracqlcassandra-2.2nosql

Cassandra: select only latest rows


I work with following table:

CREATE TABLE IF NOT EXISTS lp_registry.domain (
    ownerid text,
    name1st text,
    name2nd text,
    name3rd text,
    registrar text,
    registered timestamp,
    expiration timestamp,
    updated timestamp,
    technologies list<text>,
    techversions list<text>,
    ssl boolean,
    PRIMARY KEY (
        (name1st, name2nd, name3rd), 
        registrar, ownerid, registered, expiration, updated
    )
);

Table isn't updated only new rows are added. Everytime crawler checks domain, new row is added.

I am performing this select:

SELECT * FROM lp_registry.domain WHERE 
    registrar = 'REG-WEDOS' AND 
    ownerid = 'FORPSI-JAF-S497436' 
ALLOW FILTERING;

But what I want in the result are only the rows with latest 'updated' value for each unique "name3rd.name2nd.name1st".

If I were in a standard SQL database, I would use nested select with MAX or GROUP BY. However, this is not supported by Cassandra (MAX(), DISTINCT and group by in Cassandra). But what I should do in CQL?


Solution

  • Extending onto Cedric's answer (which is great advice and would consider that as the answer to accept) you would get a table structure roughly like:

    CREATE TABLE IF NOT EXISTS lp_registry.domain (
        ownerid text,
        name1st text,
        name2nd text,
        name3rd text,
        registrar text,
        registered timestamp,
        expiration timestamp,
        updated timestamp,
        technologies list<text>,
        techversions list<text>,
        ssl boolean,
        PRIMARY KEY ((registrar, ownerid), updated, name1st, name2nd, name3rd)
    ) WITH CLUSTERING ORDER BY (updated desc);
    

    When selecting data it will return rows with the most recent updated values within the partition for the registrar and ownerid you are querying.

    This query would be incredibly fast because your data will be organized on disk by registrar, owner id with rows in order by updated descending.

    This is a key concept with cassandra in that your data is organized based on how you query it. You lose flexibility in your queries, but you can feel comfortable that you are going to get great performance because you are retrieving data as it is organized. This is why denormalizing your data based on your queries is vital.

    Where things become complicated is if you wanted to retrieve the most recently updated of all data. That problem is not easily solvable with cassandra unless everything shares the same partition which has its own set of problems (example strategy using a 'dummy' partition key).