cassandracassandra-2.0

Data Versioning in Cassandra with CQL3


I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.

I have this basic table:

CREATE TABLE documents (
    itemid_version text,       
    xml_payload text,
    insert_time timestamp,
    PRIMARY KEY (itemid_version)
); 

itemid is actually a UUID (and unique for all documents), and version is an int (version 0 is the "first" version). xml_payload is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.

As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:

I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic.

First one's straightforward:

$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

Now to satisfy the 2nd and 3rd use cases, I am adding the following table:

CREATE TABLE document_versions (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);

The new records will be added as new docs and new versions of existing docs are created.

Now we have this (use case #2):

$latest_itemid, $latest_version = csql("select itemid, 
    version from document_versions where item_id = {0} 
    order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

and this (use case #3):

$versions = csql("select version from document_versions where item_id = {0}" 
    -f $itemid)

For the 3rd requirement, I am adding yet another table:

CREATE TABLE latest_documents (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
)

The records are inserted for new docs, records are updated for existing docs.

And now we have this:

$latest_itemids, $latest_versions = csql("select itemid, version 
    from latest_documents where item_id in ({0})" -f $itemid_list.toCSV())

foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
    $itemid_version = concat($latest_itemid, $latest_version)
    $latest_docs.append(
        cql("select * from documents where itemid_version = {0};" 
        -f $itemid_version))        

Now I hope it's clear why I concatenated itemid and version to create an index for documents as opposed to creating a compound key: I cannot have OR in the WHERE clause in SELECT.

You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.

Am I on the right track here? There are quite a number of things that doesn't sit well with me...but mainly because I don't understand Cassandra yet:


Solution

  • This is actually very similar to your solution except that you can store all the versions and be able to fetch the 'latest' version just from one table (document_versions).

    In most cases, I think you can get what you want in a single SELECT except use case #2 which requires fetching the most recent version of a document where a SELECT is first needed on document_versions.

    CREATE TABLE documents (
            itemid_version text,
            xml_payload text,
            insert_time timestamp,
            PRIMARY KEY (itemid_version)
    );
    
    CREATE TABLE document_versions (
            itemid text,
            version int,
            PRIMARY KEY (itemid, version)
    ) WITH CLUSTERING ORDER BY (version DESC);
    
    
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
    INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
    
    INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
    INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
    INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
    INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);