cassandra, cql, cql3

Cassandra preventing duplicates


I have a simple table distributed by userId:

create table test (
  userId uuid,
  placeId uuid,
  visitTime timestamp,
  primary key(userId, placeId, visitTime)
) with clustering order by (placeId asc, visitTime desc);

Each (userId, placeId) pair can have at most one visit. visitTime is just data associated with the pair, used for sorting in queries like select * from test where userId = ? order by visitTime desc.

How can I require (userId, placeId) to be unique? I need to make sure that

insert into test (userId, placeId, visitTime) values (?, ?, ?)

won't insert a second visit for (userId, placeId) with a different time. Checking for existence before inserting isn't atomic; is there a better way?


Solution

  • Let me make sure I understand: if the pair (userId, placeId) must be unique (meaning there can never be two rows with the same pair), what is visitTime doing in the primary key? And why would you query with order by visitTime desc if each pair has only one row?

    If what you need is to prevent duplicates, you have two options. Both operate on the full primary key, so they only protect the (userId, placeId) pair if visitTime is not part of the key.

    1 - Lightweight transaction: adding IF NOT EXISTS to the insert does what you want. But as I explained here, lightweight transactions are really slow because of the special handling Cassandra gives them (a Paxos consensus round among the replicas, which costs extra round trips).

    2 - USING TIMESTAMP write-time enforcement (be careful with it!***): Cassandra keeps the value with the highest write timestamp, so the 'trick' is to force a decreasing TIMESTAMP on every write. The first write then always wins.

    Let me give an example:

    INSERT INTO users (uid, placeid, visittime, otherstuffs) VALUES (1, 2, 1000, 'PLEASE DO NOT OVERWRITE ME') USING TIMESTAMP 100;
    

    This produces the following output:

    select * from users;
    
     uid | placeid | otherstuffs                | visittime
    -----+---------+----------------------------+-----------
       1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000
    

    Let's now decrease the timestamp:

    INSERT INTO users (uid, placeid, visittime, otherstuffs) VALUES (1, 2, 2000, 'I WANT OVERWRITE YOU') USING TIMESTAMP 90;
    

    The data in the table has not been updated, because a write with a higher timestamp (100) already exists for the pair (uid, placeid). The output is unchanged:

    select * from users;
    
     uid | placeid | otherstuffs                | visittime
    -----+---------+----------------------------+-----------
       1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000
    

    If performance matters, use solution 2; if it doesn't, use solution 1. For solution 2 you can compute a decreasing timestamp for each write by subtracting the current system time in milliseconds from a fixed number,

    e.g.:

    // Fixed constant minus the current time: later writes get smaller timestamps.
    long decreasingTimestamp = 2_000_000_000_000L - System.currentTimeMillis();
    

    *** This solution can lead to unexpected behaviour if, for instance, you want to delete data and then reinsert it. Once you delete data, you can write it again only if the new write has a higher timestamp than the deletion (if you don't specify one, the timestamp used is the machine's current time).
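To make the reasoning behind solution 2 and its caveat easier to follow, here is a minimal sketch in plain Java. It is a toy in-memory model, not the Cassandra driver or Cassandra's real storage engine: the class `LastWriteWinsModel` and its methods are hypothetical names invented for illustration, and it assumes the simplest possible rule (a write is visible only if its timestamp is strictly higher than what is already stored). Only the `decreasingTimestamp` formula comes from the answer above.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy in-memory model of timestamp-based reconciliation. This is NOT the
 * Cassandra driver or Cassandra's storage engine; all names are hypothetical.
 */
class LastWriteWinsModel {

    /** One stored version per key; a null value stands for a deletion marker (tombstone). */
    private static final class Version {
        final String value;
        final long timestamp;

        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Version> rows = new HashMap<>();

    /**
     * Stores the write only if its timestamp is strictly higher than the stored one
     * and returns true when the write becomes the visible version.
     * Simplification: real Cassandra accepts every write and resolves the winner
     * later, and it has its own rules for equal timestamps.
     */
    boolean insert(String key, String value, long timestamp) {
        Version current = rows.get(key);
        if (current != null && current.timestamp >= timestamp) {
            return false; // shadowed by a stored version with a higher-or-equal timestamp
        }
        rows.put(key, new Version(value, timestamp));
        return true;
    }

    /** A delete is modelled as writing a tombstone that carries its own timestamp. */
    boolean delete(String key, long timestamp) {
        return insert(key, null, timestamp);
    }

    /** Returns the visible value, or null if the key is absent or deleted. */
    String read(String key) {
        Version current = rows.get(key);
        return current == null ? null : current.value;
    }

    /** Same formula as in the answer: a fixed constant minus the clock in milliseconds. */
    static long decreasingTimestamp(long nowMillis) {
        return 2_000_000_000_000L - nowMillis;
    }

    public static void main(String[] args) {
        LastWriteWinsModel table = new LastWriteWinsModel();
        long now = System.currentTimeMillis();
        String key = "uid=1,placeid=2";

        // First write wins: the later write gets a smaller timestamp and is shadowed.
        System.out.println(table.insert(key, "first visit", decreasingTimestamp(now)));
        System.out.println(table.insert(key, "second visit", decreasingTimestamp(now + 500)));
        System.out.println(table.read(key));

        // A delete stamped with the ordinary clock (microseconds since the epoch) is far
        // larger than any decreasing timestamp, so the reinsert below stays shadowed.
        table.delete(key, now * 1000);
        System.out.println(table.insert(key, "third visit", decreasingTimestamp(now + 1000)));
    }
}
```

In this model the second and third inserts are both rejected: the second because its timestamp is lower than the first write's, the third because the tombstone carries a much higher timestamp. To reinsert after a delete you would have to give the delete an explicit timestamp that fits the same decreasing scheme, which is the caveat described above.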

    HTH,
    Carlo