cassandradata-modelingcql

Cassandra order issues


I encounter a problem sorting data on Cassandra using CLUSTERING ORDER BY. The returned data is not sorted as expected with the following conf:

Any idea why the result is not sorted as expected ?

CREATE KEYSPACE IF NOT EXISTS foo WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

CREATE TABLE IF NOT EXISTS foo.keyword (
keyword_id int,
keyword_text text,
PRIMARY KEY (keyword_text, keyword_id)
)WITH CLUSTERING ORDER BY(keyword_id DESC);

I add 5 records

INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (1,'bar1');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (2,'bar2');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (3,'bar3');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (4,'bar4');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (5,'bar5');

And SELECT :

SELECT keyword_id FROM foo.keyword;

Obtained result :

2
4
3
1
5

Expected :

5
4
3
2
1

Solution

  • Any idea why the result is not sorted as expected ?

    Yes. Cassandra supports ordering within a partition. When data is SELECTed without a WHERE clause, the result set is returned in order of the hashed token values of the partition key. In this case, the partition key is keyword_text.

    Let's try altering that SELECT to execute the token function on keyword_text:

    SELECT keyword_id, token(keyword_text) FROM foo.keyword;
    
     keyword_id | system.token(keyword_text)
    ------------+----------------------------
              2 |       -5699002630445609302
              4 |       -5246443235113384067
              3 |        1880591418861795474
              1 |        7350118602498431585
              5 |        7355007313553950035
    
    (5 rows)
    

    As we can see, the order is indeed indicated by the hashed token value of keyword_text. So, that's why the results are returned in this particular order.

    Ok, so how do we fix it? Well, we'll need to choose a different partition key. I'm not sure about the underlying use case here, so we'll just use today's date to "bucket" the data into a partition named day_bucket.

    CREATE TABLE keyword (
      day_bucket int,
      keyword_id int,
      keyword_text TEXT,
      PRIMARY KEY (day_bucket, keyword_id)
    ) WITH CLUSTerING ORDER BY (keyword_id DESC);
    

    Next, we'll INSERT some data:

    INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 1,'bar1');
    INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 2,'bar2');
    INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 3,'bar3');
    INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 4,'bar4');
    INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 5,'bar5');
    

    And then rerun our query using today's date in the WHERE clasue:

    SELECT keyword_id FROM keyword WHERE day_bucket = 20240502;
    
     keyword_id
    ------------
              5
              4
              3
              2
              1
    
    (5 rows)
    

    Note that we shouldn't put all data into a single partition. That's not good for data distribution and will lead to unbound partition growth. The idea, is to find a partition key that works well with the use case.