I encounter a problem sorting data on Cassandra using CLUSTERING ORDER BY
.
The returned data is not sorted as expected with the following conf:
Any idea why the result is not sorted as expected ?
CREATE KEYSPACE IF NOT EXISTS foo WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };
CREATE TABLE IF NOT EXISTS foo.keyword (
keyword_id int,
keyword_text text,
PRIMARY KEY (keyword_text, keyword_id)
)WITH CLUSTERING ORDER BY(keyword_id DESC);
I add 5 records
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (1,'bar1');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (2,'bar2');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (3,'bar3');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (4,'bar4');
INSERT INTO foo.keyword (keyword_id, keyword_text) VALUES (5,'bar5');
And SELECT :
SELECT keyword_id FROM foo.keyword;
Obtained result :
2
4
3
1
5
Expected :
5
4
3
2
1
Any idea why the result is not sorted as expected ?
Yes. Cassandra supports ordering within a partition. When data is SELECTed without a WHERE clause, the result set is returned in order of the hashed token values of the partition key. In this case, the partition key is keyword_text
.
Let's try altering that SELECT to execute the token
function on keyword_text
:
SELECT keyword_id, token(keyword_text) FROM foo.keyword;
keyword_id | system.token(keyword_text)
------------+----------------------------
2 | -5699002630445609302
4 | -5246443235113384067
3 | 1880591418861795474
1 | 7350118602498431585
5 | 7355007313553950035
(5 rows)
As we can see, the order is indeed indicated by the hashed token value of keyword_text
. So, that's why the results are returned in this particular order.
Ok, so how do we fix it? Well, we'll need to choose a different partition key. I'm not sure about the underlying use case here, so we'll just use today's date to "bucket" the data into a partition named day_bucket
.
CREATE TABLE keyword (
day_bucket int,
keyword_id int,
keyword_text TEXT,
PRIMARY KEY (day_bucket, keyword_id)
) WITH CLUSTerING ORDER BY (keyword_id DESC);
Next, we'll INSERT some data:
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 1,'bar1');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 2,'bar2');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 3,'bar3');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 4,'bar4');
INSERT INTO keyword (day_bucket, keyword_id, keyword_text) VALUES (20240502, 5,'bar5');
And then rerun our query using today's date in the WHERE clasue:
SELECT keyword_id FROM keyword WHERE day_bucket = 20240502;
keyword_id
------------
5
4
3
2
1
(5 rows)
Note that we shouldn't put all data into a single partition. That's not good for data distribution and will lead to unbound partition growth. The idea, is to find a partition key that works well with the use case.