I'm new to Cassandra and am building a chat application. I need to store the chat messages in a database and plan to use Cassandra, since it allows fast writes. My data model for the "Messages" table is the following:
message_id
from_user_id
to_user_id
channel_id
message_text
Now, I want to shard this so that the chat history for a particular channel (1:1 or group chat) is accessible from a single shard. Ideally, I would like to shard the DB based on channel_id. Here are the challenges I have:
Here are my questions :-
Could someone help me understand how best to model this? And please correct me if my understanding is wrong. (As mentioned I'm new to Cassandra and non-relational DBs in general)
In Cassandra, a table's primary key is made of two parts: a partition key (either a simple key made of one column or a composite key made of multiple columns) and optional clustering keys (aka sort keys).
The partition key is responsible for data distribution across your nodes, i.e. it determines the "shard".
The clustering key is responsible for data sorting within the partition.
Example:
create table example (
col_A text,
col_B text,
col_C text,
col_D text,
col_E text,
col_F text,
PRIMARY KEY((col_A, col_B), col_C, col_D)
);
In this example, values of col_A and col_B determine the partition, while col_C and col_D are still part of the primary key, but only define the order of data within that partition.
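To make the two roles concrete, here is a rough mental model in Python. It is only a sketch, not how Cassandra actually stores data: the rows and the `build_partitions` helper are made up for illustration. Rows are grouped by the partition key (col_A, col_B) and then sorted inside each group by the clustering columns (col_C, col_D).

```python
from collections import defaultdict

# Made-up sample rows for the "example" table; the values are illustrative only.
rows = [
    {"col_A": "a1", "col_B": "b1", "col_C": "c2", "col_D": "d1", "col_E": "x"},
    {"col_A": "a1", "col_B": "b1", "col_C": "c1", "col_D": "d2", "col_E": "y"},
    {"col_A": "a1", "col_B": "b1", "col_C": "c1", "col_D": "d1", "col_E": "z"},
    {"col_A": "a2", "col_B": "b1", "col_C": "c1", "col_D": "d1", "col_E": "w"},
]


def build_partitions(rows):
    """Toy model of PRIMARY KEY((col_A, col_B), col_C, col_D)."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition key decides which partition the row belongs to.
        partition_key = (row["col_A"], row["col_B"])
        partitions[partition_key].append(row)
    for partition_rows in partitions.values():
        # The clustering columns only decide the order inside a partition.
        partition_rows.sort(key=lambda r: (r["col_C"], r["col_D"]))
    return dict(partitions)


partitions = build_partitions(rows)
for key, partition_rows in partitions.items():
    print(key, [(r["col_C"], r["col_D"]) for r in partition_rows])
```

The three rows sharing ("a1", "b1") end up together and ordered by col_C, then col_D, while the ("a2", "b1") row forms its own partition.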
In your case, you can have channel_id as the partition key, and then message_id as the clustering key that would sort records within that partition (shard).
A table might look like this:
CREATE TABLE messages (
channel_id text,
message_id text,
from_user_id text,
message_text text,
to_user_id text,
PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);
Example data:
select * from messages;
channel_id | message_id | from_user_id | message_text | to_user_id
------------+------------+--------------+--------------+------------
1 | m-1 | u-1 | foo | u-2
1 | m-2 | u-2 | bar | u-1
2 | m-1 | u-11 | foo bar | u-10
2 | m-10 | u-10 | bar | u-11
2 | m-11 | u-11 | foo | u-10
select * from messages where channel_id = '1';
channel_id | message_id | from_user_id | message_text | to_user_id
------------+------------+--------------+--------------+------------
1 | m-1 | u-1 | foo | u-2
1 | m-2 | u-2 | bar | u-1
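One thing worth noticing in the output for channel 2 is that m-10 and m-11 come right after m-1. Because message_id is a text column, the clustering order compares the ids as strings, not as numbers. The small sketch below reproduces that with plain Python string sorting (for ASCII ids this matches byte-wise comparison); the numeric sort is a hypothetical alternative shown only for contrast.

```python
message_ids = ["m-1", "m-2", "m-10", "m-11"]

# String comparison: "m-10" and "m-11" sort before "m-2" because "1" < "2".
as_text = sorted(message_ids)

# Hypothetical alternative: order by the numeric suffix after "m-".
as_numbers = sorted(message_ids, key=lambda m: int(m.split("-")[1]))

print(as_text)
print(as_numbers)
```

If messages should come back in numeric or chronological order, the id needs a type or format that sorts that way, for example a zero-padded string or a time-based type such as timeuuid.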
The thing that determines data placement on the server side based on partition key values is called a partitioner and is configurable in the cassandra.yaml file on the server.
Basically, a partitioner is a function for deriving a token representing a row from its partition key, typically by hashing. Each row of data is then distributed across the cluster by the value of the token.
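As a simplified illustration of that idea, the sketch below hashes a partition key to a token and picks a node from a small token ring. Everything here is an assumption made for the example: the three node names, the token boundaries, and the use of MD5 (Cassandra's default partitioner uses Murmur3, which is not in Python's standard library). It also ignores replication and virtual nodes.

```python
import bisect
import hashlib

# Hypothetical three-node ring: each node owns the tokens up to and including its own.
RING = [
    (2**128 // 3, "node-a"),
    (2 * 2**128 // 3, "node-b"),
    (2**128 - 1, "node-c"),
]
RING_TOKENS = [token for token, _ in RING]
RING_NODES = [node for _, node in RING]


def token_for(partition_key: str) -> int:
    """Hash the partition key to a 128-bit token (MD5 used as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def node_for(partition_key: str) -> str:
    """Pick the first node whose token is >= the key's token."""
    index = bisect.bisect_left(RING_TOKENS, token_for(partition_key))
    return RING_NODES[index % len(RING_NODES)]


# Only channel_id (the partition key) influences placement; message_id does not.
messages = [("1", "m-1"), ("1", "m-2"), ("2", "m-1"), ("2", "m-10"), ("2", "m-11")]
for channel_id, message_id in messages:
    print(channel_id, message_id, node_for(channel_id))
```

Since the token depends only on channel_id, every message of a channel maps to the same node in this model, which is the property the question asks for.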
Cassandra offers the following partitioners, which can be set in the cassandra.yaml file: Murmur3Partitioner (the default), RandomPartitioner, and ByteOrderedPartitioner.
You can read more about partitioners here: https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/architecture/archPartitionerAbout.html