cassandra cql cassandra-3.0 cql3

Cassandra: need for the IN clause in a messaging application


For a messaging app I have a database structure similar to:

CREATE TABLE users(
    userid text,
    name text,
    rooms list<text>,
    ...
    PRIMARY KEY (userid)
);

CREATE TABLE rooms(
    roomid text,
    members list<text>,
    createdat bigint,
    lastmessage bigint,
    ...
    PRIMARY KEY (roomid, createdat)
);

CREATE TABLE messages(
    roomid text,
    bucket int,
    messageid bigint,
    authorid text,
    ...
    PRIMARY KEY ((bucket, roomid), messageid)
);

On startup the client requests all rooms for a given user. First I query all roomids for the given user with:

SELECT rooms FROM users WHERE userid = '1234';

Then I use the IN clause to fetch all of those rooms:

SELECT * FROM rooms WHERE roomid IN ('room_1', 'room_2', ......);

and return the entities to the client.

I have read that the IN clause can put one node under a lot of pressure. I expect users to have up to a hundred rooms.

Must I split the request into single queries, or is there another way, such as changing the data model?

Why does the IN clause lead to pressure on a single node?

Thanks in advance!


Solution

  • You are correct that you should limit the number of keys in the IN() operator. I generally recommend a very low single-digit number of keys, like 2 or 3 and not much more; otherwise the coordinator comes under a lot of pressure, since it has to fire off a separate request for each key.

    You are right that you should model your data differently to get optimal performance.

    I would highly recommend creating a new table that is partitioned by user IDs:

    CREATE TABLE rooms_by_userid (
      ...
      PRIMARY KEY (userid, roomid)
    )
    

    When you query the table with:

    SELECT ... FROM rooms_by_userid WHERE userid = '1234'
    

    you'll get all of that user's rooms from a single partition, with the rows clustered by room ID. This is the best way to model your data since it is organised around what the app needs to query.

    Your current model is effectively doing a clumsy JOIN by having to query 2 tables. The way I'm proposing means that you only need to retrieve data from one table so it's really efficient. Cheers!
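
For reference, both options discussed above can be sketched from the client side. This is only a minimal sketch, assuming the DataStax Python driver (cassandra-driver) and an already-connected `session`. The function names are hypothetical, and `rooms_by_userid` is assumed to expose just the `userid` and `roomid` columns named in its primary key, since the rest of the table is not shown. The first function splits the IN() into one single-partition query per room; the second reads the user's rooms from the proposed table in a single query.

```python
def fetch_rooms_individually(session, room_ids):
    """Option 1: one single-partition query per room instead of a large IN()."""
    # Prepare once; '?' is the bind marker for prepared statements.
    select_room = session.prepare("SELECT * FROM rooms WHERE roomid = ?")

    # Send every query without blocking, so they run concurrently.
    futures = [session.execute_async(select_room, (room_id,))
               for room_id in room_ids]

    rows = []
    for future in futures:
        # result() blocks until that one query completes, or raises on failure.
        rows.extend(future.result())
    return rows


def rooms_for_user(session, userid):
    """Option 2: read all of a user's rooms from the table partitioned by userid."""
    # '%s' is the placeholder style for non-prepared statements in this driver;
    # the value is bound by the driver, not string-formatted into the query.
    query = "SELECT roomid FROM rooms_by_userid WHERE userid = %s"
    return list(session.execute(query, (userid,)))
```

With a token-aware load-balancing policy, each per-room query in the first function can be sent to a replica that owns that partition, rather than one coordinator fanning out on behalf of the whole IN() list. The second function avoids the fan-out altogether, which is why the remodelled table is the preferred fix.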