cassandra cassandra-2.2

Defining a partition key in Cassandra


I'm playing around with Cassandra for the first time and I feel like I understand the basics and limits. I'm working with the following model, as an example, for storing tweets collected by hashtag.

create table posts
(
    id text,
    status text,
    service text,
    hashtag text,
    username text,
    caption text,
    image text,
    link text,
    repost boolean,
    created timestamp,
    primary key (hashtag, created)
);

This works very well for the type of query I need:

select * from posts where hashtag = 'demo' order by created desc;

However, if I understand things correctly, there is an upper limit to the number of posts I can store under the single 'demo' partition key, and more importantly, the entire set of posts for the 'demo' partition would have to be stored on each replica. It seems I should use a more random or variable partition key (maybe the id of the post), but I don't know what to use that wouldn't break the query I need.

If I use id as the partition key (e.g. PRIMARY KEY (id, created)) and add a secondary index on the hashtag column, I get the following error when I run my query:

ORDER BY with 2ndary indexes is not supported.

I get that to use ORDER BY, the partition key must be restricted in the WHERE clause, hence my original thought to use hashtag.

Am I overthinking things or is there a better candidate for the partition key?


Solution

  • The direction you go would depend on what volume of writes you expect and how big your cluster is.

    If you have a small user community and a small cluster, then you might be overthinking things. A partition can theoretically hold up to 2 billion cells (column values), which for a table this wide still works out to hundreds of millions of rows. That's a big number, and would anyone actually want to view more than a few thousand of the most recent tweets for a hashtag? You'd probably also have some kind of cleanup mechanism, such as a TTL that deletes tweets after some amount of time, which frees up space in the partition and keeps you well below that limit.
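
    For example, a table-level default TTL would make old tweets expire automatically (the 30-day value here is just illustrative; a per-write USING TTL clause works too):

    alter table posts with default_time_to_live = 2592000;  -- 30 days, in seconds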

    If you don't want to clean up old tweets, but want to preserve them for many years, then you might want to use a compound partition key like this:

    primary key ((hashtag, year), created)
    

    This would partition the tweets by tag and year, so each tag starts a fresh partition every year and stays comfortably below the per-partition limit.
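
    A sketch of what that table and query could look like; the year column is new and would be filled in by your application at write time, and the columns not shown stay as in your original table:

    create table posts
    (
        hashtag text,
        year int,
        created timestamp,
        id text,
        status text,
        -- ... remaining columns as before ...
        primary key ((hashtag, year), created)
    );

    select * from posts where hashtag = 'demo' and year = 2016 order by created desc;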

    The nice thing about partitioning by hashtag is that Cassandra can keep the tweets for a tag sorted by the creation timestamp, making it easy to retrieve the most recent ones with a single query as you've shown.

    But if your user community is big, then the bigger concern is avoiding hot spots. If you use just hashtag and a time bin like year for the partition key, then all reads and writes for a hashtag go to the small set of replicas that own its partition. If a hashtag is very active on a given day, you've got all of its reads and writes hitting just a node or two, depending on your replication factor.

    If you want to spread out the read and write load, you need to increase the cardinality of the partition key so that a single hashtag maps to multiple nodes. Using id as the partition key would achieve this, but it goes too far: every tweet would land in its own partition, and you'd lose the sorting and any easy way to retrieve the most recent tweets for a hashtag.

    So a better approach is to create separate bins or buckets, like this:

    primary key ((hashtag, bin), created)
    

    The number of bins you create depends on your write load. Let's say you decide that ten nodes can handle the write load for a hot hashtag; bin would then be a value from 0 to 9.

    There are a number of ways to set the bin number. You could take the id modulo 10, pick a random number between 0 and 9, or hash some combination of fields and take the result modulo 10. Whatever method you choose, make sure the values 0 through 9 are equally likely so that your data is spread evenly across the bin partitions.
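
    As a rough sketch of the hashing approach (Python here just for illustration; the choice of id as input and ten bins are assumptions):

    import hashlib

    NUM_BINS = 10

    def bin_for(post_id: str) -> int:
        # Hash a stable field so the same tweet always maps to the same bin,
        # and bins 0-9 are hit roughly uniformly.
        digest = hashlib.md5(post_id.encode('utf-8')).digest()
        return int.from_bytes(digest[:4], 'big') % NUM_BINS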

    With multiple bins, it is not as easy to retrieve the x most recent tweets for a hashtag since you need to query all the bins and merge the results. You can asynchronously issue a query for each bin of a hashtag in parallel and then merge the results on the client side. Or you can do a single query using the IN clause like this:

    select * from posts where hashtag = 'demo' and bin in (0,1,2,3,4,5,6,7,8,9) and created > ...
    

    But Cassandra won't sort the results of the single query, so you'd have to do a sort on the client side, which is slower than doing a merge of separate ordered queries.
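
    Here is a minimal sketch of the fan-out-and-merge approach, assuming the DataStax Python driver and the bucketed table above (the keyspace, contact point, and limit are illustrative):

    import heapq
    from itertools import islice
    from cassandra.cluster import Cluster

    NUM_BINS = 10  # must match the bucketing scheme used on writes

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('tweets_ks')  # hypothetical keyspace

    query = session.prepare(
        "select * from posts where hashtag = ? and bin = ? "
        "order by created desc limit ?")

    def recent_posts(hashtag, limit=50):
        # Issue one query per bin in parallel...
        futures = [session.execute_async(query, (hashtag, b, limit))
                   for b in range(NUM_BINS)]
        # ...then merge the already-sorted per-bin results on the client.
        per_bin = [list(f.result()) for f in futures]
        merged = heapq.merge(*per_bin, key=lambda row: row.created, reverse=True)
        return list(islice(merged, limit))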

    Now in many cases there will be hashtags with very little volume, so you might not want to bother using ten bins for them unless they get hot. If so, you can make the bin count dynamic in your application: use just bin 0 by default, then increase the number of bins when a tag turns out to be popular. You could use a static column in the bin 0 partition to keep track of the number of active bins for a hashtag.
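
    A rough sketch of that bookkeeping (the active_bins column name is made up for illustration):

    alter table posts add active_bins int static;

    -- write and read the bin count in the hashtag's bin 0 partition
    update posts set active_bins = 10 where hashtag = 'demo' and bin = 0;
    select active_bins from posts where hashtag = 'demo' and bin = 0 limit 1;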

    You should avoid secondary indexes here. They are very inefficient in Cassandra, especially for queries that have to fan out across many nodes.