cassandra

What exactly is cassandra partition?


I read a lot of cassandra docs, I understand that we have partition key, hash of this key used to split data between partitions to evenly distribute data between nodes.

But what exactly is partition ? Is it a table, or some subset in table, or just another calculated stuff used in order rows on node ? Is it a pure virtual thing, or some real entity that give some overhead ?

Is it better to limit amount of partitions ? For example, I can take remainder from uuid division and use it as partition key, that still equalise data between partitions, but keep partition count low, or I can just use whole uuid ?


Solution

  • Thinking of it as a subset of a table is a good start. Each node is responsible for a specific range of partitions (AKA token ranges). This is how Cassandra data is distributed across the nodes. The token ranges are calculated based on the hashed value of the partition keys, regardless of which table the data is in.

    So basically, a partition is a subset of data that is guaranteed to be on a single node. That's why we say for optimal performance, data that is queried together should be stored together (same partition).

    The only time a query would have any overhead (related to a partition), would be if it is trying to query multiple partitions. Multi-partition (partition key) queries are bad, because an exact node (containing all of the data to be returned) cannot be determined. Thus, a coordinator then does an exhaustive search throughout the cluster.

    Is it better to limit amount of partitions ?

    No. You definitely want more partitions. That will help the data be more-evenly distributed throughout the cluster. Likewise, this also helps distribute the operational activity around the cluster, and helps to protect a subset of nodes from being overloaded.