Cassandra follows which partitioning technique?


I am new to Cassandra, and while reading about vertical and horizontal database partitioning I got confused. I would like to know whether Cassandra follows Horizontal partitioning (sharding) OR vertical partitioning technique?

Moreover, according to my understanding, as Cassandra is column oriented DB, it should follow Vertical partitioning technique. If this is not the case then can anyone please explain it in detail?


Solution

  • as Cassandra is column oriented DB

    This point has been discussed ad nauseam on Stack Overflow, specifically in this answer. Cassandra is NOT a column-oriented database. It is a partitioned row store. Data is organized and presented in "rows," similar to a relational database.

    whether Cassandra follows Horizontal partitioning (sharding)

    Technically, Cassandra is what you would call a "sharded" database, but it's almost never referred to in this way. Essentially, each node is responsible for specific ranges of tokens. A token is a numeric value that identifies where a partition lives, and with the Murmur3Partitioner tokens range from -2^63 to +2^63-1.

    In fact, in a scenario where a node is simplified to hold a single token range, you can compute the ranges based on the number of nodes in the cluster (data center) like this:

    python3 -c 'print([str(((2**64 // 6) * i) - 2**63) for i in range(6)])'
    
    ['-9223372036854775808', '-6148914691236517206', '-3074457345618258604',
     '-2', '3074457345618258600', '6148914691236517202']
    

    Of course with vNodes, a node is almost always responsible for multiple token ranges.
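
    To make the vNodes point concrete, here is a minimal sketch of a ring in which every node holds several tokens instead of one. The node names, the `tokens_per_node` value, and the helpers `build_ring` and `ranges_by_node` are all made up for illustration, and the tokens are simply drawn at random; this is not how Cassandra itself is implemented, only a model of the idea.

```python
import random

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3Partitioner token bounds


def build_ring(nodes, tokens_per_node, seed=0):
    """Give every node several random tokens and return the sorted ring."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    ring = [(rng.randint(MIN_TOKEN, MAX_TOKEN), node)
            for node in nodes
            for _ in range(tokens_per_node)]
    return sorted(ring)


def ranges_by_node(ring):
    """Map each node to the (start, end] token ranges it is responsible for."""
    owned = {}
    for i, (token, node) in enumerate(ring):
        previous_token = ring[i - 1][0]  # i == 0 wraps around to the last token
        owned.setdefault(node, []).append((previous_token, token))
    return owned


# Hypothetical three-node cluster with four tokens per node.
ring = build_ring(["node1", "node2", "node3"], tokens_per_node=4)
for node, ranges in sorted(ranges_by_node(ring).items()):
    print(node, "owns", len(ranges), "token ranges")
```

    Each node ends up owning as many ranges as it has tokens, scattered around the ring rather than in one contiguous block.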

    At operation time, the partition key is hashed into a token. This token tells Cassandra which node the data resides on. Consider this table:

    SELECT token(studentid),studentid,fname,lname FROM student ;
    
     system.token(studentid) | studentid | fname | lname
    -------------------------+-----------+-------+----------
        -5626264886876159064 | janderson | Jordy | Anderson
        -1472930629430174260 |   aploetz | Avery |   Ploetz
         8993000853088610283 |      mgin | Micah |      Gin
    
    (3 rows)
    

    Because this table's primary key is just studentid, that column is also the partition key. The token(studentid) values above are the hashes of each partition key, and they determine which token range (and therefore which node) holds each row.
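
    As a rough illustration, the three tokens from the query output can be matched against the six single-token nodes computed earlier. This sketch assumes each node owns the range from the previous node's token (exclusive) up to its own token (inclusive), wrapping around at the end of the ring; `owner_index` is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_left

NUM_NODES = 6
# Same arithmetic as the one-liner earlier in this answer: one token per node.
node_tokens = [((2**64 // NUM_NODES) * i) - 2**63 for i in range(NUM_NODES)]


def owner_index(token, tokens=node_tokens):
    """Return the index of the node whose (previous, own] range holds token."""
    idx = bisect_left(tokens, token)  # first node token >= the row's token
    return idx % len(tokens)          # past the last token: wrap to node 0


# Tokens copied from the SELECT token(studentid) output above.
student_tokens = {
    "janderson": -5626264886876159064,
    "aploetz": -1472930629430174260,
    "mgin": 8993000853088610283,
}

for studentid, token in student_tokens.items():
    print(studentid, "-> node", owner_index(token))
```

    Under that assumption the three rows land on three different nodes, which is the whole point of hashing the partition key. A real cluster also replicates each partition to additional nodes, which this sketch leaves out.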

    If there were another table in the same keyspace which also used studentid as its partition key, that table's rows for a given studentid would hash to the same token and so be stored on the same nodes as the matching student rows.
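
    The reason is that the token depends only on the partition key value, not on the table. The sketch below shows that with a stand-in hash: it uses MD5 from Python's hashlib truncated to a signed 64-bit integer, NOT Murmur3, so the numbers it prints will not match Cassandra's tokens. The `grades` table name and the `standin_token` helper are hypothetical.

```python
import hashlib


def standin_token(partition_key: str) -> int:
    """Hash a key to a signed 64-bit integer (stand-in for Murmur3, not the real thing)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Two hypothetical tables that both use studentid as the partition key.
rows = [("student", "aploetz"), ("grades", "aploetz")]

for table, studentid in rows:
    # The table name plays no part in the hash, so both lines show the same token.
    print(table, studentid, standin_token(studentid))
```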

    In any case, this is a simplified version of what happens. Feel free to read up on vNodes (link above) as well as Cassandra: High Availability by Robbie Strickland. He has written (IMO) the best description of Cassandra's hashing and partition distribution process.