cassandracqldatastax-java-drivercql3

secondary indexes for low cardinality columns cassandra


we have a table with 15 million records, and ours is a 10 node cassandra cluster. We have a column which has close to 20 repeatable values. Is it advisable to build secondary index on this column?


Solution

  • Assuming completely uniform distribution on that column, then each column value would map to 750,000 rows. Now while the DataStax doc on When To Use An Index states that...

    built-in indexes are best on a table having many rows that contain the indexed value.

    750,000 rows certainly qualifies as "many." But even given that, remember that you're also talking about 14,250,000 rows that Cassandra has to ignore when fulfilling your query.

    Also, unless you have a RF of 10 (and I doubt that you would with 10 nodes), you are going to incur network time as Cassandra works between all of the different nodes required to fulfill your query. For 750,000 rows, that's probably going to timeout.

    The only way I think this could be efficient, would be to first restrict your query by a partition key. Using the secondary index while also restricting with a partition key will help Cassandra find your rows more quickly. Even so, with a dataset that big, I would re-evaluate your data model and try to figure out a different table to fulfill that query without requiring a secondary index.