cassandrascylla

Cassandra/Scylla optimal storage model for Category-Item entities


Question regarding Cassandra or ScyllaDB database schema for maximum performance.

There are two entities: Category 1—* Item (there can be several items in one category). Only the following operations are possible:

  1. Create a new category
  2. Add a new item to the category.
  3. Read the entire category along with its items. (there are always few items per category, no more than 50).

Which scheme will be the best in terms of performance?

  1. Two tables. Table for categories, table for items. Items key = (categoryId, itemId)
  2. One table for categories and items. Items are stored in the category field as a list (set)
  3. ? something else

Will adding a new Item be faster in the first case or in the second? Reading, of course, from one table will be faster.


Solution

  • Based on the 3 types of operations you described, I'm assuming no updates or deletions in the items in each category.

    The simple answer is that a single table would perform well against the operations required. Something like the following would be a good use case:

    CREATE TABLE mykeyspace.items_by_category (
        categoryId timeuuid,
        itemId timeuuid,
        ... 
        PRIMARY KEY (categoryId, itemId)
    );
    

    Another aspect to pay attention to is the expected volume within each partition. In this case, each category would be a partition, and each partition would scale up with the volume of data in each item row it stores. It's ideal to keep partition sizes under 100MB to avoid issues with compaction and garbage collection (GC is not a problem in ScyllaDB, though). Depending on other columns you'd want to include to represent each item this scenario may or not be possible.

    For example, if you have 50 items in one collection, where each item is represented by 3MB of data on average, you may already stumble into some performance issues - this can be addressable by bucketing the partition, or moving item data to a different table.

    You can also create a table for items, but as long as you don't expect partition size to get out of hand in the items_by_category table, I'd keep all category and item data in a single table. Otherwise you'll have to manage referential integrity at the application - Cassandra doesn't support joins neither understands the concept of a foreign key.

    I would avoid using collections in general, especially lists. The main reason being that lists perform reads-before-writes on some non-idempotent operations. Since itemId is unique, using a set would be preferable to a list if collections are a must.

    Furthermore, using itemId as a clustering column rather than an collection entry is likely more scalable if the requirements change in the future and collections end up storing more than 50 items.

    In the 2 options you provided, adding an item as an entry in a list can potentially take longer than adding it as a column value in 2 tables (potentially, because we're comparing 2 simple upserts with a read-before-write operation, and that'd need some testing). Nonetheless, the most performant solution should be adding each item to a single table - which is faster than to two tables, assuming it's a simple upsert in either case.

    This is the theory, and even though I'm fairly confident that the 1 table option (without collections) would be optimal, I'd still strongly recommend setting up a cassandra-stress profile with different models of workloads in a test cluster to get a sense of performance.