
Storing binary blobs in Cassandra


I am building a simple HTTP service that stores arbitrary binary objects. The service is backed by Cassandra and is essentially a simplified version of Amazon's S3. The system must withstand a heavy write load and should be highly available on both the write and the read path.

The stored data is effectively immutable: it can be deleted, but it cannot be updated. Therefore, data inconsistency is not an issue. The datastore must be able to expire old data efficiently.

The service uses Netflix's Astyanax library, which provides a recipe for storing (large) binary objects in Cassandra.
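
For reference, the recipe is used roughly like the following sketch. The keyspace wiring is omitted, and the column family name ("blob_chunks"), chunk size, and batch size are placeholder values, not something taken from the question.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.recipes.storage.CassandraChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ChunkedStorage;
    import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ObjectMetadata;

    public class BlobStore {

        private final ChunkedStorageProvider provider;

        public BlobStore(Keyspace keyspace) {
            // "blob_chunks" is a placeholder column family name
            this.provider = new CassandraChunkedStorageProvider(keyspace, "blob_chunks");
        }

        public ObjectMetadata put(String objectName, byte[] data) throws Exception {
            // The recipe splits the stream into chunks and writes each chunk separately
            return ChunkedStorage.newWriter(provider, objectName, new ByteArrayInputStream(data))
                    .withChunkSize(0x10000)   // 64 KiB chunks, an arbitrary example value
                    .call();
        }

        public byte[] get(String objectName) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ChunkedStorage.newReader(provider, objectName, out)
                    .withBatchSize(8)         // fetch a few chunks per request
                    .call();
            return out.toByteArray();
        }

        public void delete(String objectName) throws Exception {
            ChunkedStorage.newDeleter(provider, objectName).call();
        }
    }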

I see two solutions to tackle the problem, each with its own pros and cons, and I find it hard to estimate which approach fits Cassandra better.

Single table with TTL

Astyanax automatically chunks large objects into small pieces and stores them in a single table. A TTL is assigned to each blob to expire it after a certain period of time. A compaction run removes the blobs once their TTL has expired.

This solution works and is pretty straightforward to implement. I started out with SizeTieredCompactionStrategy, but I think DateTieredCompactionStrategy might be the better choice when dealing with TTL data.
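
To make the per-blob TTL concrete, here is a hedged sketch of how the expiration could be attached when writing through the recipe. Whether the writer actually exposes a withTtl(...) option, and the seven-day retention value, are assumptions for illustration only.

    import java.io.ByteArrayInputStream;

    import com.netflix.astyanax.recipes.storage.ChunkedStorage;
    import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ObjectMetadata;

    public class TtlBlobWriter {

        private static final int SEVEN_DAYS_IN_SECONDS = 7 * 24 * 60 * 60;

        public static ObjectMetadata putWithTtl(ChunkedStorageProvider provider,
                                                String objectName,
                                                byte[] data) throws Exception {
            return ChunkedStorage.newWriter(provider, objectName, new ByteArrayInputStream(data))
                    .withChunkSize(0x10000)
                    // Assumption: the writer supports a per-object TTL so that every
                    // chunk column expires SEVEN_DAYS_IN_SECONDS after being written.
                    .withTtl(SEVEN_DAYS_IN_SECONDS)
                    .call();
        }
    }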

My main concern is: can Cassandra's compaction keep up? Does anyone have experience with a similar use case?

Sharding data by time

Another approach would be to shard the data by time. I could create a table for each day and store the chunks in that table. In that case I could drop the whole table to get rid of the expired data.

This solution requires a little more implementation effort, but it simplifies and probably speeds up the deletion of expired data.
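
A minimal sketch of what such time sharding could look like with Astyanax, assuming one column family per UTC day and that Keyspace.dropColumnFamily(String) is available; the naming scheme is made up for illustration:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.recipes.storage.CassandraChunkedStorageProvider;
    import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;

    public class TimeShardedBlobStore {

        private final Keyspace keyspace;

        public TimeShardedBlobStore(Keyspace keyspace) {
            this.keyspace = keyspace;
        }

        private static String shardName(Date day) {
            SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            return "blob_chunks_" + format.format(day);   // e.g. "blob_chunks_20150101"
        }

        /** Provider that writes into the column family for the current day. */
        public ChunkedStorageProvider providerForToday() {
            return new CassandraChunkedStorageProvider(keyspace, shardName(new Date()));
        }

        /** Drop an entire daily shard once its retention period is over. */
        public void dropShard(Date day) throws Exception {
            // Assumption: Keyspace.dropColumnFamily(String) is available; dropping a
            // column family removes all of its data without writing per-row tombstones.
            keyspace.dropColumnFamily(shardName(day));
        }
    }

Note that with this layout the object name, or a separate metadata table, has to record which daily shard a blob was written to; otherwise reads cannot find the right column family.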

How performant is Cassandra at dropping a table?


Solution

  • The correct option for your scenario is DateTieredCompactionStrategy combined with a TTL assigned to each blob.

    Refer: http://www.datastax.com/dev/blog/datetieredcompactionstrategy
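
A hedged sketch of what that recommendation could look like when the column family is created through Astyanax. The option keys are assumed to follow the Thrift CfDef attribute names, the TTL value is just an example, and DateTieredCompactionStrategy requires a sufficiently recent Cassandra version (roughly 2.0.11 / 2.1.1 or later).

    import com.google.common.collect.ImmutableMap;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class BlobSchema {

        private static final ColumnFamily<String, String> CF_BLOB_CHUNKS =
                new ColumnFamily<String, String>("blob_chunks",
                        StringSerializer.get(), StringSerializer.get());

        public static void createBlobColumnFamily(Keyspace keyspace) throws Exception {
            keyspace.createColumnFamily(CF_BLOB_CHUNKS, ImmutableMap.<String, Object>builder()
                    // Assumption: option keys follow the Thrift CfDef attribute names.
                    .put("compaction_strategy",
                         "org.apache.cassandra.db.compaction.DateTieredCompactionStrategy")
                    // Fallback expiration in seconds, in addition to the per-blob TTL.
                    .put("default_time_to_live", 7 * 24 * 60 * 60)
                    .build());
        }
    }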