cassandracqlcql3

Can cassandra create table over files stored in S3?


I know that Athena allows you to query data directly from files stored in Amazon S3 using SQL syntax. Athena doesn't have a storage system; instead, it creates a schema and metadata layer on top of the data stored in S3.

I know Hive also does something similar. With Hive, you can define external tables that reference data stored in S3, allowing you to query that data using HiveQL, which is similar to SQL.

Can we do something like this in Cassandra? I want to have my files in S3 and query them using CQL in Cassandra.

Cassandra has a partition key and a clustering key. The file name could be the partition key value and the clustering key should be data inside the file in sorted order.

Is there a way we can achieve such a solution?


Solution

  • At the time of writing, this feature isn't available in Cassandra. The only way Cassandra can read data is if it is in the appropriate SSTable format.

    Depending on your motivation, there are products from DataStax which may tangentially provide the benefits you are after. As a matter of full disclosure, I am a DataStax employee but I am also an Apache Cassandra committer so take from my response what you wish.

    DataStax Enterprise (DSE) has a feature called Tiered Storage which provides different storage options such that data is moved automatically between different types of storage depending on whether the data is "hot" (frequently-used) or seldom accessed.

    Alternatively if you are looking to save costs, Astra DB which is a Cassandra service on the cloud stores data files (SSTables) on object stores including S3.

    Either way, the data files are in Cassandra's SSTable format, regardless of where they are stored. Cheers!