I want to run a dsbulk unload command, but my Cassandra cluster has ~1 TB of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into S3 instead of writing it to disk?
I'm running the following command in my dev environment, but obviously this just writes to disk on my machine:
bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip
DSBulk doesn't support this "natively" out of the box. Theoretically it could be implemented, since DSBulk is now open source, but somebody would have to do it.
Update:
A workaround, as pointed out by Adam, is to use aws s3 cp and pipe DSBulk's output into it, like this:
dsbulk unload .... | aws s3 cp - s3://...
but there is a limitation - the unload will be performed in a single thread, so it could be much slower.
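A fuller version of that pipe, reusing the keyspace and table from the question, could look like the sketch below (the bucket and object name are placeholders; DSBulk should write the unloaded rows to stdout when no -url is given, and gzip in the pipe keeps the compressed output):
# no -url, so the CSV rows go to stdout instead of files on disk (placeholder bucket/key)
bin/dsbulk unload -k myKeySpace -t myTable | gzip | aws s3 cp - s3://my-bucket/myTable/export.csv.gz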
In the short term you can use Apache Spark in local master mode with the Spark Cassandra Connector, something like this (for Spark 2.4):
spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1
and inside:
val data = spark.read.format("org.apache.spark.sql.cassandra").
  options(Map("table" -> "table_name", "keyspace" -> "keyspace_name")).
  load()
data.write.format("json").save("s3a://....")
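One thing to keep in mind: writing to s3a:// also needs the hadoop-aws module and AWS credentials on the Spark side. A minimal sketch, assuming Spark 2.4 built against Hadoop 2.7 and credentials in the usual environment variables (the bucket path is a placeholder):
// start the shell with hadoop-aws next to the connector, e.g.:
// spark-shell --packages com.datastax.spark:spark-cassandra-connector-assembly_2.11:2.5.1,org.apache.hadoop:hadoop-aws:2.7.3

// give the S3A filesystem its credentials (skip this if they come from an instance profile)
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// to get the same gzipped CSV as the dsbulk command, write CSV instead of JSON (placeholder bucket/path)
data.write.option("compression", "gzip").csv("s3a://my-bucket/myTable/")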