I ran the command below on an EC2 instance to unload data from Cassandra and store it on the instance, but I observe that each dsbulk unload command generates 2 JSON files, irrespective of how large or small the data is.
How can I control how many files are generated? For example, suppose I want a particular dsbulk unload to generate 5 part files instead of 2?
dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data
The default behaviour of the DataStax Bulk Loader is to parallelise the work across multiple threads if the machine has multiple cores, which is why you get more than one output file regardless of the data size.
To limit the number of written files to a single file, set the file concurrency to 1 with:
$ dsbulk unload -maxConcurrentFiles 1 ...
Just be aware that this will limit the throughput of DSBulk, since the files will then be written by a single thread.
For details, see DSBulk Connector options. Cheers!
[UPDATED] Use a single dash (-) in -maxConcurrentFiles, as advised by Alex Dutra/DSBulk dev. 🙂
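For the 5-file case asked about in the question, the same option should apply with a different value. The following is a minimal sketch, not a tested recipe: the keyspace, table, host, and output path are copied from the question, NUM_FILES is a hypothetical variable name used only for illustration, and it assumes -maxConcurrentFiles acts as an upper bound, so DSBulk may still write fewer files than requested.

```shell
# Sketch: request up to 5 output files from a dsbulk unload.
# NUM_FILES is a hypothetical name; all other values come from the question.
NUM_FILES=5

# Build the full command as a string so it can be reviewed before running.
cmd="dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data -maxConcurrentFiles ${NUM_FILES}"

# Print the command; paste it into the shell (or use eval) once it looks right.
echo "$cmd"
```

Printing the command first makes it easy to check the options before touching production data. After the unload finishes, count the files in the output directory to confirm how many were actually written.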