cassandra datastax dsbulk

How do I limit the files generated by DSBulk UNLOAD to just one CSV file?


I ran the command below on an EC2 instance to unload data from Cassandra and store it on that instance, but I observe that every dsbulk unload run generates 2 JSON files, irrespective of how large or small the data is.

How can I control how many files are generated? For example, suppose I want a particular dsbulk unload to generate 5 part files instead of 2?

dsbulk unload -k custdata -t orderhistory -h '172.xx.xx.xxx' -c json -url proddata/json/custdata/orderhistory/data

Solution

  • The default behaviour of the DataStax Bulk Loader is to parallelise its tasks across multiple threads if the machine has multiple cores.

    To limit the output to a single file, set the file concurrency to 1 with:

    $ dsbulk unload -maxConcurrentFiles 1 ...
    

    Just be aware that this will limit the throughput of DSBulk, since the output will be written by a single thread.

    For details, see DSBulk Connector options. Cheers!

    [UPDATED] Use a single dash (-) in -maxConcurrentFiles, as advised by Alex Dutra (DSBulk dev). 🙂
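
As a rough sketch, here is the command from the question with the file concurrency option appended. FILES, CMD and DSBULK_RUN are hypothetical shell variables introduced only for this illustration; they are not DSBulk settings. Setting FILES to 5 should cap the unload at five files written in parallel, though the exact file count may also depend on other connector settings, so treat that as an assumption to verify against your DSBulk version.

```shell
# FILES is a hypothetical variable for this sketch: 1 = single output file,
# 5 = at most five files written concurrently (assumption, verify locally).
FILES="${FILES:-1}"

# Same keyspace, table, host and output path as in the question.
set -- dsbulk unload \
  -k custdata -t orderhistory \
  -h '172.xx.xx.xxx' \
  -c json \
  -url proddata/json/custdata/orderhistory/data \
  -maxConcurrentFiles "$FILES"

# Print the assembled command so it can be reviewed first.
CMD="$*"
echo "$CMD"

# Run it only when explicitly requested (DSBULK_RUN is also hypothetical).
if [ "${DSBULK_RUN:-0}" = "1" ]; then
  "$@"
fi
```

Run it as-is to preview the command, or with DSBULK_RUN=1 FILES=5 to actually unload into up to five part files.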