marklogic, marklogic-corb

Bulk data transformation using CoRB


I am doing data transformation using CoRB on a MarkLogic 3-node cluster with 128GB RAM servers.

Currently I am running my CoRB job with 16 threads (no other jobs are running in parallel). Is it OK to increase the thread count to improve performance? If so, what is the maximum number of threads I can allocate to the CoRB job?


Solution

  • The short answer is yes. You should be able to increase the thread count of your CoRB job.

    However, several factors determine what the optimal thread count is, and whether increasing it helps at all.

    For instance, if you have already maxed out the available appserver threads (default is 32 per host), are pushing CPU to the max, and/or are encountering deadlocks, then adding more threads may not help, and can actually reduce throughput.

    If you have a 3-node cluster with all three configured for that XDBC appserver, then you would want to spread the load across all three nodes, to take advantage of available appserver threads and resources on those servers to perform the transformation. So, either run through a load-balancer, or configure the CoRB options to spread the load to multiple hosts.
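
As a rough illustration of the second option, the sketch below builds the text of a CoRB options file that lists one connection URI per host. It assumes that CoRB's XCC-CONNECTION-URI option accepts a comma-separated list of URIs and that CONNECTION-POLICY controls how connections are balanced across them; check the README for your CoRB version. The host names, port, credentials, and database name are placeholders, not values from the question.

```python
from urllib.parse import quote


def corb_multi_host_options(hosts, port, user, password, database, thread_count):
    """Return CoRB options text that spreads connections across several hosts.

    Assumes XCC-CONNECTION-URI takes a comma-separated list of URIs.
    """
    # URL-encode the credentials so characters such as '@' or ':' do not
    # break the connection URI.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    uris = [
        f"xcc://{safe_user}:{safe_password}@{host}:{port}/{database}"
        for host in hosts
    ]
    options = {
        "XCC-CONNECTION-URI": ",".join(uris),
        "CONNECTION-POLICY": "ROUND-ROBIN",  # assumed policy name
        "THREAD-COUNT": str(thread_count),
    }
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"
```

Calling `corb_multi_host_options(["ml-node1", "ml-node2", "ml-node3"], 8010, "corb-user", "secret", "Documents", 48)` and saving the result to the job's options file would give each of the three hosts a share of the 48 threads.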

    You can increase the appserver threads and then further increase the thread count to allow for more concurrent query executions. As long as the execution times remain fairly consistent and do not increase, then you should get more throughput.
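
The reasoning here can be put into numbers. In steady state, each thread works on one item at a time, so the rate is roughly the thread count divided by the average execution time per item. The sketch below is a simplified model that ignores startup and queueing overhead, and the figures in the usage note are made up for illustration rather than measured.

```python
def estimated_rate(thread_count, avg_execution_seconds):
    """Approximate items processed per second for a steady-state job.

    Simplified model: every thread is always busy with exactly one item.
    """
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if avg_execution_seconds <= 0:
        raise ValueError("avg_execution_seconds must be positive")
    return thread_count / avg_execution_seconds
```

With 16 threads at 0.2 seconds per item, the model gives about 80 items per second. Doubling to 32 threads doubles that to about 160 only if the execution time stays at 0.2 seconds. If contention pushes it to 0.5 seconds, 32 threads yield about 64 items per second, which is worse than the original 16 threads.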

    You may find that there are diminishing returns at some point. With a resource-intensive job, adding threads increases demand on the servers (CPU load, lock-wait times, etc.), so you can hit a plateau where execution times get longer as more threads are applied, and the overall rate may even drop. At that point, you would need to see whether you can tune the query, or scale up/out, if you want more throughput.

    If you have configured the COMMAND-FILE or JOB-SERVER-PORT option, you can adjust the thread count up or down while the job is running and monitor the rates to experiment and find the optimal thread count.
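
A minimal sketch of the COMMAND-FILE approach follows. It assumes the job was started with COMMAND-FILE pointing at a path you can write to, and that CoRB periodically re-reads that file as a properties file and applies a THREAD-COUNT entry. The function name and the path in the usage note are hypothetical. The file is written to a temporary name and then renamed, so that a poll never sees a half-written file.

```python
import os
import tempfile


def set_corb_thread_count(command_file, thread_count):
    """Write THREAD-COUNT=<n> to the file a running CoRB job polls.

    Assumes the job was launched with COMMAND-FILE set to command_file.
    """
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    directory = os.path.dirname(os.path.abspath(command_file))
    # Create the temporary file in the same directory so the rename
    # stays on one filesystem and replaces the target in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(f"THREAD-COUNT={thread_count}\n")
    os.replace(tmp_path, command_file)
```

For example, `set_corb_thread_count("/tmp/corb-command.properties", 24)` would request 24 threads. Step the value up gradually, watch the reported rate after each change, and step back down once the rate stops improving.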