I am new to MarkLogic and am evaluating it for loading huge CSV/text data with some transformation, such as filtering based on a condition. As far as I know, I can load the data in two ways:
1) Using the Java API for MarkLogic in a multi-threaded environment.
2) MLCP with a custom transformation.
I would like to know which is the better way to achieve this, or whether there are other ways that I do not know of.
Thanks in advance.
Both ways that you've mentioned will work. One is easier to implement, but you may be able to get better performance from the other.
Using MLCP with a custom transformation should be simple to do. MLCP already knows how to process CSV data and turn it into XML or JSON. With a custom transform, you will get a single XML or JSON doc as input and can alter it as you like. The implementation is pretty straightforward. The caveat is this:
When you use a transform, the batch size is always set to 1 and the -batch_size option is ignored.
Over a large data set, this will have a noticeable impact on how quickly your data gets loaded. If you don't plan to mess with the URIs in your transform, look into the -fastload option.
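For illustration, an MLCP import with a custom transform might look something like the sketch below; the host, credentials, file paths, column names, and transform module URI are all placeholders, and the transform module would need to be installed in your modules database first.

```
mlcp.sh import -mode local \
  -host localhost -port 8000 -username admin -password admin \
  -input_file_path /data/input.csv \
  -input_file_type delimited_text \
  -document_type json \
  -delimited_uri_id id \
  -output_uri_prefix /loaded/ -output_uri_suffix .json \
  -transform_module /transforms/my-transform.sjs
```

If your transform leaves the URIs alone, adding -fastload to a command like that is where the extra speed comes from.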
If you use the Java API instead, you'll need to parse the CSV (I'm sure there's a library around), apply the desired transform, and do the inserts. That's more code that you'll need to write (and perhaps maintain, if you'll be doing this over time), but since you'll be inserting a bunch of already-transformed documents, you can insert several documents in a single transaction.
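As a rough sketch of that approach with the MarkLogic Java Client API (the connection details, column layout, and filter condition are placeholders, and a real CSV library would handle quoting better than a plain split):

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.document.DocumentWriteSet;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; point these at your own app server.
        DatabaseClient client = DatabaseClientFactory.newClient(
                "localhost", 8000, new DigestAuthContext("admin", "admin"));
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("/data/input.csv"))) {
            reader.readLine(); // skip the header row
            DocumentWriteSet batch = docMgr.newWriteSet();
            int count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(","); // naive split; use a CSV library for quoted fields
                // Hypothetical filter: skip rows whose third column is below a threshold.
                if (cols.length < 3 || Double.parseDouble(cols[2]) < 100) {
                    continue;
                }
                String json = String.format("{\"id\":\"%s\",\"name\":\"%s\",\"value\":%s}",
                        cols[0], cols[1], cols[2]);
                batch.add("/loaded/" + cols[0] + ".json",
                        new StringHandle(json).withFormat(Format.JSON));
                if (++count % 100 == 0) { // send 100 documents per request
                    docMgr.write(batch);
                    batch = docMgr.newWriteSet();
                }
            }
            if (batch.size() > 0) {
                docMgr.write(batch);
            }
        } finally {
            client.release();
        }
    }
}
```

Each docMgr.write(batch) call sends the whole write set to the server in one request, which is where the multi-document insert advantage comes in.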
If this is a one-time process, I'd lean toward MLCP unless your content is massive (don't ask me to define massive). If you're going to run this job multiple times over the long run, it's more likely to be worth the effort to do it in your Java layer.
Whichever way you go, it's probably worth reviewing the Designing a Content Loading Strategy section of the Loading Content Into MarkLogic Server guide.