marklogic-corb

MarkLogic CoRB - How to avoid a timeout when running CoRB


How do I avoid a CoRB timeout when running a large batch data pull of over 10 million PDF/XML documents? Do I need to reduce the thread count and batch size?

uris-module:

let $uris := cts:uris(
  (),
  (),
  cts:and-query((
    cts:collection-query("/sites"),
    cts:field-range-query("cdate", "<", "2019-10-01"),
    cts:not-query(
      cts:or-query((
        cts:field-word-query("dcax", "200")
        (: more code... :)
      ))
    )
  ))
)
return (fn:count($uris), $uris)

process.xqy:

xquery version "1.0-ml";

(: CoRB passes the URI(s) for this batch in $URI;
   with BATCH-SIZE greater than 1 they arrive delimited by ";" :)
declare variable $URI as xs:string external;

let $uris := fn:tokenize($URI, ";")
let $outputJson := "/output/json/"
let $outputPdf := "/output/pdf/"

for $uri1 in $uris
let $accStr := fn:substring-before(fn:substring-after($uri1, "/sites/"), ".xml")
let $pdfUri := fn:concat("/pdf/iadb/", $accStr, ".pdf")
let $doc := fn:doc($uri1)
let $obj := json:object()
let $_ := map:put($obj, "PaginationOrMediaCount", fn:number($doc/rec/MediaCount))
let $_ := map:put($obj, "Abstract", fn:replace($doc/rec/Abstract/text(), "[^a-zA-Z0-9 ,.\-\r\n]", ""))
let $_ := map:put($obj, "Descriptors", json:to-array($doc/rec/Descriptor/text()))
let $_ := map:put($obj, "FullText", fn:replace($doc/rec/FullText/text(), "[^a-zA-Z0-9 ,.\-\r\n]", ""))
(: write the JSON projection of the record to the filesystem :)
let $_ := xdmp:save(
    fn:concat($outputJson, $accStr, ".json"),
    xdmp:to-json($obj)
)
(: save the companion PDF if one exists :)
let $_ := if (fn:doc-available($pdfUri))
    then xdmp:save(
        fn:concat($outputPdf, $accStr, ".pdf"),
        fn:doc($pdfUri)
    )
    else ()

return $URI

Solution

  • It would be easier to diagnose and suggest improvements if you shared the CoRB job options and the code for your URIS-MODULE and PROCESS-MODULE.

    The general concept of a CoRB job is that it splits up the work into multiple module executions rather than trying to do all of the work in a single execution, in order to avoid timeout issues and excessive memory consumption.
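
    For illustration, a CoRB job is typically configured through an options file along these lines (the connection URI, module names, and values below are placeholders, not settings from your job):

        XCC-CONNECTION-URI=xcc://user:password@localhost:8000
        URIS-MODULE=get-uris.xqy|ADHOC
        PROCESS-MODULE=process.xqy|ADHOC
        THREAD-COUNT=8
        BATCH-SIZE=1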

    For instance, if you wanted to download 10 million documents, the URIS-MODULE would select the URIs of all of those documents, and then each URI would be sent to the PROCESS-MODULE, which would be responsible for retrieving it. Depending upon the THREAD-COUNT, you could be downloading several documents at a time, but each execution should return very quickly.
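
    As a sketch of that pattern, a process module that only retrieves and saves a single document can be as small as the following (the /export/ output directory is an assumption for illustration):

        xquery version "1.0-ml";

        (: hypothetical minimal process module: CoRB hands us one URI,
           we fetch that one doc and save it to the filesystem :)
        declare variable $URI as xs:string external;

        xdmp:save(
          fn:concat("/export/", fn:tokenize($URI, "/")[fn:last()]),
          fn:doc($URI)
        )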

    Is the execution of the URIs module what is timing out, or the process module?

    You can increase the timeout for a request from the default limit up to the App Server's maximum timeout limit by calling: xdmp:set-request-time-limit()
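
    For example, near the top of the process module (the value of 3600 seconds is only an illustration, and cannot exceed the App Server's configured maximum):

        (: raise the timeout for this request to one hour, assuming the
           App Server's maximum time limit allows it :)
        xdmp:set-request-time-limit(3600)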

    Generally, the process module executions should complete quickly and shouldn't be timing out. One possible cause would be performing too much work in a single execution (i.e. setting BATCH-SIZE really large and doing too much at once), or maybe a misconfiguration or a poorly written query (i.e. rather than fetching the single doc identified by the $URI value, performing a search and retrieving all of the docs each time the process module is executed).
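
    To make that last point concrete, the difference looks roughly like this (element and collection names are placeholders):

        (: efficient: fetch exactly the one document this execution was given :)
        fn:doc($URI)

        (: anti-pattern: re-running a search inside every process module
           execution, touching many documents instead of just one :)
        cts:search(fn:collection(), cts:collection-query("/sites"))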