When we save an XML document from MarkLogic with the xdmp:save function, the file is written in UTF-8.
When we export the same document with the MarkLogic CoRB tool, the file is written in ANSI encoding instead of UTF-8.
Why?
The following XQuery, run from MarkLogic Query Console, saves the XML file in UTF-8:
xquery version "1.0-ml";
let $data := fn:collection('00-2346447146')/metadata
return xdmp:save("E:\ML_CoRB_Tool\DD-7759627900-test\Report\00-2346447146-1.xml", $data)
The following CoRB PROCESS-MODULE XQuery, however, saves the same XML file in ANSI encoding:
xquery version "1.0-ml";
declare variable $URI external;
declare variable $SCR-database-name := 'SCR';
let $scr-data:= xdmp:eval('xquery version "1.0-ml";
declare variable $URI external;
let $UPI := fn:replace($URI, ".xml", "")
let $scr-metadata := cts:search(collection("scr-asset"), cts:element-range-query(xs:QName("SAPID"), "=", xs:int($UPI)))
let $assetID := $scr-metadata/metadata/assetIdentifiers/assetIdentifier/AssetID
return
try
{
if ($scr-metadata)
then $scr-metadata
else <doc-not-found>{fn:concat("DOC-NOT-PRESENT for UPI: ", $UPI)}</doc-not-found>
}
catch($x)
{
(
xdmp:log("============Transform error ============="),
xdmp:log($x),
<error>{fn:concat("ERROR in UPI:", $UPI," Assetid: ",$assetID)}</error>
)
}'
, (xs:QName("URI"), $URI),
<options xmlns="xdmp:eval">
<database>{xdmp:database($SCR-database-name)}</database>
</options>
)
return
if ($scr-data/metadata) then $scr-data else ()
We are using the following CoRB properties:
THREAD-COUNT=8
MODULE-ROOT=/
MODULES-DATABASE=.\\test\\XQuery\\PROD-Metadata
URIS-FILE=.\\test\\Input\\assets_for_extraction_from_scr_20220121.csv
PROCESS-MODULE=.\\test\\XQuery\\new-query.xqy|ADHOC
EXPORT-FILE-DIR=.\\test\\Report
URIS_BATCH_REF='URIS_BATCH_REF'
LOADER-SET-URIS-BATCH-REF=true
EXPORT-FILE-URI-TO-PATH=false
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
PROCESS-TASK=com.marklogic.developer.corb.ExportToFileTask
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
DECRYPTER=com.marklogic.developer.corb.JasyptDecrypter
The CoRB tasks use the method getValueAsBytes(), which invokes:
item.asString().getBytes();
The documentation for the Java String getBytes() method says:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So it looks like CoRB should explicitly ask for UTF-8 encoded bytes to be written, rather than rely on the platform charset:
item.asString().getBytes(StandardCharsets.UTF_8);
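To make the difference concrete, here is a minimal sketch that is not taken from the CoRB source. The class name GetBytesDemo and the sample string are hypothetical, and ISO-8859-1 is used only as a stand-in for a single-byte "ANSI" code page such as the one a Windows machine might use as its platform default. It encodes the same string three ways and prints the bytes in hex:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of CoRB.
public class GetBytesDemo {

    // Render a byte array as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café", one non-ASCII character

        // No argument: uses the platform default charset, so the result
        // depends on the machine and the JVM settings.
        System.out.println("default:    " + toHex(text.getBytes()));

        // Explicit charsets: the result is the same on every machine.
        System.out.println("UTF-8:      " + toHex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + toHex(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The "default" line matches one of the other two (or neither) depending on where it runs, which is the same kind of variation as a file coming out as ANSI in one environment and UTF-8 in another.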
I have filed an issue to get that adjusted.
In the meantime, as @David Ennis has suggested, you can set the default file encoding to UTF-8 by setting the system property -Dfile.encoding=UTF-8 on the JVM that runs CoRB.
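If it is unclear whether the property is actually being picked up, a small check like the following may help. It is only a sketch: the class name DefaultCharsetCheck is hypothetical and it is not part of CoRB. Run it with the same java executable and the same -D options used to launch CoRB, and it reports the charset that a no-argument getBytes() call would use:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of CoRB.
public class DefaultCharsetCheck {

    // Summarize the encoding-related settings of the running JVM.
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Note that the -D option has to appear before the main class name on the java command line, otherwise it is passed to the program as an ordinary argument. On JDK 18 and later the default charset is already UTF-8 unless it is overridden, so the workaround mainly matters on older Java versions.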