import marklogic mlcp xml-encoding

MarkLogic Content Pump: using -content_encoding for XML with encoding="US-ASCII"?


MarkLogic is installed on a Windows 10 machine.

We are using MarkLogic Content Pump (MLCP) to import data.

The import works well for documents that begin with

<?xml version="1.0" encoding="UTF-8"?>

It fails with an error when importing documents declared with a non-UTF-8 encoding, for example:

<?xml version="1.0" encoding="US-ASCII"?>

I looked at the MLCP guide and found the -content_encoding option, but it is not working: the import throws an error for records that contain special characters such as ´, δ, “ and so on.

ERROR mapreduce.ContentWriter: XDMP-DOCENTITYREF: Invalid entity reference "gamma"

I am passing it as follows:

mlcp.bat -content_encoding "US-ASCII"

When I looked at this document, it says "Only UTF-8 is supported."

When I looked at this, it says "The option value must be a character set name accepted by your JVM;"

So I am confused: I am not sure how to solve this issue, or how to set the character set in the JVM.
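
A note on the error message: it complains about an entity reference ("gamma"), not about bytes in the wrong character set. XML itself predefines only five named entities (amp, lt, gt, apos, quot), so a reference such as &gamma; is undefined unless a DTD declares it, whatever the encoding declaration says. The sketch below illustrates this with Python's standard-library parser. It is an assumption that MarkLogic's loader behaves the same way for this input; the sample documents and the helper name try_parse are made up for illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_bytes):
    """Parse a document and return ("ok", root text) or ("error", message)."""
    try:
        return ("ok", ET.fromstring(xml_bytes).text)
    except ET.ParseError as exc:
        return ("error", str(exc))


# Hypothetical sample documents; only the declaration or the reference differs.
ascii_named = b'<?xml version="1.0" encoding="US-ASCII"?><doc>&gamma;</doc>'
utf8_named = b'<?xml version="1.0" encoding="UTF-8"?><doc>&gamma;</doc>'
ascii_numeric = b'<?xml version="1.0" encoding="US-ASCII"?><doc>&#947;</doc>'

for label, data in [
    ("US-ASCII + named entity", ascii_named),
    ("UTF-8 + named entity", utf8_named),
    ("US-ASCII + numeric reference", ascii_numeric),
]:
    status, detail = try_parse(data)
    print(f"{label}: {status} ({detail!r})")
```

Both documents with the named entity are rejected, while the numeric character reference parses under the US-ASCII declaration. This suggests the encoding setting alone may not be what makes these records fail.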


Solution

  • Thanks grtjn for your reply.

    Adding -xml_repair_level full worked: all records are now committed, with no failed records.

    Special characters (the entity references ending in ;) are stored in MarkLogic as the real characters, as follows

    I am hoping that this is acceptable content from a business point of view.

    The only major challenge now is to test with garbled characters across millions of XML records.

    Thanks grtjn for your help.
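
For the testing step mentioned above, one possible aid is to scan records for named entities before import. The sketch below is not part of MLCP and only illustrates an assumed pre-processing approach. It rewrites HTML-style named entities as numeric character references, which need no DTD, and reports any names it does not recognise. The function name to_numeric_refs is hypothetical, and the regular expression is deliberately naive: it does not skip CDATA sections or comments.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references such as &gamma; but not numeric ones such as &#947;.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(xml_text):
    """Return (rewritten text, set of entity names that could not be mapped)."""
    unknown = set()

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # leave &amp;, &lt; and friends untouched
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            unknown.add(name)  # report it so the record can be inspected
            return match.group(0)
        return "&#%d;" % codepoint

    return ENTITY_RE.sub(replace, xml_text), unknown


sample = "<doc>&gamma; &delta; &amp; &bogus;</doc>"
rewritten, unmapped = to_numeric_refs(sample)
print(rewritten)
print(sorted(unmapped))
```

Running this over a sample of the records and collecting the unmapped names gives a rough list of inputs worth checking by hand. Whether to rewrite the files or rely on -xml_repair_level full is a separate decision.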