unicode msxml4

MSXML.DOMDocument.4.0 loadXML with Chinese Unicode characters


Currently, I'm trying to use the MSXML loadXML method in ASP to load an XML string that may contain Unicode Chinese characters such as

𠮢 (U+20BA2, 4 bytes)

and the XML string looks like

<City>City</City><Name>𠮢</Name>

So, in my code, I can see that the XML string comes in correctly, but loadXML returns an error message like

Invalid unicode characters, &#55362;&#57250;

Can someone please tell me what I can do to resolve this issue?

Thanks,

Edited

The code looks like this

    Set objDoc = CreateObject("MSXML2.DOMDocument")
    objDoc.async = false
    objDoc.setProperty "SelectionLanguage", "XPath"
    objDoc.validateOnParse = false
    objDoc.loadXML(strXml)

Solution

  • I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.

    I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>&#55362;&#57250;</element>, because that's not well-formed XML. If you do have this in your markup then you need to fix the serialiser that produced it because neither MSXML nor any standards-compliant XML parser will load it.

    If 𠮢 is turned into a character reference it must be &#134050; (or &#x20BA2;). Code units 55362 (0xD842) and 57250 (0xDFA2) are 'surrogates', reserved for encoding astral plane characters in UTF-16. They are not characters in their own right, so they can't appear as character references in an XML document.
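
If the serialiser can't be fixed right away, one possible stopgap is to repair the string before calling loadXML. The following is only a rough sketch in VBScript, not a tested fix for the asker's setup: the function names SurrogatesToCodePoint and FixSurrogateReferences, the <Root> wrapper element and the strMessage variable are all made up for illustration. It assumes the bad references are decimal and always arrive as an adjacent high/low pair; hex references such as &#xD842; are not handled.

```vbscript
Option Explicit

' Decimal constants are used on purpose: in VBScript a hex literal
' such as &HD800 is read as a negative 16-bit Integer.
Const HIGH_FIRST = 55296   ' 0xD800, first high surrogate
Const HIGH_LAST = 56319    ' 0xDBFF, last high surrogate
Const LOW_FIRST = 56320    ' 0xDC00, first low surrogate
Const LOW_LAST = 57343     ' 0xDFFF, last low surrogate

' Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
Function SurrogatesToCodePoint(hi, lo)
    SurrogatesToCodePoint = 65536 + (CLng(hi) - HIGH_FIRST) * 1024 + (CLng(lo) - LOW_FIRST)
End Function

' Hypothetical helper: rewrite pairs like &#55362;&#57250; as &#134050;.
Function FixSurrogateReferences(xml)
    Dim re, m, hi, lo, result, pos
    Set re = New RegExp
    re.Global = True
    ' Candidate pairs only; the exact range check happens below.
    re.Pattern = "&#(5[56]\d{3});&#(5[67]\d{3});"

    result = ""
    pos = 1   ' 1-based position of the first character not yet copied
    For Each m In re.Execute(xml)
        hi = CLng(m.SubMatches(0))
        lo = CLng(m.SubMatches(1))
        If hi >= HIGH_FIRST And hi <= HIGH_LAST And lo >= LOW_FIRST And lo <= LOW_LAST Then
            ' Copy the text before the match, then the corrected reference.
            ' FirstIndex is 0-based, Mid is 1-based.
            result = result & Mid(xml, pos, m.FirstIndex + 1 - pos)
            result = result & "&#" & SurrogatesToCodePoint(hi, lo) & ";"
            pos = m.FirstIndex + 1 + m.Length
        End If
    Next
    FixSurrogateReferences = result & Mid(xml, pos)
End Function

' Usage: <Root> is a made-up wrapper so the sample has a single root element.
Dim objDoc, strXml, strMessage
strXml = "<Root><City>City</City><Name>&#55362;&#57250;</Name></Root>"

Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = False
If objDoc.loadXML(FixSurrogateReferences(strXml)) Then
    strMessage = "Loaded"
Else
    ' parseError.reason holds the parser's own explanation.
    strMessage = "Parse error: " & objDoc.parseError.reason
End If
```

For the pair in the question the arithmetic is 65536 + (55362 - 55296) * 1024 + (57250 - 56320) = 134050, which matches the &#134050; reference above. Checking the return value of loadXML and reading parseError.reason also gives you the exact message to post if the load still fails.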