java xml parsing vtd-xml kanji

Java XML Parsing - incorrect string version of the data with VTD-XML


I am parsing an XML document in UTF-8 encoding with Java using VTD-XML.

A small excerpt looks like:

<literal>𠀋</literal>
<literal>𠂉</literal>
<literal>𠂢</literal>

I want to iterate through each literal and print it out to the console. However, what I get is:

¢

I am correctly navigating to each element. The way that I get the text value is by calling:

private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
    String strValue = null;
    if (val != -1) {
        strValue = vn.toNormalizedString(val);
    }
    return strValue;
}

I've also tried vn.getXPathStringVal(), but it yields the same result.

I know that each of the literals above isn't just a string of length one. Rather, each seems to be a single Unicode "character" made up of two Java chars (a surrogate pair). I am able to correctly parse and output kanji characters whose length is just one.
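To make the "two characters" observation concrete, here is a minimal sketch using only the JDK. The class name SurrogateDemo is made up for illustration, and U+2000B is assumed to be the code point of the first literal in the excerpt; any code point at or above U+10000 behaves the same way.

```java
final class SurrogateDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. characters as a reader sees them.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the string from its code point to avoid source-encoding issues.
        String s = new String(Character.toChars(0x2000B));
        System.out.println(utf16Units(s));                          // 2
        System.out.println(codePoints(s));                          // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.printf("U+%X%n", s.codePointAt(0));              // U+2000B
    }
}
```

So a length of two is expected for these literals; what matters is that both halves of the pair survive parsing.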

My question is - how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?
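On the "parse the bytes myself" idea, a rough sketch of the decoding step, assuming you already have the document bytes plus the byte offset and length of the text node. I believe VTD-XML exposes these through vn.getXML().getBytes(), vn.getTokenOffset(val) and vn.getTokenLength(val), with offsets counted in bytes for a UTF-8 document, but treat that as an assumption and check it against the Javadoc of your version. Utf8Slice and decode are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Slice {
    // Decode a byte range of a UTF-8 document into a String.
    // The JDK decoder turns 4-byte UTF-8 sequences into surrogate pairs.
    static String decode(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the parsed document; the literal is U+2000B.
        byte[] doc = "<literal>\uD840\uDC0B</literal>".getBytes(StandardCharsets.UTF_8);
        // "<literal>" is 9 bytes long and the character occupies 4 bytes.
        String text = decode(doc, 9, 4);
        System.out.printf("U+%X%n", text.codePointAt(0));
    }
}
```

This only sidesteps the string conversion; entity references and whitespace normalization would still need to be handled by hand.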

EDIT

Code to process each line of the XML - converting it to a byte array and then back to a String.

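import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read the file as UTF-8 explicitly; FileReader would use the platform default charset.
try (BufferedReader br = Files.newBufferedReader(Paths.get("res/sample.xml"), StandardCharsets.UTF_8)) {
    String line;
    while ((line = br.readLine()) != null) {
        // Round-trip through UTF-8 bytes, naming the charset on both sides.
        byte[] myBytes = line.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(myBytes, StandardCharsets.UTF_8));
    }
} catch (IOException e) {
    // Also covers a missing file (NoSuchFileException is an IOException).
    e.printStackTrace();
}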

Solution

  • You are probably trying to get a string containing characters whose code points are 0x10000 or greater. That bug is known and is in the process of being addressed... I will notify you once the fix is out. This question may be identical to this one... Map supplementary Unicode characters to BMP (if possible)
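For illustration, here is one way a code point at or above 0x10000 can end up printed as "¢". This is only a guess at the mechanism, not a confirmed description of VTD-XML internals; the class and method names are hypothetical, and 0x200A2 is assumed to be the code point of the third literal in the question.

```java
final class CodePointAppend {
    // Lossy: a cast to char keeps only the low 16 bits of the code point.
    static String viaCharCast(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    // Safe: appendCodePoint emits a surrogate pair for supplementary code points.
    static String viaCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        int cp = 0x200A2;
        // 0x200A2 truncated to 16 bits is 0x00A2, the cent sign.
        System.out.printf("U+%04X%n", (int) viaCharCast(cp).charAt(0)); // U+00A2
        System.out.printf("U+%X%n", viaCodePoint(cp).codePointAt(0));   // U+200A2
    }
}
```

The same truncation would turn the other two literals into control characters, which fits with only one visible symbol appearing in the output.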