vtd-xml

VTD-XML element fragment incorrect


When parsing an XML document (in UTF-8) that contains a special character such as © with VTD-XML, I find that the element fragment returned by getElementFragment is not correct.

Example code:

VTDGen vg = new VTDGen();
String xmlDocument =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n" + 
        "<Root>\r\n" + 
        "  <!-- © -->\r\n" + 
        "  <SomeElement/>\r\n" + 
        "</Root>";
// For some reason with US_ASCII it does work, although the file is UTF-8.
vg.setDoc(xmlDocument.getBytes(StandardCharsets.UTF_8));
// True or false doesn't matter here, same result.
vg.parse(false);
// Find the element and its fragment.
VTDNav nv = vg.getNav();
AutoPilot ap = new AutoPilot(nv);
ap.selectXPath("//SomeElement");
while ((ap.evalXPath()) != -1) {
    long elementOffset = nv.getElementFragment();
    int contentStartIndex = (int)elementOffset;
    int contentEndIndex = contentStartIndex + (int)(elementOffset>>32);
    System.out.println("Returned fragment: " + contentStartIndex + ":" + contentEndIndex + ":\n'" + xmlDocument.substring(contentStartIndex, contentEndIndex) + "'");
}

This returns:

Returned fragment: 65:79:
'SomeElement/>
'

After changing StandardCharsets.UTF_8 to StandardCharsets.US_ASCII, it does work:

Returned fragment: 64:78:
'<SomeElement/>'

When the input file is a UTF-8 file, this leads to incorrect behaviour. Can this be a bug in VTD-XML, or am I doing something wrong here?


Solution

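  • The "©" takes two bytes in UTF-8 but is a single char in a Java String, so the char offsets used by String.substring drift from the byte offsets returned by getElementFragment by 1. With US_ASCII, getBytes replaces "©" with a single "?" byte, which is why the two offsets happen to agree there. This is not a bug. The fix is to let VTDNav extract the fragment from the byte offset and length: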

    while (ap.evalXPath() != -1) {
        long elementOffset = nv.getElementFragment();
        // Lower 32 bits: fragment offset; upper 32 bits: fragment length (in bytes for this UTF-8 document).
        int contentStartIndex = (int) elementOffset;
        int contentLength = (int) (elementOffset >> 32);
        int contentEndIndex = contentStartIndex + contentLength;
        // Let VTDNav decode the byte range; xmlDocument.substring(...) would index by char instead.
        System.out.println("Returned fragment: " + contentStartIndex + ":" + contentEndIndex + ":\n'"
                + nv.toString(contentStartIndex, contentLength) + "'");
    }
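
The drift can be reproduced without VTD-XML at all. The following is a minimal sketch using only the JDK; the class name OffsetDrift and the helpers byteOffsetOf and sliceBytes are hypothetical names introduced for illustration, not part of VTD-XML. It compares the char index of the element in the String with its byte offset in the UTF-8 encoding, and shows that slicing the byte array (rather than the String) recovers the fragment.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDrift {
    // Hypothetical helper: number of UTF-8 bytes occupied by the first charIndex chars of text.
    static int byteOffsetOf(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: decode a byte range of a UTF-8 document back into text.
    static String sliceBytes(byte[] doc, int byteOffset, int byteLength) {
        return new String(doc, byteOffset, byteLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same document as in the question; \u00A9 is the copyright sign.
        String xml =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n" +
                "<Root>\r\n" +
                "  <!-- \u00A9 -->\r\n" +
                "  <SomeElement/>\r\n" +
                "</Root>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = xml.getBytes(StandardCharsets.US_ASCII);

        String element = "<SomeElement/>";
        int charIndex = xml.indexOf(element);
        int byteOffset = byteOffsetOf(xml, charIndex);

        // The copyright sign is 1 char but 2 UTF-8 bytes, so the offsets differ by 1.
        System.out.println("char index: " + charIndex + ", byte offset: " + byteOffset);
        // US_ASCII replaces the unmappable sign with one '?' byte, so lengths match.
        System.out.println("chars: " + xml.length() + ", UTF-8 bytes: " + utf8.length
                + ", US_ASCII bytes: " + ascii.length);
        // Using the byte offset on the String is off by one char ...
        System.out.println("'" + xml.substring(byteOffset, byteOffset + element.length()) + "'");
        // ... while using it on the byte array yields the element.
        System.out.println("'" + sliceBytes(utf8, byteOffset, element.length()) + "'");
    }
}
```

For this document the char index is 64 and the byte offset is 65, matching the two outputs in the question. Note that element.length() only doubles as a byte length here because the element itself is pure ASCII.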