java xml vtd-xml

How to improve the performance of querying an XML file with VTD-XML and XPath?


I am querying XML files of around 1 MB (20k+ lines). I use XPath to describe what I want and the VTD-XML library to retrieve it, and I suspect I have a performance problem.

The problem is that I run 5k+ queries against each XML file, and retrieving all the values takes approximately 16-17 seconds. Is this normal performance for such a task? How can I improve it?

I am using the VTD-XML library with the AutoPilot navigation approach, which lets me use XPath. The implementation is as follows:

private VTDGen vg = new VTDGen();
private VTDNav vn;
private AutoPilot ap = new AutoPilot();

public void init(String xml) {
    log.info("Creating document");
    xml = xml.replace("<?xml version=\"1.0\"?>", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
    vg.setDoc(bytes);
    try {
        vg.parse(true);
        vn = vg.getNav();
    } catch (ParseException e) {
        e.printStackTrace();
    }
    log.info("Document created");
}

public String parseXmlOrReturnNull(String query) {
    String xPathStringVal = null;
    try {
        ap.selectXPath(query);
        ap.bind(vn);
        int i = -1;
        while ((i = ap.evalXPath()) != -1) {
            xPathStringVal = vn.getXPathStringVal();
        }
    } catch (XPathEvalException e) {
        e.printStackTrace();
    } catch (NavException e) {
        e.printStackTrace();
    } catch (XPathParseException e) {
        e.printStackTrace();
    }
    return xPathStringVal;
}

My XML files have a specific format: they are divided into many parts (segments), and my queries are the same for every segment (I run them in a loop). For example, part of the XML:

<segment>
    <a>
        <b>value1</b>
        <c>
            <d>value2</d>
            <e>value3</e>
        </c>
    </a>
</segment>
<segment>
    <a>
        <b>value4</b>
        <c>
            <d>value5</d>
            <e>value6</e>
            <f>value6</f>
        </c>
    </a>
</segment>
...

If I want to get value1 from the first segment, I use the query:

//segment[1]/a/b

For value4 in the second segment:

//segment[2]/a/b

etc.

My intuition says that in my approach every query is independent (it knows nothing about the other queries), which means that AutoPilot, my iterator, always starts at the beginning of the file for each query.

My question is: is there a way to position AutoPilot at the beginning of the segment being processed and, when I finish querying it, move AutoPilot to the next segment? I think that if my method started searching from a specified point rather than from the beginning, it would be much faster.

Another option is to split the XML file into small XML files (one file per segment) and query those.

What do you think? Thanks in advance.


Solution

  • Minor: The replace is not needed, as UTF-8 is the default encoding; only when the declaration names an encoding would one need to patch it to UTF-8.

    The XPath should be evaluated only once, rather than restarting from the first segment and scanning to the next index for every query.
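A minimal sketch of the "evaluate once" idea follows. It deliberately uses the JDK's own javax.xml.xpath API instead of VTD-XML, so it shows the shape of the approach rather than the library's API: `//segment` is evaluated a single time, and a short relative path (`a/b`) is then evaluated with each segment as the context node. The class name `SegmentQueries`, the method `readB`, and the wrapping `<root>` element in the usage note are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original question or answer.
class SegmentQueries {

    /** Returns the text of a/b for every segment, in document order. */
    static List<String> readB(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Both expressions are compiled once and reused for all segments.
        XPathExpression allSegments = xpath.compile("//segment");
        XPathExpression bOfSegment = xpath.compile("a/b");

        // The document-wide search happens exactly once.
        NodeList segments = (NodeList) allSegments.evaluate(doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < segments.getLength(); i++) {
            // Relative path: the current segment is the context node,
            // so no segment[i] predicate is needed.
            values.add(bOfSegment.evaluate(segments.item(i)));
        }
        return values;
    }
}
```

Called as `SegmentQueries.readB(xml)` on a document whose segments sit under a single root element, this returns one string per segment. The point is that the per-segment queries no longer carry a `segment[n]` index, so nothing has to count segments from the top of the file again.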

    If you need a List representation, you could use JAXB with annotations.

    Event-based, primitive parsing without DOM objects (SAXParser) is probably best.

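    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    DefaultHandler handler = new DefaultHandler() {
        @Override
        public void startElement(String uri,
            String localName, String qName, Attributes attributes) throws SAXException {
            // Called for every opening tag, e.g. <segment>.
        }

        @Override
        public void endElement(String uri,
            String localName, String qName) throws SAXException {
            // Called for every closing tag, e.g. </segment>.
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Called with the text between tags (possibly in several chunks).
        }
    };
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    // bytes is the UTF-8 byte array built in init().
    InputStream in = new ByteArrayInputStream(bytes);
    parser.parse(in, handler);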
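As a rough illustration of how the empty callbacks above might be filled in, the sketch below collects the leaf values of each segment into a map during a single pass over the document. The class name `SegmentSaxReader`, the method `read`, and the choice of one `Map` per segment keyed by element name are assumptions, not something the original answer specifies; repeated element names inside one segment would overwrite each other in this simple form.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original answer.
class SegmentSaxReader {

    /** Returns one map per segment: leaf element name -> trimmed text. */
    static List<Map<String, String>> read(String xml) throws Exception {
        List<Map<String, String>> segments = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> current;               // null outside a segment
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("segment".equals(qName)) {
                    current = new LinkedHashMap<>();
                }
                text.setLength(0);                             // start collecting fresh text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);                // text may arrive in chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("segment".equals(qName)) {
                    segments.add(current);
                    current = null;
                } else if (current != null) {
                    String value = text.toString().trim();
                    if (!value.isEmpty()) {                    // skip container elements
                        current.put(qName, value);
                    }
                }
                text.setLength(0);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return segments;
    }
}
```

With this shape, `SegmentSaxReader.read(xml).get(0).get("b")` plays the role of the `//segment[1]/a/b` query, and all segments are filled in during the same pass instead of one search per value.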