java xml jaxb stax random-access

Using StAX to create index for XML for quick access


Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?

I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.

So my idea is this: Create an index and then quickly access data from the large file.

I can't just split the file because it's an official federal database that I want to use unaltered.

Using an XMLStreamReader, I can quickly find an element and then use JAXB to unmarshal it.

    final XMLInputFactory xf = XMLInputFactory.newInstance();
    final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
    final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
    final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
    r.nextTag();

    while (r.hasNext()) {

        final int eventType = r.next();
        if (eventType == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("foo")
                && Long.parseLong(r.getAttributeValue(null, "bla")) == bla
                ) {
            // JAX-B works just fine:
            final JAXBElement<Foo> foo = unmarshaller.unmarshal(r,Foo.class);
            System.out.println(foo.getValue().getName());
            // But how do I get the offset?
            // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
            break;
        }
    }

But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)

Then I should be able to use the offset to unmarshal from there: open a file stream, skip that many bytes, unmarshal. I can't find a library that does this, and I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.


Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long. But it doesn't require RAM. It requires 300 MB of RAM to just use JAX-B. Using some embedded db system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway. Anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.

I can't find a DB that just needs an XSD to create an in-memory DB, which doesn't use that much RAM. It's all made for servers or it's required to define a schema and map the XML. So I assume it just doesn't exist.


Solution

  • You could work with a generated XML parser using ANTLR4.

    The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.

    1. Get XML Grammar

    cd /tmp
    git clone https://github.com/antlr/grammars-v4
    

    2. Generate Parser

    cd /tmp/grammars-v4/xml/
    mvn clean install
    

    3. Copy Generated Java files to your Project

    cp -r target/generated-sources/antlr4 /path/to/your/project/gen
    

    4. Hook in with a Listener to collect character offsets

    package stack43366566;
    
    import java.util.ArrayList;
    import java.util.List;
    
    import org.antlr.v4.runtime.ANTLRFileStream;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;
    
    import stack43366566.gen.XMLLexer;
    import stack43366566.gen.XMLParser;
    import stack43366566.gen.XMLParser.DocumentContext;
    import stack43366566.gen.XMLParserBaseListener;
    
    public class FindXmlOffset {
    
        List<Integer> offsets = null;
        String searchForElement = null;
    
        public class MyXMLListener extends XMLParserBaseListener {
            public void enterElement(XMLParser.ElementContext ctx) {
                String name = ctx.Name().get(0).getText();
                if (searchForElement.equals(name)) {
                    offsets.add(ctx.start.getStartIndex());
                }
            }
        }
    
        public List<Integer> createOffsets(String file, String elementName) {
            searchForElement = elementName;
            offsets = new ArrayList<>();
            try {
                XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                XMLParser parser = new XMLParser(tokens);
                DocumentContext ctx = parser.document();
                ParseTreeWalker walker = new ParseTreeWalker();
                MyXMLListener listener = new MyXMLListener();
                walker.walk(listener, ctx);
                return offsets;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    
        public static void main(String[] arg) {
            System.out.println("Search for offsets.");
            List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
                            "page");
            System.out.println("Offsets: " + offsets);
        }
    
    }
    

    5. Result

    Prints:

    Offsets: [2441, 10854, 30257, 51419 ....

    6. Read from Offset Position

    To test the code, I've written a class that reads each Wikipedia page into a Java object

    @JacksonXmlRootElement
    class Page {
        public Page() {}
        public String title;
    }
    

    using basically this code

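    private Page readPage(Integer offset, String filename) {
        try (Reader in = new FileReader(filename)) {
            // skip() may skip fewer characters than requested, so loop until done.
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("Offset " + offset + " is past the end of " + filename);
                }
                remaining -= skipped;
            }
            ObjectMapper mapper = new XmlMapper();
            // Ignore all child elements of <page> that Page does not declare.
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            return mapper.readValue(in, Page.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }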
    

    The complete example is available on GitHub.
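
One caveat with the approach above: the listener collects character offsets, while the question asks for "skip that many bytes". `Reader.skip` accepts a character count, but it still has to decode everything in front of the offset. For true random access, the character offsets can be converted to byte offsets once, and the file can then be opened at that byte position. The following is only a sketch under stated assumptions: the file is UTF-8, the offsets count UTF-16 code units (as `String.indexOf` or a `Reader` would), and the offsets are sorted ascending. The class name `ByteOffsetIndex` and both method names are hypothetical and not part of the original example. It uses only the JDK's StAX parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: assumes a UTF-8 file and ascending character offsets.
public class ByteOffsetIndex {

    /** Converts character offsets into UTF-8 byte offsets in a single pass. */
    public static List<Long> toByteOffsets(Path file, List<Integer> charOffsets) throws IOException {
        List<Long> byteOffsets = new ArrayList<>();
        long charPos = 0;
        long bytePos = 0;
        int next = 0;
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (next < charOffsets.size()) {
                if (charPos == charOffsets.get(next)) {
                    byteOffsets.add(bytePos);
                    next++;
                    continue;
                }
                int c = in.read();
                if (c < 0) {
                    break; // offsets past the end of the file are dropped
                }
                bytePos += utf8Length((char) c);
                charPos++;
            }
        }
        return byteOffsets;
    }

    /** Number of UTF-8 bytes contributed by one UTF-16 code unit. */
    private static int utf8Length(char c) {
        if (c < 0x80) return 1;
        if (c < 0x800) return 2;
        if (Character.isHighSurrogate(c)) return 4; // the whole pair is counted here
        if (Character.isLowSurrogate(c)) return 0;  // already counted with the high half
        return 3;
    }

    /**
     * Seeks to byteOffset, treats the element starting there as the document root,
     * and returns the text of its first descendant named childName (or null).
     */
    public static String readChildText(Path file, long byteOffset, String childName)
            throws IOException, XMLStreamException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            channel.position(byteOffset); // no decoding of the preceding bytes
            InputStream in = Channels.newInputStream(channel);
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in, "UTF-8");
            try {
                int depth = 0;
                while (r.hasNext()) {
                    int event = r.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        if (depth > 0 && childName.equals(r.getLocalName())) {
                            return r.getElementText();
                        }
                        depth++;
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        if (--depth == 0) {
                            return null; // stop before the parser reaches the next sibling
                        }
                    }
                }
                return null;
            } finally {
                r.close();
            }
        }
    }
}
```

With the offsets from step 5, a call such as `ByteOffsetIndex.readChildText(file, byteOffsets.get(0), "title")` would return the first page's title. Stopping when the first element closes matters because everything after it is, from the parser's point of view, illegal content after the root element. Instead of reading text directly, the positioned `XMLStreamReader` could presumably be handed to `unmarshaller.unmarshal(r, Foo.class)` once it sits on the start element, as in the question's code.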