java xml serialization marshalling stax

Efficient way to reduce memory (RAM) consumption while writing huge file data into XML


I have to write seven lists to an XML file, and each list is between 1 GB and 5 GB in size.

The expected output XML file is as follows:

<doc>
    <items1>
        <itemA>..</itemA>
        ..
    </items1>

    <items2>
        <itemB>..</itemB>
        ..
    </items2>

    <items3>
        <itemC>..</itemC>
        ..
    </items3>
    .
    .
    .
    <items7>
        <itemG>..</itemG>
        ..
    </items7>
</doc>  

Java objects are like:

// List is an interface, so a concrete implementation such as ArrayList is needed
List<ItemA> items1 = new ArrayList<>(); // 1GB-5GB
List<ItemB> items2 = new ArrayList<>(); // 1GB-5GB
List<ItemC> items3 = new ArrayList<>(); // 1GB-5GB
List<ItemD> items4 = new ArrayList<>(); // 1GB-5GB
List<ItemE> items5 = new ArrayList<>(); // 1GB-5GB
List<ItemF> items6 = new ArrayList<>(); // 1GB-5GB
List<ItemG> items7 = new ArrayList<>(); // 1GB-5GB

Wrapping all the lists in a single Java object (catalogue) and marshalling it in one go consumes a lot of memory, and every time the lists grow we have to scale our infrastructure. Below is the code:

JAXBContext.newInstance("ta").createMarshaller().marshal(new ObjectFactory().createCatalogue(catalogue), new FileOutputStream(fileName));

Here catalogue is a Java object containing all seven lists.

Is there a smart way to reduce memory consumption by writing the data in chunks? I explored StAX for this, but I could not find a method for writing a list of data.

Is there any way in Java to write up to 20 GB to XML efficiently, without scaling up the RAM of our infrastructure?

We want to write each list separately, and the part of the file that has already been written should not be loaded into the heap while the next list is being written.


Solution

  • Using StAX is most likely the best way, not only because you don't have to keep the whole XML document in memory, but also because you don't have to keep all items in memory. I don't know where you looked for writing with StAX, but I found the following in The Java EE 5 Tutorial:

    The following example, taken from the StAX specification, shows how to instantiate an output factory, create a writer, and write XML output:

    XMLOutputFactory output = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = output.createXMLStreamWriter( ... );
    writer.writeStartDocument(); 
    writer.setPrefix("c","http://c");
    writer.setDefaultNamespace("http://c");
    writer.writeStartElement("http://c","a");
    writer.writeAttribute("b","blah");
    writer.writeNamespace("c","http://c");
    writer.writeDefaultNamespace("http://c");
    writer.setPrefix("d","http://c");
    writer.writeEmptyElement("http://c","d");
    writer.writeAttribute("http://c","chris","fry");
    writer.writeNamespace("d","http://c"); 
    writer.writeCharacters("Jean Arp"); 
    writer.writeEndElement(); 
    writer.flush(); 
    

    This code generates the following XML (new lines are non-normative):

    <?xml version='1.0' encoding='utf-8'?>
    <a b="blah" xmlns:c="http://c" xmlns="http://c">
      <d:d d:chris="fry" xmlns:d="http://c"/>
      Jean Arp
    </a> 
    

    Edit: I also notice that there's a section on generating XML with StAX in the link you posted. Also, note that there's nothing special about "writing a list", you just iterate over the list and write one tag per entry. Something like this:

    XMLStreamWriter writer = ...;
    writer.writeStartDocument();
    writer.writeStartElement("doc");
    
    // Write the first list:
    writer.writeStartElement("items1");
    for (ItemA e: items1) {
      writer.writeStartElement("itemA");
      // TODO: Write attributes, sub-elements, text or whatever is needed
      writer.writeEndElement();
    }
    writer.writeEndElement();
    
    // TODO: Write items2, items3, ..., items7 in the same fashion as items1
    
    // Close document
    writer.writeEndElement();
    writer.writeEndDocument();
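
The TODO inside the loop above might be filled in along the following lines. This is only a sketch: the question does not show the fields of ItemA, so the `id` and `name` fields, the `writeItemA` helper, and the class name `ItemWriterSketch` are all hypothetical. It uses only `javax.xml.stream`, which ships with the JDK.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class ItemWriterSketch {

    // Hypothetical item shape; the real ItemA is not shown in the question.
    static final class ItemA {
        final long id;
        final String name;

        ItemA(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Writes one element of the form <itemA id="..."><name>...</name></itemA>.
    static void writeItemA(XMLStreamWriter writer, ItemA item) throws XMLStreamException {
        writer.writeStartElement("itemA");
        writer.writeAttribute("id", Long.toString(item.id));
        writer.writeStartElement("name");
        writer.writeCharacters(item.name); // the writer escapes characters such as & and <
        writer.writeEndElement(); // closes <name>
        writer.writeEndElement(); // closes <itemA>
    }

    // Small demonstration that writes a single item into a String.
    static String demo() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writeItemA(writer, new ItemA(1, "Tom & Jerry"));
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo());
    }
}
```

One such helper per item type keeps the loop body down to a single call, for example `writeItemA(writer, e);`.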
    

    The XMLStreamWriter is low-level, which means it does little more than write XML to a stream, but it's not complicated. You may end up with quite a few more lines of code than with JAXB, but that code won't be particularly hard to write.
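
To illustrate the point about not keeping all items in memory, here is a minimal end-to-end sketch that pulls items one at a time from an `Iterator` and streams them to a file. The class name `StreamingCatalogueWriter`, the `writeSection` and `writeCatalogue` helpers, the String-valued items, and the temporary file are assumptions made for illustration. In a real setup the iterators would be backed by whatever produces the data (for example a database cursor), and only two of the seven sections are shown.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StreamingCatalogueWriter {

    // Writes <wrapper><tag>value</tag>...</wrapper>, consuming one value at a time
    // so that the full list never has to be held in memory by this method.
    static void writeSection(XMLStreamWriter w, String wrapper, String tag,
                             Iterator<String> values) throws XMLStreamException {
        w.writeStartElement(wrapper);
        while (values.hasNext()) {
            w.writeStartElement(tag);
            w.writeCharacters(values.next());
            w.writeEndElement();
        }
        w.writeEndElement();
    }

    // Writes the whole document to the target file through a buffered stream.
    static void writeCatalogue(Path target, Iterator<String> itemsA, Iterator<String> itemsB)
            throws IOException, XMLStreamException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("doc");
            writeSection(w, "items1", "itemA", itemsA);
            writeSection(w, "items2", "itemB", itemsB);
            // items3 .. items7 would follow the same pattern
            w.writeEndElement();
            w.writeEndDocument();
            w.flush();
            w.close(); // does not close the underlying stream; try-with-resources does that
        }
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("catalogue", ".xml");
        writeCatalogue(target,
                Arrays.asList("a0", "a1").iterator(),
                Arrays.asList("b0").iterator());
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

With this shape, memory use is bounded by the writer's and stream's buffers plus whatever the iterator itself holds, rather than by the size of the output file.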