I am in the process of optimizing an existing XML-to-XML transformation process in terms of memory consumption. We are transforming large multi-GB XML files into a much smaller internal XML structure; the result is less than 10% of the source size. The transformation is implemented in 4 different XSLT stages. Basically we do:
sourcefile.xml -> xslt1 -> xslt2 -> xslt3 -> xslt4 -> targetfile.xml
This was implemented as a non-streamable chained transformation using the SAX API with Saxon. By now all 4 transformations have been adjusted to be burst-streamable. As part of this we have also changed the calling Java implementation to use the Saxon s9api with Xslt30Transformer (tests were done using Saxon 10.6).
We are seeing the following pattern with a test source file of 500 MB: when the transformations are chained with
trafo.asDocumentDestination(nextTrafo)
we need 4 GB of memory to run it; otherwise it stops with "GC Overhead limit exceeded". We can reproduce this by chaining only 2 of the 4 trafos.
We are asking ourselves whether this is expected and normal, since chained streaming might need to buffer the data between the trafos, or whether there is an issue in our implementation.
So we could obviously save to disk between the trafos and have 4 separate steps, each consuming 200 MB. But is it really optimal to stream to disk in between the trafos?
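A minimal sketch of the disk-based variant described above, to make the trade-off concrete. It is only an illustration under assumptions: `StagedPipeline` and `Stage` are hypothetical names, not part of Saxon or of the code in question, and the demo stages merely append a marker to a text file. In a real setup each stage would presumably wrap one transformer run that reads the previous file and serializes to the next one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class StagedPipeline {

    /** Hypothetical abstraction of one transformation step: reads one file, writes another. */
    interface Stage {
        void run(Path in, Path out) throws IOException;
    }

    /** Runs the stages one after another, handing data over through temporary files. */
    static void runStaged(Path source, Path target, List<Stage> stages) throws IOException {
        Path current = source;
        for (int i = 0; i < stages.size(); i++) {
            boolean last = (i == stages.size() - 1);
            // The last stage writes the real target; all others write a temp file.
            Path next = last ? target : Files.createTempFile("stage" + (i + 1) + "-", ".xml");
            try {
                stages.get(i).run(current, next);
            } finally {
                // The previous intermediate file is no longer needed;
                // the caller's source file is never deleted.
                if (!current.equals(source)) {
                    Files.deleteIfExists(current);
                }
            }
            current = next;
        }
    }

    /** Tiny demo with stand-in stages that only append a marker to the text. */
    static String demo() {
        try {
            Path source = Files.createTempFile("source-", ".xml");
            Path target = Files.createTempFile("target-", ".xml");
            Files.writeString(source, "<a/>");
            Stage appendOne = (in, out) -> Files.writeString(out, Files.readString(in) + "1");
            Stage appendTwo = (in, out) -> Files.writeString(out, Files.readString(in) + "2");
            runStaged(source, target, List.of(appendOne, appendTwo));
            String result = Files.readString(target);
            Files.delete(source);
            Files.delete(target);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <a/>12
    }
}
```

With this shape only one stage is active at a time, so the memory requirement is that of the most expensive single stage, at the cost of writing and re-parsing each intermediate file.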
Is this behaviour to be expected or is there something wrong in our implementation?
--- edited, added more information that seems to indicate that streaming is working also in the chained case:
I will now try with the newest Saxon 12.x release (I was using 10.6 for the tests, as the customer is currently still using it in production).
If this does not change anything, I will reduce and isolate the case and open a support ticket, as Michael suggested below.
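Our (simplified) code looks like this:
// trafo1..trafo4 are Xslt30Transformer instances obtained via xsltCompiler.compile(...).load30()
Serializer finalDest = trafo4.newSerializer(Files.newOutputStream(outFile));
// applyTemplates expects a Source, so the InputStream is wrapped in a StreamSource
StreamSource input = new StreamSource(Files.newInputStream(inFile));
// each transformer pushes its result into the next one; the last one writes to the serializer
trafo1.applyTemplates(input,
    trafo2.asDocumentDestination(
        trafo3.asDocumentDestination(
            trafo4.asDocumentDestination(finalDest))));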
PS: Is there an easy way to see or measure the memory needs of a transformation? So far the only way I have found is to play with -Xmx until I get an OutOfMemoryError.
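One possible way to get a rough number without repeatedly lowering -Xmx is to read the peak usage the JVM records per memory pool. The following is a sketch using only the standard `java.lang.management` API; `PeakHeapProbe` and its methods are hypothetical helper names, and the demo workload is just a byte array allocation. Note the limitations: the per-pool peaks may occur at different moments, so their sum can overstate the true simultaneous peak, and the figure includes garbage that had not yet been collected, so it is not the minimum heap the job needs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class PeakHeapProbe {

    /** Resets the per-pool peak counters so a later reading covers only the work that follows. */
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    /** Sums the peak "used" bytes of all heap pools since the last reset. */
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage peak = pool.getPeakUsage();
                if (peak != null) { // null if the pool is no longer valid
                    total += peak.getUsed();
                }
            }
        }
        return total;
    }

    /** Runs the given work and returns the summed heap-pool peaks observed up to its end. */
    static long measure(Runnable work) {
        resetPeaks();
        work.run();
        return peakHeapBytes();
    }

    public static void main(String[] args) {
        // Stand-in workload; the transformation call would go here instead.
        long peak = measure(() -> {
            byte[] block = new byte[32 * 1024 * 1024];
            block[0] = 1;
        });
        System.out.println("approx. peak heap: " + (peak / (1024 * 1024)) + " MB");
    }
}
```

To use it for a transformation, the call would go inside the Runnable, with checked exceptions caught or wrapped there. Alternatively, the standard JVM option -verbose:gc prints the heap occupancy after each collection, which gives a similar picture without code changes.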
For your test sample, I have tried to use the JAXP StreamingTransformerFactory instead of s9api, just to check whether the accumulators work that way and whether the out-of-memory error is avoided.
It indeed appears that, as a workaround, you could do the streamed chaining through JAXP and StreamingTransformerFactory. Sample Java code for two stylesheets is, e.g.:
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import com.saxonica.config.StreamingTransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.SAXException;
import java.io.File;
public class SimpleTransformAndSplitJAXP {
    // Arguments: args[0] = source XML, args[1] = first stylesheet,
    // args[2] = second stylesheet, args[3] = result file
    public static void main(String[] args) throws TransformerException, SAXException {
        SAXTransformerFactory transformerFactory = new StreamingTransformerFactory();
        transformerFactory.setAttribute("http://saxon.sf.net/feature/timing", true);
        Transformer transformer1 = transformerFactory.newTransformer(new StreamSource(args[1]));
        Transformer transformer2 = transformerFactory.newTransformer(new StreamSource(args[2]));
        // transformer2 reads its input from an XMLReader that is fed by transformer1
        transformer2.transform(
            new SAXSource(
                new Transformer1OutputReader(transformer1, new StreamSource(args[0])),
                null
            ),
            new StreamResult(new File(args[3]))
        );
    }
}

// Presents the SAX output of transformer1 as an XMLReader, so it can serve as input of transformer2
class Transformer1OutputReader extends XMLFilterImpl {
    Transformer transformer1;
    Source source1;

    Transformer1OutputReader(Transformer transformer1, Source source1) {
        this.transformer1 = transformer1;
        this.source1 = source1;
    }

    // Runs transformer1 and sends its result events to the registered ContentHandler
    void parseImpl() {
        try {
            transformer1.transform(source1, new SAXResult(getContentHandler()));
        }
        catch (TransformerException e) {
            e.printStackTrace();
        }
    }

    // The argument is ignored: the events always come from transformer1
    @Override
    public void parse(InputSource input) {
        parseImpl();
    }

    @Override
    public void parse(String systemId) {
        parseImpl();
    }

    // Features and properties set by the consumer are accepted and ignored
    @Override
    public void setFeature(String name, boolean value) {
    }

    @Override
    public void setProperty(String name, Object value) {
    }
}
To avoid having to set or change initial modes, I have slightly rewritten the two stylesheets:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
  <xsl:mode streamable="yes" on-no-match="shallow-copy" />
  <xsl:mode name="grounded" streamable="no" on-no-match="shallow-copy" />
  <xsl:template match="ExportContent" mode="#all">
    <xsl:message>starting trafo xslt1</xsl:message>
    <ExpCont>
      <xsl:apply-templates select="copy-of(Document)" mode="grounded"/>
    </ExpCont>
  </xsl:template>
  <xsl:template mode="grounded" match="Document">
    <TransformedDoc>
      <xsl:apply-templates select="*" mode="#current"/>
    </TransformedDoc>
  </xsl:template>
</xsl:stylesheet>
and
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="xs math saxon"
    version="3.0">
  <xsl:mode streamable="yes" on-no-match="shallow-copy" use-accumulators="#all"/>
  <xsl:mode name="grounded" streamable="no" on-no-match="shallow-copy"/>
  <xsl:accumulator name="ExportHeader" initial-value="()" streamable="yes" as="element(ExportHeader)*">
    <xsl:accumulator-rule phase="end" saxon:capture="yes" match="ExportHeader" select="."/>
  </xsl:accumulator>
  <xsl:accumulator name="DocCount" initial-value="0" streamable="yes" as="xs:integer">
    <xsl:accumulator-rule match="TransformedDoc" select="$value + 1"/>
  </xsl:accumulator>
  <xsl:template match="ExpCont" mode="#all">
    <xsl:message>starting second xslt2</xsl:message>
    <ExpCont>
      <xsl:apply-templates select="copy-of(TransformedDoc)" mode="grounded"/>
    </ExpCont>
    <xsl:message select="concat('transformed Doc:', accumulator-after('DocCount'))"/>
  </xsl:template>
  <xsl:template mode="grounded" match="TransformedDoc">
    <Doc>
      <xsl:attribute name="PrintDate" select="accumulator-before('ExportHeader')/PrintDate"/>
      <xsl:apply-templates select="*" mode="#current"/>
    </Doc>
    <xsl:message select="concat('done: Doc: ',accumulator-after('DocCount'))"/>
  </xsl:template>
</xsl:stylesheet>
That way, it seems, the streaming works with the memory constraint of -Xmx200M, and the accumulators also work, e.g. the attribute PrintDate="2024-12-24" in the result is filled.
For chaining more stylesheets with JAXP, the example https://saxonica.plan.io/projects/saxonmirrorhe/repository/he/revisions/he_mirror_saxon_12_5/entry/src/samples/java/he/JAXPExamples.java#L695 might be worth a look. The following is a sample console program that expects the source document as the first command line argument, the result URI as the second, and the XSLT stylesheets as the remaining arguments:
import javax.xml.transform.sax.SAXTransformerFactory;
import com.saxonica.config.StreamingTransformerFactory;
import javax.xml.transform.TransformerException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.XMLReader;
import org.xml.sax.XMLFilter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.IOException;
public class ChainXMLFilters {
    // Arguments: args[0] = source document, args[1] = result URI, args[2..] = stylesheets
    public static void main(String[] args) throws ParserConfigurationException, IOException, TransformerException, SAXException {
        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
        saxFactory.setNamespaceAware(true);
        XMLReader xmlReader = saxFactory.newSAXParser().getXMLReader();
        StreamingTransformerFactory transformerFactory = new StreamingTransformerFactory();
        transformerFactory.setAttribute("http://saxon.sf.net/feature/timing", true);
        SAXTransformerFactory saxTransformerFactory = (SAXTransformerFactory)transformerFactory;
        // One XMLFilter per stylesheet
        XMLFilter[] xmlFilters = new XMLFilter[args.length - 2];
        for (int i = 2; i < args.length; i++) {
            xmlFilters[i - 2] = saxTransformerFactory.newXMLFilter(new StreamSource(args[i]));
        }
        // The first filter reads from the parser, each further filter from its predecessor
        xmlFilters[0].setParent(xmlReader);
        for (int i = 1; i < xmlFilters.length; i++) {
            xmlFilters[i].setParent(xmlFilters[i - 1]);
        }
        // An identity TransformerHandler serializes the output of the last filter
        TransformerHandler resultSerializer = saxTransformerFactory.newTransformerHandler();
        resultSerializer.setResult(new StreamResult(args[1]));
        xmlFilters[xmlFilters.length - 1].setContentHandler(resultSerializer);
        // Calling parse on the last filter pulls the source through the whole chain
        xmlFilters[xmlFilters.length - 1].parse(new InputSource(new File(args[0]).toURI().toString()));
    }
}
Perhaps this helps as a workaround until https://saxonica.plan.io/issues/6637 is resolved and there is a 12.6 release with a fix.
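To see the filter wiring in isolation, here is a small sketch that runs with the JDK alone, without Saxon. It is purely illustrative: `SaxFilterChainDemo` and `RenameFilter` are hypothetical names, and each filter only renames one unprefixed element as a stand-in for a streamed stylesheet. The point is the same parent chain as above: the last filter's parse call pulls events from the parser through every filter, one event at a time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxFilterChainDemo {

    /** Stand-in for a stylesheet: renames one unprefixed element, passes everything else through. */
    static class RenameFilter extends XMLFilterImpl {
        private final String from;
        private final String to;

        RenameFilter(String from, String to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
            boolean hit = local.equals(from);
            super.startElement(uri, hit ? to : local, hit ? to : qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            boolean hit = local.equals(from);
            super.endElement(uri, hit ? to : local, hit ? to : qName);
        }
    }

    /** Parses the XML through two chained filters and returns the element names seen at the end. */
    static List<String> run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            XMLFilter first = new RenameFilter("Document", "TransformedDoc");
            XMLFilter second = new RenameFilter("TransformedDoc", "Doc");
            first.setParent(reader);   // first filter reads from the parser
            second.setParent(first);   // second filter reads from the first

            List<String> seen = new ArrayList<>();
            second.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    seen.add(local);
                }
            });
            // parse() on the last filter drives the whole chain
            second.parse(new InputSource(new StringReader(xml)));
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints [ExportContent, Doc, Doc]
        System.out.println(run("<ExportContent><Document/><Document/></ExportContent>"));
    }
}
```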