python xml xslt saxon

Saxon, XSLT: processing thousands of xml files in a complex tree structure


I use a Python script that iterates through thousands of XML files in a complex tree structure and executes the following Saxon command for each file:

java -cp C:\saxon\SaxonHE10-6J\saxon-he-10.6.jar net.sf.saxon.Transform -t -s:{input} -xsl:{xslt} -o:{output}

My final output is one txt file; each line corresponds to one XML file and contains a selection of element values from it.

This works well, but performance is very poor. I suppose that is because my Python script calls the Saxon command each time a new XML file is processed.

What would be the right approach to speeding up the process, if possible drastically?

Kind regards.

Excerpt from the python file:

for root, dirs, files in os.walk(folderXmlSource):

    for file in files:
        if file.endswith('.xml'):
            input = '"\\\\?\\' + str(os.path.join(root, file)) + '"'
            output = '"' + os.path.join(folderTxtTemp, file[:-4] + '.txt') + '"'
            try:
                transform(input, output)
                print(input, 'jjjjj', output)
                finalize(output)
            except:
                errorLog.write(input + '\n')

The transform function calls Saxon to run the XSLT transformation. The finalize function appends the result of each XML file's transformation to the final result file.
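For readers without the full script, the two helpers could look roughly like the sketch below. This is an assumption, not the actual code: `build_saxon_command`, `xslt_path`, `final_path` and the file names `extract.xsl` and `result.txt` are hypothetical, and only the jar path and the command-line options come from the command shown above. The point it illustrates is that every call to `transform` launches a separate `java` process.

```python
import subprocess

SAXON_JAR = r"C:\saxon\SaxonHE10-6J\saxon-he-10.6.jar"  # path taken from the command above


def build_saxon_command(input_path, xslt_path, output_path):
    """Build the argument list for one Saxon run (hypothetical helper)."""
    return [
        "java", "-cp", SAXON_JAR, "net.sf.saxon.Transform", "-t",
        f"-s:{input_path}", f"-xsl:{xslt_path}", f"-o:{output_path}",
    ]


def transform(input_path, output_path, xslt_path="extract.xsl"):
    """Run Saxon once; this starts a new JVM for every single file."""
    subprocess.run(build_saxon_command(input_path, xslt_path, output_path), check=True)


def finalize(temp_path, final_path="result.txt"):
    """Append the content of one temporary result file to the final file as one line."""
    with open(temp_path, encoding="utf-8") as src, \
            open(final_path, "a", encoding="utf-8") as dst:
        dst.write(src.read().rstrip("\n") + "\n")
```

Passing the arguments as a list avoids the manual quoting of paths that a single command string requires.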

Excerpt from the XSL file:

<!--  //System:FileName  -->

    <xsl:variable name="System:FileName">
        <xsl:choose>
            <xsl:when test="//System:FileName">
                <xsl:choose>
                    <xsl:when test="//System:FileName !=''">
                        <xsl:value-of select="//System:FileName"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>System:FileName VIDE</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:when>
            <xsl:otherwise>
                <xsl:text>System:FileName ABSENT</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:variable>

The XSL file looks for specific elements such as System:FileName. If the element exists and is not empty, its value is stored in a variable; otherwise the variable holds a marker saying that the element is empty (VIDE) or missing (ABSENT). The contents of all these variables are then concatenated into a txt file.
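To make the three cases explicit, here is the same decision logic as a small Python sketch using only the standard library. It is an illustration, not part of the original script: the namespace URI bound to the `System` prefix is not shown in the excerpt, so `SYSTEM_NS` is a made-up placeholder, `element_value` is a hypothetical helper name, and the sketch only looks at the first matching element.

```python
import xml.etree.ElementTree as ET

SYSTEM_NS = "http://example.com/system"  # hypothetical namespace URI for the System prefix


def element_value(root, local_name, ns_uri=SYSTEM_NS):
    """Return the element's text, or a marker if it is empty (VIDE) or missing (ABSENT)."""
    label = f"System:{local_name}"
    # iter() searches the root and all its descendants, similar to //System:FileName
    element = next(root.iter(f"{{{ns_uri}}}{local_name}"), None)
    if element is None:
        return f"{label} ABSENT"
    text = "".join(element.itertext())
    return text if text != "" else f"{label} VIDE"


doc = ET.fromstring(
    f'<record xmlns:System="{SYSTEM_NS}">'
    "<System:FileName>report.pdf</System:FileName><System:Author/>"
    "</record>"
)
print(element_value(doc, "FileName"))  # report.pdf
print(element_value(doc, "Author"))    # System:Author VIDE
print(element_value(doc, "Title"))     # System:Title ABSENT
```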


Solution

  • I suggest trying SaxonC 12 (e.g. the PyPI package saxonche) and changing the Python code to something like:

    import os
    from saxonche import PySaxonProcessor
    
    def transform(saxon_proc, xslt30_executable, input, output):
        # Parse the source document and use it as global context item and as input to apply-templates
        xdm_input = saxon_proc.parse_xml(xml_file_name=input)
        xslt30_executable.set_global_context_item(xdm_item=xdm_input)
        xslt30_executable.apply_templates_returning_file(xdm_value=xdm_input, output_file=output)
    
    
    with PySaxonProcessor() as saxon_proc:
        xslt30_processor = saxon_proc.new_xslt30_processor()
        # Compile the stylesheet once and reuse it for every file
        xslt30_executable = xslt30_processor.compile_stylesheet(stylesheet_file='yourXsltStylesheet.xsl')
    
        for root, dirs, files in os.walk(folderXmlSource):
            for file in files:
                if file.endswith('.xml'):
                    # No surrounding quotes: they were only needed on the command line.
                    # The long-path prefix is kept from the question; drop it if SaxonC rejects it.
                    input = '\\\\?\\' + str(os.path.join(root, file))
                    output = os.path.join(folderTxtTemp, file[:-4] + '.txt')
                    try:
                        transform(saxon_proc, xslt30_executable, input, output)
                        print(input, 'jjjjj', output)
                        finalize(output)
                    except Exception:
                        errorLog.write(input + '\n')
    

    See whether that alone already gives a drastic performance improvement. It avoids starting a new Java process and recompiling the stylesheet for every file.

    You can then also consider using multithreading with Python and SaxonC, as done in https://github.com/martin-honnen/SaxonC12ThreadPoolExecutorXSLTTransformation, to further improve performance.

    Next, I will look at the XSLT to see whether the whole job can be delegated to a single XSLT.
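The multithreading idea mentioned above can be sketched with the standard library alone. This is a generic outline under stated assumptions, not the code from the linked repository: `find_xml_files` and `transform_all` are hypothetical helper names, and `transform_one` stands for whatever function runs one SaxonC transformation and returns its result. Whether threads actually help depends on how SaxonC behaves under concurrency, so measure before relying on it.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def find_xml_files(folder):
    """Collect every .xml path below folder, in a stable order."""
    return sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(folder)
        for name in files
        if name.endswith(".xml")
    )


def transform_all(paths, transform_one, max_workers=4):
    """Run transform_one(path) on a thread pool; return (results, failed_paths)."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transform_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception:
                failed.append(path)  # same role as errorLog in the question
    return results, failed
```

Keying the results by path lets the final txt file be written in one pass after all workers have finished, so the threads never write to the same file at the same time.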