
Python Replace XML Text with Escape Sequence


I have a third-party application that is parsing magic strings within an XML file, even though it should be treating them as character literals. As an example, suppose my XML contained the following segment:

<element>Sentence containing magicString</element>

To prevent the third-party application from parsing magicString as a command, I want to convert this XML fragment to:

<element>Sentence containing &#109;agicString</element>

How can I achieve this in Python without doing a global find-replace (e.g., there may be elements named magicString that cannot be renamed without making the XML invalid)? The following illustrates what I have attempted:

from xml.etree import ElementTree
xml = ElementTree.parse(xmlPath)
element = xml.find('.//grandparent/parent/element')
element.text = '&#109;agicString'
xml.write(xmlPath)

The problem is that text assigned to the Element.text property is escaped when the tree is written, resulting in an XML file with the following contents:

<element>&amp;#109;agicString</element>

Solution

  • Here is an XSLT-based solution.

    It matches only text nodes that contain "magicString"; element names are not affected.

    A literal ampersand must always be escaped as &amp; in XML; otherwise the document would not be well-formed. This is why the output below contains &amp;#109; rather than &#109;.

    You can download the Saxon XSLT 3.0 engine for Python here: Python downloads

    Saxon Home Edition (HE) is free of charge.

    Input XML

    <root>
        <magicString>Another sentence</magicString>
        <element>Sentence containing magicString</element>
        <city>Miami</city>
    </root>
    

    XSLT 3.0

    <?xml version="1.0"?>
    <xsl:stylesheet version="3.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" omit-xml-declaration="yes"
                    encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
    
        <xsl:param name="findMe" select="'magicString'"/>
    
        <!--Identity transform-->
        <xsl:mode on-no-match="shallow-copy"/>
    
        <xsl:template match="text()[contains(., $findMe)]">
            <xsl:value-of select="replace(., $findMe, concat('&amp;#109;', substring($findMe, 2, 100)))"/>
        </xsl:template>
    </xsl:stylesheet>
    

    Output XML

    <root>
      <magicString>Another sentence</magicString>
      <element>Sentence containing &amp;#109;agicString</element>
      <city>Miami</city>
    </root>
    

    Python

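    from saxonche import PySaxonProcessor, PySaxonApiError
    
    output_file = 'output.xml'
    
    # license=False: run without a license file (sufficient for the Home Edition)
    with PySaxonProcessor(license=False) as proc:
        print(proc.version)
        try:
            # Compile the stylesheet and parse the input document
            xsltproc = proc.new_xslt30_processor()
            document = proc.parse_xml(xml_file_name='input.xml')
            executable = xsltproc.compile_stylesheet(stylesheet_file="process.xslt")
    
            # Run the transformation and serialize the result to a string
            output = executable.transform_to_string(xdm_node=document)
    
            # Write the result as UTF-8, matching the stylesheet's output encoding
            with open(output_file, 'wb') as f:
                f.write(output.encode('utf-8'))
    
        except PySaxonApiError as err:
            print('Error during function call', err)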
    

    Here's a modified version of the XSLT that substitutes a placeholder character (in this case &#178;, superscript 2) for the first letter and uses a character map to serialize that placeholder as the character reference you wanted. Just pick a placeholder character that does not already occur in your input, so nothing else gets replaced by accident.

    <xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">
        <xsl:output indent="yes" use-character-maps="magic"/>
        <xsl:strip-space elements="*"/>
        
        <xsl:character-map name="magic">
            <xsl:output-character character="&#178;" string="&amp;#109;"/>
        </xsl:character-map>
        
        <xsl:param name="findMe" select="'magicString'"/>
        
        <xsl:mode on-no-match="shallow-copy"/>
        
        <xsl:template match="text()[contains(., $findMe)]">
            <xsl:value-of select="replace(., $findMe, '&#178;'||substring($findMe, 2))"/>
        </xsl:template>
        
    </xsl:stylesheet>
    

    This produces:

    <root>
       <magicString>Another sentence</magicString>
       <element>Sentence containing &#109;agicString</element>
       <city>Miami</city>
    </root>
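
The same placeholder idea can probably be applied without XSLT, using only the standard library. The following is a minimal sketch, not a tested drop-in: the names escape_magic and PLACEHOLDER are hypothetical, and it assumes the private-use character U+E000 never occurs in your input. It rewrites only text and tail content (so elements named magicString are left alone), serializes the tree, and then swaps the placeholder for the character reference in the serialized string.

```python
from xml.etree import ElementTree as ET

# Assumption: this private-use character never appears in the input document.
PLACEHOLDER = "\ue000"


def escape_magic(xml_text: str, magic: str = "magicString") -> str:
    """Replace the first letter of `magic` in text content with a character reference."""
    root = ET.fromstring(xml_text)
    marked = PLACEHOLDER + magic[1:]

    # Touch only text and tail content; tag names and attributes are left as-is.
    for el in root.iter():
        if el.text and magic in el.text:
            el.text = el.text.replace(magic, marked)
        if el.tail and magic in el.tail:
            el.tail = el.tail.replace(magic, marked)

    # encoding="unicode" returns a str and keeps the placeholder as a literal character.
    serialized = ET.tostring(root, encoding="unicode")

    # Swap the placeholder for the numeric character reference after serialization,
    # so ElementTree never gets the chance to escape the ampersand.
    return serialized.replace(PLACEHOLDER, f"&#{ord(magic[0])};")


if __name__ == "__main__":
    sample = (
        "<root>"
        "<magicString>Another sentence</magicString>"
        "<element>Sentence containing magicString</element>"
        "</root>"
    )
    print(escape_magic(sample))
```

Note that this sketch ignores attribute values, and that ElementTree's default parser drops comments and the XML declaration, so it may not round-trip every document faithfully.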
    
