
entity translation to customized entity


There are some user-defined entities in the XML data. To output those entities, we use the code below:

<xsl:stylesheet version='3.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' >
<xsl:output method="xml" omit-xml-declaration="no" use-character-maps="mdash" />
<xsl:character-map name="mdash">
<xsl:output-character character="&#x2014;" string="&amp;mdash;"/>
<xsl:output-character character="&amp;" string="&amp;amp;" />
<xsl:output-character character="&quot;" string="&amp;quot;" />
<xsl:output-character character="&apos;" string="&amp;apos;" />
<xsl:output-character character="&#167;" string="&amp;sect;"/>
<xsl:output-character character="&#36;" string="&amp;dollar;" />
<xsl:output-character character="&#47;" string="&amp;sol;" />
<xsl:output-character character="&#45;" string="&amp;hyphen;" />
</xsl:character-map>
<!--=================================================================-->
<xsl:template match="@* | node()">
<!--=================================================================-->
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

But there is a special case where &sect; appears twice in a row in the data, for example:

Ex- The number &sect;&sect; 1234

The above example should be converted to a special user-defined entity, i.e.

Output- The number &multisect; 1234

The &sect;&sect; should be converted to &multisect;


Solution

  • If you want to use a character map, you first need to process the text nodes where you expect the two sect characters and replace the pair with a single character that is not used anywhere else. The map can then convert that character to the string &multisect;. For example, the stylesheet

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:fn="http://www.w3.org/2005/xpath-functions"
        exclude-result-prefixes="#all"
        expand-text="yes"
        version="3.0">
      
      <xsl:param name="multisect-sub" static="yes" as="xs:string" select="'«'"/>
      
      <xsl:character-map name="sub">
        <xsl:output-character _character="{$multisect-sub}" string="&amp;multisect;"/>
      </xsl:character-map>
    
      <xsl:mode on-no-match="shallow-copy"/>
    
      <xsl:output method="xml" indent="yes" use-character-maps="sub"/>
      
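      <!-- split each text node into fn:match / fn:non-match elements -->
      <xsl:template match="text()">
        <xsl:apply-templates mode="analyze" select="analyze-string(., '&#xA7;&#xA7;')"/>
      </xsl:template>
      
      <!-- each double sect becomes the placeholder; fn:non-match text is copied by the built-in rules -->
      <xsl:template mode="analyze" match="fn:match">
        <xsl:text>{$multisect-sub}</xsl:text>
      </xsl:template>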
    
    </xsl:stylesheet>
    

    transforms the input

    <!DOCTYPE text [
      <!ENTITY sect "&#xA7;">
    ]>
    <text>&sect;&sect; 1234</text>
    

    into the output

    <?xml version="1.0" encoding="UTF-8"?>
    <text>&multisect; 1234</text>
    

    Note that I used '«' only as an example; you might want to use a private-use character or some other character you are sure doesn't occur in your input/output data.

    If you want the result to be well-formed, you also need to add a doctype to the output, e.g. with xsl:output doctype-system="some.dtd", and make sure that some.dtd declares the entity, e.g. <!ENTITY multisect "&#xA7;&#xA7;">
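
    As a rough sketch of how these two notes could be combined with the character map from the question, the two declarations in the stylesheet above might be changed as follows. This assumes the question's map (still named mdash) is copied into the same stylesheet, that the private-use character U+E000 never occurs in your data, and that some.dtd is only a placeholder name for a DTD you supply yourself:

    <!-- assumption: U+E000 (private use area) is free to serve as the placeholder -->
    <xsl:param name="multisect-sub" static="yes" as="xs:string" select="'&#xE000;'"/>

    <!-- apply both character maps and emit a DOCTYPE pointing at the placeholder DTD -->
    <xsl:output method="xml" use-character-maps="mdash sub" doctype-system="some.dtd"/>

    The DTD would then have to declare every entity name the maps write out, apart from the predefined amp, quot and apos, for example:

    <!-- some.dtd (placeholder name): declare the entities the character maps produce -->
    <!ENTITY sect "&#xA7;">
    <!ENTITY multisect "&#xA7;&#xA7;">
    <!ENTITY mdash "&#x2014;">

    A single &sect; that is not part of a pair is left alone by the text() template and is still written out as &sect; by the mdash map.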