xml, xslt, xslt-2.0, invisible-xml

XSLT 2 - convert single characters to processing-instruction


I have a text node that contains 7-bit ASCII text as well as higher Unicode characters (e.g. x2011, xF0B7, x25CF, ...).

I need to be able to (efficiently) convert these single high-Unicode characters into processing instructions.

e.g.

&#x2011;  ->   <processing-instruction name="xxx">character output="hyphen"</pro...>
&#xF0B7;  ->   <processing-instruction name="xxx">character output="page"</pro...>

I've tried using a named tokenize template, which does split the text before/after the first token delimiter (e.g. x2011), but I end up with a variable containing 'text...<processing-instruction>...</processing-instruction>...text', which trips up the next tokenize call.

I managed to get the following approach to work but it looks really inelegant, and I'm sure there's a more efficient/better way to do this but I haven't found anything that works or is any better.

The first character replacement is easy, using replace(), as I'm only escaping the % (the target software uses the '%' for other things so needs to be escaped in this manner).

And yes, this would work for the x2011-to-< ... >, but the original intention was to convert to processing-instructions directly.

    <xsl:template match="text()">
        <xsl:variable name="SR1">
            <xsl:value-of select="fn:replace(., '%', '\\%')"/>
        </xsl:variable>
        <!-- unbreakable hyphen -->
        <xsl:variable name="SR2">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" select="$SR1"/>
                <xsl:with-param name="delimiter">&#x2011;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bxhyphen" hex="x2011" data="E28091"?&gt;</xsl:text></xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <!-- page ref -->
        <xsl:variable name="SR3">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" ><xsl:copy-of select="$SR2"/></xsl:with-param>
                <xsl:with-param name="delimiter">&#xF0B7;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="pgref" hex="xF0B7" data="EF82B7"?&gt;</xsl:text>
                </xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <!-- bullet -->
        <xsl:variable name="SR4">
            <xsl:call-template name="tokenize">
                <xsl:with-param name="string" ><xsl:copy-of select="$SR3"/></xsl:with-param>
                <xsl:with-param name="delimiter">&#x25CF;</xsl:with-param>
                <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bub" hex="x25CF" data="E2978F"?&gt;</xsl:text>
                </xsl:with-param>
            </xsl:call-template>
        </xsl:variable>
        <xsl:copy-of select="$SR4"/>
    </xsl:template>

Ideally, I was aiming to have a list of 'pairs', the hex unicode and its matching processing-instruction, but any better solution would be appreciated!

Another useful feature would be to flag characters that have not been processed, i.e. any characters in the ranges x00-x1F and xFF+ (excluding x2011, x25CF and xF0B7).


Solution

  • If the characters you are looking for are known and limited, I would list them in a character class, e.g.

        <xsl:template match="text()">
          <xsl:analyze-string select="." regex="[&#x2011;&#xF0B7;&#x25CF;]">
            <xsl:matching-substring>
              <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:value-of select="."/>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:template>

    where mf:map is a function you set up that maps each character to the string you want to output as the data of the PI. In XSLT 3 I would probably store the character-to-name mapping in an XPath/XSLT map; in XSLT 2 you can use an xsl:param or xsl:variable, e.g. <xsl:param name="characters-to-name"><map char="&#x2011;">bxhyphen</map>...</xsl:param>, and select into that, if needed even by setting up a key.
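
    As a rough illustration of the lookup-table idea, here is one way the pieces might fit together in XSLT 2.0. Treat it as a sketch under assumptions rather than a finished solution: the mf namespace URI, the key name map-by-char, the variable name special and the symbol/hex/data attributes on the map elements are all made up for this example (the PI name xpp and the attribute values are taken from the question's code). It also assumes that tab, line feed and carriage return should not be flagged, and it reports unmapped characters with xsl:message while copying them through unchanged.

        <xsl:stylesheet version="2.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:mf="http://example.com/mf"
            exclude-result-prefixes="xs mf">

          <!-- Lookup table, one entry per special character.
               The attribute names are assumptions made for this sketch. -->
          <xsl:variable name="characters-to-name">
            <map char="&#x2011;" symbol="bxhyphen" hex="x2011" data="E28091"/>
            <map char="&#xF0B7;" symbol="pgref" hex="xF0B7" data="EF82B7"/>
            <map char="&#x25CF;" symbol="bub" hex="x25CF" data="E2978F"/>
          </xsl:variable>

          <xsl:key name="map-by-char" match="map" use="@char"/>

          <!-- All mapped characters as one string, used to build the regex -->
          <xsl:variable name="special" as="xs:string"
              select="string-join($characters-to-name/map/@char, '')"/>

          <!-- Returns the PI data for one mapped character -->
          <xsl:function name="mf:map" as="xs:string">
            <xsl:param name="char" as="xs:string"/>
            <xsl:variable name="entry"
                select="key('map-by-char', $char, $characters-to-name)"/>
            <xsl:sequence select="concat('character symbol=&quot;', $entry/@symbol,
                                         '&quot; hex=&quot;', $entry/@hex,
                                         '&quot; data=&quot;', $entry/@data, '&quot;')"/>
          </xsl:function>

          <!-- Identity template for everything that is not a text node -->
          <xsl:template match="@* | node()">
            <xsl:copy>
              <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
          </xsl:template>

          <!-- Group 1: a mapped character.
               Group 2: any other character outside tab, LF, CR and x20-xFE. -->
          <xsl:template match="text()" priority="1">
            <xsl:analyze-string select="."
                regex="([{$special}])|([^\t\n\r&#x20;-&#xFE;])">
              <xsl:matching-substring>
                <xsl:choose>
                  <xsl:when test="regex-group(1)">
                    <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
                  </xsl:when>
                  <xsl:otherwise>
                    <!-- Flag the unmapped character, then pass it through -->
                    <xsl:message>Unmapped character, code point <xsl:value-of select="string-to-codepoints(.)"/></xsl:message>
                    <xsl:value-of select="."/>
                  </xsl:otherwise>
                </xsl:choose>
              </xsl:matching-substring>
              <xsl:non-matching-substring>
                <!-- Same % escaping as in the question -->
                <xsl:value-of select="replace(., '%', '\\%')"/>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </xsl:template>

        </xsl:stylesheet>

    Building the character class from the table means a new character only needs a new map entry. This works as long as none of the listed characters is special inside a regex character class, such as ], ^, - or \.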

    Brand new in the world of XSLT/XPath/XQuery is the recently published Invisible XML specification (https://invisiblexml.org/), with various (yet to be polished) implementations. Using Invisible XML you could define a grammar for your format, which the processor then uses on the fly to convert the input to XML that you can process as usual with XSLT/XPath/XQuery.

    So with a grammar of e.g.

    text: (ascii | hyphen | page | bub)*.
    -ascii: ["a"-"z"; "A"-"Z"; "0"-"9"].
    hyphen: #2011.
    page: #F0B7.
    bub: #25CF.
    

    the Invisible XML processor would convert an input of e.g. A‑B into the XML <text>A<hyphen>‑</hyphen>B</text>, which you could then process further with XSLT to create whatever specialized output you want, for instance with processing instructions.
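
    That follow-up step could look roughly like the stylesheet below. It is only a sketch and assumes that the Invisible XML result shown above is available as the input document (how you run the Invisible XML processor depends on the implementation and is not shown). It also assumes the text wrapper element should be dropped. The PI name and data are copied from the question's code.

        <xsl:stylesheet version="2.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

          <!-- Drop the wrapper element but keep processing its content;
               plain text is copied by the built-in text template -->
          <xsl:template match="text">
            <xsl:apply-templates/>
          </xsl:template>

          <!-- Each marker element becomes the matching processing instruction -->
          <xsl:template match="hyphen">
            <xsl:processing-instruction name="xpp">character symbol="bxhyphen" hex="x2011" data="E28091"</xsl:processing-instruction>
          </xsl:template>

          <xsl:template match="page">
            <xsl:processing-instruction name="xpp">character symbol="pgref" hex="xF0B7" data="EF82B7"</xsl:processing-instruction>
          </xsl:template>

          <xsl:template match="bub">
            <xsl:processing-instruction name="xpp">character symbol="bub" hex="x25CF" data="E2978F"</xsl:processing-instruction>
          </xsl:template>

        </xsl:stylesheet>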