I have a text node that contains 7-bit ASCII text as well as higher Unicode characters (e.g. x2011, xF0B7, x25CF ...).
I need to be able to (efficiently) convert these single high-Unicode characters into processing instructions,
e.g.
‑ -> <processing-instruction name="xxx">character output="hyphen"</pro...>
 -> <processing-instruction name="xxx">character output="page"</pro...>
I've tried using xsl:tokenize, which does split the text before/after the first token delimiter (e.g. x2011), but I end up with a variable containing 'text...<processing-instruction>...</processing-instruction>...text', which trips up the next xsl:token.
I managed to get the following approach to work but it looks really inelegant, and I'm sure there's a more efficient/better way to do this but I haven't found anything that works or is any better.
The first character replacement is easy using replace(), as I'm only escaping the % (the target software uses '%' for other things, so it needs to be escaped in this manner).
And yes, this would work for the x2011-to-< ... >, but the original intention was to convert to processing-instructions directly.
<xsl:template match="text()">
  <!-- escape % for the target software -->
  <xsl:variable name="SR1">
    <xsl:value-of select="fn:replace(., '%', '\\%')"/>
  </xsl:variable>
  <!-- unbreakable hyphen -->
  <xsl:variable name="SR2">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string" select="$SR1"/>
      <xsl:with-param name="delimiter">&#x2011;</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bxhyphen" hex="x2011" data="E28091"?&gt;</xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <!-- page ref -->
  <xsl:variable name="SR3">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string"><xsl:copy-of select="$SR2"/></xsl:with-param>
      <xsl:with-param name="delimiter">&#xF0B7;</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="pgref" hex="xF0B7" data="EF82B7"?&gt;</xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <!-- bullet -->
  <xsl:variable name="SR4">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="string"><xsl:copy-of select="$SR3"/></xsl:with-param>
      <xsl:with-param name="delimiter">&#x25CF;</xsl:with-param>
      <xsl:with-param name="PI"><xsl:text>&lt;?xpp character symbol="bub" hex="x25CF" data="E2978F"?&gt;</xsl:text></xsl:with-param>
    </xsl:call-template>
  </xsl:variable>
  <xsl:copy-of select="$SR4"/>
</xsl:template>
Ideally, I was aiming to have a list of 'pairs', the hex unicode and its matching processing-instruction, but any better solution would be appreciated!
Another feature would be to flag characters that have not been processed, i.e. any characters in the ranges x00-x1F and xFF+ (excluding x2011, x25CF, xF0B7).
If the characters you are looking for are known and limited, I would list them in a character class, e.g.
<xsl:template match="text()">
  <xsl:analyze-string select="." regex="[&#x2011;&#xF0B7;&#x25CF;]">
    <xsl:matching-substring>
      <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
where mf:map is a function you set up that maps each character to the string you want to output as the data of the PI. In XSLT 3 I would probably store the character-to-name mapping in an XPath/XSLT map; in XSLT 2 you can use an xsl:param or xsl:variable, e.g. <xsl:param name="characters-to-name"><map char="&#x2011;">bxhyphen</map>...</xsl:param>, and select into that, if needed even by setting up a key.
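To make the lookup idea concrete, here is a minimal XSLT 2.0 sketch of such a table with a key behind mf:map. The namespace URI bound to mf, the key name map-by-char and the variable char-class are placeholders of mine, and I have assumed that each map entry holds the complete PI data (copied from the question) rather than just the symbol name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="xs mf">

  <!-- One entry per character: @char is the character, the content is the PI data.
       The data strings are copied from the question. -->
  <xsl:param name="characters-to-name">
    <map char="&#x2011;">character symbol="bxhyphen" hex="x2011" data="E28091"</map>
    <map char="&#xF0B7;">character symbol="pgref" hex="xF0B7" data="EF82B7"</map>
    <map char="&#x25CF;">character symbol="bub" hex="x25CF" data="E2978F"</map>
  </xsl:param>

  <!-- Index the entries by character (hypothetical key name) -->
  <xsl:key name="map-by-char" match="map" use="@char"/>

  <!-- Regex character class built from the table, so the list lives in one place -->
  <xsl:variable name="char-class"
      select="concat('[', string-join($characters-to-name/map/@char, ''), ']')"/>

  <!-- Look up the PI data for a single character in the table above -->
  <xsl:function name="mf:map" as="xs:string">
    <xsl:param name="char" as="xs:string"/>
    <xsl:sequence select="string(key('map-by-char', $char, $characters-to-name))"/>
  </xsl:function>

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace each listed character by a processing instruction;
       the explicit priority avoids a conflict with the identity template -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="{$char-class}">
      <xsl:matching-substring>
        <xsl:processing-instruction name="xpp" select="mf:map(.)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this, supporting another character should only require one more map entry, since both the regex and the lookup are derived from the table. The % escaping from the question could be kept by selecting replace(., '%', '\\%') instead of . in xsl:analyze-string.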
Brand new in the world of XSLT/XPath/XQuery is the recently published Invisible XML specification (https://invisiblexml.org/), with various (yet to be polished) implementations. Using Invisible XML, you could define a grammar for your format, which the processor then uses on the fly to convert the input to XML that you can process as usual with XSLT/XPath/XQuery.
So with a grammar of e.g.
text: (ascii | hyphen | page | bub)*.
-ascii: ["a"-"z"; "A"-"Z"; "0"-"9"].
hyphen: #2011.
page: #F0B7.
bub: #25CF.
the Invisible XML processor would convert an input of e.g. A‑B into the XML <text>A<hyphen>‑</hyphen>B</text>, which you could then process further with XSLT to create whatever specialized output you want, for instance with processing instructions.
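As a rough sketch of that second step, the stylesheet below turns the hyphen, page and bub elements produced by the grammar into processing instructions. The PI target xpp and the data strings are taken from the question; everything else is copied through unchanged. This only assumes the element names given in the grammar above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy text and all other nodes unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Each element created by the grammar becomes one processing instruction -->
  <xsl:template match="hyphen">
    <xsl:processing-instruction name="xpp">character symbol="bxhyphen" hex="x2011" data="E28091"</xsl:processing-instruction>
  </xsl:template>

  <xsl:template match="page">
    <xsl:processing-instruction name="xpp">character symbol="pgref" hex="xF0B7" data="EF82B7"</xsl:processing-instruction>
  </xsl:template>

  <xsl:template match="bub">
    <xsl:processing-instruction name="xpp">character symbol="bub" hex="x25CF" data="E2978F"</xsl:processing-instruction>
  </xsl:template>

</xsl:stylesheet>
```

Applied to <text>A<hyphen>‑</hyphen>B</text>, this should give <text>A<?xpp character symbol="bxhyphen" hex="x2011" data="E28091"?>B</text>.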