I'm using Oxygen XML Editor 23.1. I'm working on a large corpus of texts and would like to use an XSLT transformation to add certain attributes and values to certain elements automatically. In this case, I have a @correspUnic attribute, created to hold the Ugaritic glyphs that correspond to Unicode decimal code points. The value of @correspUnic depends on the Latinized (transliterated) characters inside each element. Here's an example of the TEI encoding:
<w>bn</w>
<g>.</g>
<name>qdš</name>
<w>
<seg>ʾa</seg>
<unclear>b̊</unclear>
</w>
Expected result:
<w correspUnic='𐎁𐎐'>bn</w>
<g correspUnic='𐎟'>.</g>
<name correspUnic='𐎖𐎄𐎌'>qdš</name>
<w>
<seg correspUnic='𐎀'>ʾa</seg>
<unclear correspUnic='𐎁'>b̊</unclear>
</w>
I have tried several variants of an XSLT transformation file, but I confess that after several hours I am close to giving up. Here is the latest code, which sadly doesn't work:
<!-- Define the str-split function -->
<xsl:template name="str-split">
<xsl:param name="input" />
<xsl:param name="delimiter" select="''" />
<xsl:choose>
<xsl:when test="contains($input, $delimiter)">
<xsl:variable name="first" select="substring-before($input, $delimiter)" />
<xsl:variable name="rest" select="substring-after($input, $delimiter)" />
<char>
<xsl:value-of select="$first" />
</char>
<xsl:call-template name="str-split">
<xsl:with-param name="input" select="$rest" />
<xsl:with-param name="delimiter" select="$delimiter" />
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<char>
<xsl:value-of select="$input" />
</char>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- Define Unicode data directly in the variable -->
<xsl:variable name="unicodeData">
<data>
<row>
<latin>ʾa</latin>
<Unicode>66432</Unicode>
</row>
<row>
<latin>b</latin>
<Unicode>66433</Unicode>
</row>
<row>
<latin>g</latin>
<Unicode>66434</Unicode>
</row>
<row>
<latin>ḫ</latin>
<Unicode>66435</Unicode>
</row>
<row>
<latin>d</latin>
<Unicode>66436</Unicode>
</row>
<!-- etc -->
</data>
</xsl:variable>
<xsl:template match="/">
<!-- Display the value of the variable $unicodeData -->
<xsl:message select="$unicodeData" />
<xsl:apply-templates/>
</xsl:template>
<!-- XSLT template for adding @correspUnic to w, g, unclear, name, seg, and supplied -->
<xsl:template match="w | g | unclear | name | seg | supplied">
<!-- Copy current element -->
<xsl:copy>
<!-- Apply rules to add @correspUnic to children -->
<xsl:apply-templates select="node()" />
<!-- Check whether the current element must have @correspUnic -->
<xsl:if test="self::name or self::seg or self::supplied or self::w or self::g or self::unclear">
<!-- Recover Latinized characters from textual descendants -->
<xsl:variable name="latinized">
<xsl:for-each select="descendant::text()">
<xsl:value-of select="." />
</xsl:for-each>
</xsl:variable>
<!-- Check if Latinized characters are detected -->
<xsl:if test="normalize-space($latinized)">
<!-- Use the str-split function to split the string -->
<xsl:variable name="correspUnicode">
<xsl:call-template name="str-split">
<xsl:with-param name="input" select="$latinized" />
</xsl:call-template>
</xsl:variable>
<!-- Add @correspUnic attribute with Unicode values -->
<xsl:attribute name="correspUnic">
<xsl:for-each select="$correspUnicode/char">
<xsl:variable name="char" select="." />
<xsl:if test="normalize-space($char)">
<xsl:value-of select="concat('&amp;#', $unicodeData//row[latin = $char]/Unicode, ';')" />
</xsl:if>
</xsl:for-each>
</xsl:attribute>
</xsl:if>
</xsl:if>
</xsl:copy>
</xsl:template>
As you can see, I added xsl:message to check for any errors that would directly affect adding the attribute and its values, but it shows nothing.
Thank you very much in advance for your advice and suggestions.
Thanks to Martin, who helped me solve the problem of displaying the @correspUnic values. However, there was still a problem with the Unicode decimal values of ʾa (66432), ʾi (66459) and ʾu (66460): each of these transliterations was interpreted as two characters, although in Ugaritic each one corresponds to a single glyph. To get around the problem, I used a regex. Then I had to do some additional processing to replace &amp; with &, which wasn't very simple, given that & is de facto understood as the start of an entity. I'm not saying it is the best solution, but it works.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<!-- Define Unicode data directly in the variable -->
<xsl:param name="unicodeData">
<data>
<row>
<latin>ʾa</latin>
<Unicode>66432</Unicode>
</row>
<row>
<latin>b</latin>
<Unicode>66433</Unicode>
</row>
<row>
<latin>g</latin>
<Unicode>66434</Unicode>
</row>
<row>
<latin>ḫ</latin>
<Unicode>66435</Unicode>
</row>
<row>
<latin>d</latin>
<Unicode>66436</Unicode>
</row>
<row>
<latin>h</latin>
<Unicode>66437</Unicode>
</row>
<!-- etc -->
</data>
</xsl:param>
<xsl:key name="latin-to-unicode" match="row" use="latin"/>
<xsl:character-map name="ugaritic">
<xsl:output-character character="𐎀" string="&amp;#66432;"/>
<xsl:output-character character="𐎁" string="&amp;#66433;"/>
<xsl:output-character character="𐎂" string="&amp;#66434;"/>
<xsl:output-character character="𐎃" string="&amp;#66435;"/>
<xsl:output-character character="𐎄" string="&amp;#66436;"/>
<xsl:output-character character="𐎅" string="&amp;#66437;"/>
<!-- etc -->
</xsl:character-map>
<xsl:output method="xml" use-character-maps="ugaritic"/>
<!-- for example -->
<!-- Apply correspUnic attribute only to w elements whose text does not come from child elements unclear, seg, supplied -->
<xsl:template match="w[(not(child::unclear) and not(child::seg) and not(child::supplied)) and text() and (not(@correspUnic) or string-length(normalize-space(@correspUnic)) = 0)]">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:attribute name="correspUnic">
<xsl:apply-templates select="text()" mode="map"/>
</xsl:attribute>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" mode="map">
<xsl:analyze-string select="." regex="ʾ[aiu]">
<xsl:matching-substring>
<xsl:variable name="matchedChar" select="." />
<xsl:variable name="unicodeValue">
<xsl:choose>
<xsl:when test="$matchedChar = 'ʾa'">66432</xsl:when>
<xsl:when test="$matchedChar = 'ʾi'">66459</xsl:when>
<xsl:when test="$matchedChar = 'ʾu'">66460</xsl:when>
</xsl:choose>
</xsl:variable>
<!-- Create a Unicode string at once -->
<xsl:variable name="unicodeString" select="codepoints-to-string($unicodeValue)"/>
<!-- remove all & -->
<xsl:variable name="cleanedString" select="replace($unicodeString, '&amp;', '')"/>
<xsl:sequence select="$cleanedString"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
<xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
</xsl:for-each>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
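For comparison, here is a minimal sketch of a possible simplification, not a tested replacement for the stylesheet above. It assumes the two-character transliterations (ʾa, ʾi, ʾu) are stored in the same lookup table as the single letters, so that one regex can tokenize the text and no separate xsl:choose is needed. The names f:to-ugaritic, $sign-map and the namespace urn:example:ugaritic are hypothetical. The elements are assumed to be in no namespace, as in the example above; a document in the TEI namespace would also need xpath-default-namespace. Only code points already listed in this post are included.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:example:ugaritic"
                exclude-result-prefixes="#all"
                version="3.0">

  <!-- Copy everything that no other template handles -->
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Hypothetical lookup table: transliteration token to decimal code point.
       Only the values quoted in the post are listed; the rest is omitted. -->
  <xsl:variable name="sign-map" as="map(xs:string, xs:integer)"
                select="map {
                  'ʾa': 66432, 'ʾi': 66459, 'ʾu': 66460,
                  'b': 66433, 'g': 66434, 'ḫ': 66435,
                  'd': 66436, 'h': 66437
                }"/>

  <!-- Hypothetical helper: turn a transliterated string into Ugaritic glyphs -->
  <xsl:function name="f:to-ugaritic" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:variable name="glyphs" as="xs:string*">
      <!-- The regex tries the digraphs first, then falls back to any single
           character; NFC normalization assumes the table keys are NFC too -->
      <xsl:analyze-string select="normalize-unicode($text, 'NFC')"
                          regex="ʾ[aiu]|." flags="s">
        <xsl:matching-substring>
          <!-- Tokens without a table entry (spaces, combining marks)
               look up to the empty sequence and contribute nothing -->
          <xsl:sequence select="codepoints-to-string($sign-map(.))"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($glyphs)"/>
  </xsl:function>

  <!-- w is handled only when it has no child elements; the other
       elements are assumed to contain text only -->
  <xsl:template match="w[not(*)] | g | name | seg | unclear | supplied">
    <xsl:copy>
      <xsl:apply-templates select="@* except @correspUnic"/>
      <xsl:attribute name="correspUnic" select="f:to-ugaritic(string(.))"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With only the table entries shown, `<seg>ʾa</seg>` should receive correspUnic="𐎀" and `<unclear>b̊</unclear>` should receive correspUnic="𐎁", because the combining ring has no entry in the table and is skipped. This sketch writes the glyphs themselves, as in the expected result of the question. If numeric character references are wanted in the output instead, the character map from the stylesheet above could still be added.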