xml xslt tei

Automatically add an attribute whose value is based on the Latinized characters inside an element


I'm using Oxygen XML Editor 23.1. I'm working on a large corpus of texts and would like to use an XSLT transformation to add certain attributes and values to certain elements automatically. In this case, I have created a @correspUnic attribute that holds the Ugaritic glyphs as Unicode decimal character references. The value of @correspUnic depends on the Latinized characters inside each element. Here's an example of the TEI encoding:

<w>bn</w>
<g>.</g>
<name>qdš</name>
<w>
  <seg>ʾa</seg>
  <unclear>b̊</unclear>
</w>

Expected result:

<w correspUnic='&#66433;&#66448;'>bn</w>
<g correspUnic='&#66463;'>.</g>
<name correspUnic='&#66454;&#66436;&#66444;'>qdš</name>
<w>
  <seg correspUnic='&#66432;'>ʾa</seg>
  <unclear correspUnic='&#66433;'>b̊</unclear>
</w>

I have tried several variants of an XSLT stylesheet, but I confess that after several hours I am close to giving up. Here is the latest version, which sadly doesn't work:

<!-- Define the str-split function -->
   <xsl:template name="str-split">
      <xsl:param name="input" />
      <xsl:param name="delimiter" select="''" />
      <xsl:choose>
         <xsl:when test="contains($input, $delimiter)">
            <xsl:variable name="first" select="substring-before($input, $delimiter)" />
            <xsl:variable name="rest" select="substring-after($input, $delimiter)" />
            <char>
               <xsl:value-of select="$first" />
            </char>
            <xsl:call-template name="str-split">
               <xsl:with-param name="input" select="$rest" />
               <xsl:with-param name="delimiter" select="$delimiter" />
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <char>
               <xsl:value-of select="$input" />
            </char>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
   
   <!-- Define Unicode data directly in the variable -->
   <xsl:variable name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
       <!-- etc -->
      </data>
   </xsl:variable>
   
   <xsl:template match="/">
      <!-- Display the value of the variable $unicodeData -->
      <xsl:message select="$unicodeData" />
      
      <xsl:apply-templates/>
   </xsl:template>

   
   <!-- XSLT template for adding @correspUnic to w, g, unclear, name, seg, and supplied -->
   <xsl:template match="w | g | unclear | name | seg | supplied">
      <!-- Copy current element -->
      <xsl:copy>
         <!-- Apply rules to add @correspUnic to children -->
         <xsl:apply-templates select="node()" />
         <!-- Check whether the current element must have @correspUnic -->
         <xsl:if test="self::name or self::seg or self::supplied or self::w or self::g or self::unclear">
            <!-- Recover Latinized characters from textual descendants -->
            <xsl:variable name="latinized">
               <xsl:for-each select="descendant::text()">
                  <xsl:value-of select="." />
               </xsl:for-each>
            </xsl:variable>
            <!-- Check if Latinized characters are detected -->
            <xsl:if test="normalize-space($latinized)">
               <!-- Use the str-split function to split the string -->
               <xsl:variable name="correspUnicode">
                  <xsl:call-template name="str-split">
                     <xsl:with-param name="input" select="$latinized" />
                  </xsl:call-template>
               </xsl:variable>
               <!-- Add @correspUnic attribute with Unicode values -->
               <xsl:attribute name="correspUnic">
                  <xsl:for-each select="$correspUnicode/char">
                     <xsl:variable name="char" select="." />
                     <xsl:if test="normalize-space($char)">
                        <xsl:value-of select="concat('&amp;#', $unicodeData//row[latin = $char]/Unicode, ';')" />
                     </xsl:if>
                  </xsl:for-each>
               </xsl:attribute>
            </xsl:if>
         </xsl:if>
      </xsl:copy>
   </xsl:template>

As you can see, I added xsl:message to surface any errors that would directly affect the attribute and its values, but it reports nothing.

Thank you very much in advance for your advice and suggestions.


Solution

  • Thanks to Martin, who helped me solve the problem of displaying the @correspUnic values. One problem remained: ʾa (66432), ʾi (66459) and ʾu (66460) were processed as two separate characters, although in Ugaritic each of them is a single glyph. To get around this, I match them with a regular expression before mapping the remaining characters one by one. I then needed extra processing so that the output contains &#66432; rather than &amp;#66432;, which wasn't very simple, given that & is always read as the start of an entity; a character map on xsl:output takes care of that. I'm not saying it is the best solution, but it works.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:xs="http://www.w3.org/2001/XMLSchema"
       exclude-result-prefixes="#all"
       version="3.0">
       
       <xsl:mode on-no-match="shallow-copy"/>
       
       <!-- Define Unicode data directly in the variable -->
       <xsl:param name="unicodeData">
          <data>
             <row>
                <latin>ʾa</latin>
                <Unicode>66432</Unicode>
             </row>
             <row>
                <latin>b</latin>
                <Unicode>66433</Unicode>
             </row>
             <row>
                <latin>g</latin>
                <Unicode>66434</Unicode>
             </row>
             <row>
                <latin>ḫ</latin>
                <Unicode>66435</Unicode>
             </row>
             <row>
                <latin>d</latin>
                <Unicode>66436</Unicode>
             </row>
             <row>
                <latin>h</latin>
                <Unicode>66437</Unicode>
             </row>
             <!-- etc -->
          </data>
       </xsl:param>
       
       <xsl:key name="latin-to-unicode" match="row" use="latin"/>
       
       <xsl:character-map name="ugaritic">
          <xsl:output-character character="&#66432;" string="&amp;#66432;"/>
          <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
          <xsl:output-character character="&#66434;" string="&amp;#66434;"/>
          <xsl:output-character character="&#66435;" string="&amp;#66435;"/>
          <xsl:output-character character="&#66436;" string="&amp;#66436;"/>
          <xsl:output-character character="&#66437;" string="&amp;#66437;"/>
          <!-- etc -->
       </xsl:character-map>
    
       <xsl:output method="xml" use-character-maps="ugaritic"/>
       <!-- for example -->
       <!-- Apply correspUnic attribute only to w elements whose text does not come from child elements unclear, seg, supplied -->
       <xsl:template match="w[(not(child::unclear) and not(child::seg) and not(child::supplied)) and text() and (not(@correspUnic) or string-length(normalize-space(@correspUnic)) = 0)]">
          <xsl:copy>
             <xsl:apply-templates select="@*"/>
             <xsl:attribute name="correspUnic">
                <xsl:apply-templates select="text()" mode="map"/>
             </xsl:attribute>
             <xsl:apply-templates/>
          </xsl:copy>
       </xsl:template>
    
       <xsl:template match="text()" mode="map">
          <xsl:analyze-string select="." regex="ʾ[aiu]">
             <xsl:matching-substring>
                <xsl:variable name="matchedChar" select="." />
                <xsl:variable name="unicodeValue">
                   <xsl:choose>
                      <xsl:when test="$matchedChar = 'ʾa'">66432</xsl:when>
                      <xsl:when test="$matchedChar = 'ʾi'">66459</xsl:when>
                      <xsl:when test="$matchedChar = 'ʾu'">66460</xsl:when>
                   </xsl:choose>
                </xsl:variable>
                <!-- Emit the glyph itself; it contains no ampersand, and the character map on xsl:output serializes it as &#NNNNN; -->
                <xsl:sequence select="codepoints-to-string(xs:integer($unicodeValue))"/>
             </xsl:matching-substring>
             <xsl:non-matching-substring>
                <xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
                   <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
                </xsl:for-each>
             </xsl:non-matching-substring>
          </xsl:analyze-string>
       </xsl:template>
       
       
    </xsl:stylesheet>
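
A possible variant, offered as a minimal sketch rather than a tested drop-in replacement: instead of hard-coding ʾa, ʾi and ʾu in an xsl:choose, the regular expression can be built from the lookup table itself, longest entries first, so that multi-character transliterations go through the same key as single characters. The names f:to-ugaritic, by-latin, $table and $token-regex, the urn:example:ugaritic namespace and the attribute-based table layout are all invented for this illustration, and only five rows are shown. Elements are assumed to be in no namespace, as in the stylesheet above. Characters without a table entry (such as the combining ring in b̊) are silently skipped.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:f="urn:example:ugaritic"
   exclude-result-prefixes="#all"
   version="3.0">

   <xsl:mode on-no-match="shallow-copy"/>

   <!-- Hypothetical lookup table; extend it with the remaining signs -->
   <xsl:variable name="table">
      <row latin="ʾa" code="66432"/>
      <row latin="ʾi" code="66459"/>
      <row latin="ʾu" code="66460"/>
      <row latin="b" code="66433"/>
      <row latin="n" code="66448"/>
   </xsl:variable>

   <xsl:key name="by-latin" match="row" use="@latin"/>

   <!-- Alternation of all table keys, longest first, with regex metacharacters escaped -->
   <xsl:variable name="token-regex" as="xs:string">
      <xsl:variable name="sorted" as="xs:string*">
         <xsl:perform-sort select="$table/row/@latin/string()">
            <xsl:sort select="string-length(.)" order="descending"/>
         </xsl:perform-sort>
      </xsl:variable>
      <xsl:sequence select="string-join($sorted ! replace(., '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1'), '|')"/>
   </xsl:variable>

   <!-- Turn a Latinized string into the corresponding Ugaritic characters -->
   <xsl:function name="f:to-ugaritic" as="xs:string">
      <xsl:param name="latin" as="xs:string"/>
      <xsl:variable name="glyphs" as="xs:string*">
         <xsl:analyze-string select="$latin" regex="{$token-regex}">
            <xsl:matching-substring>
               <xsl:sequence select="codepoints-to-string(xs:integer(key('by-latin', ., $table)/@code))"/>
            </xsl:matching-substring>
            <!-- No xsl:non-matching-substring: unmapped characters are dropped -->
         </xsl:analyze-string>
      </xsl:variable>
      <xsl:sequence select="string-join($glyphs, '')"/>
   </xsl:function>

   <!-- Serialize the glyphs as decimal character references -->
   <xsl:character-map name="ugaritic">
      <xsl:output-character character="&#66432;" string="&amp;#66432;"/>
      <xsl:output-character character="&#66459;" string="&amp;#66459;"/>
      <xsl:output-character character="&#66460;" string="&amp;#66460;"/>
      <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
      <xsl:output-character character="&#66448;" string="&amp;#66448;"/>
   </xsl:character-map>

   <xsl:output method="xml" use-character-maps="ugaritic"/>

   <!-- Leaf elements only: the attribute goes on the innermost element holding the text -->
   <xsl:template match="*[self::w or self::g or self::name or self::seg or self::unclear or self::supplied][not(*)][normalize-space()]">
      <xsl:copy>
         <xsl:apply-templates select="@*"/>
         <!-- Attributes must be created before any child nodes -->
         <xsl:attribute name="correspUnic" select="f:to-ugaritic(string(.))"/>
         <xsl:apply-templates/>
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>
```

With <w>bn</w> wrapped in a root element as input, this should serialize as <w correspUnic="&#66433;&#66448;">bn</w>. Sorting the keys by length is what lets ʾa win over any shorter key that starts the same way, so new digraphs only need a new table row and a matching character-map entry.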