regex xml xslt tei

How to use regex in XSLT to manipulate the text of an element while maintaining processing of child nodes and their attributes (using a TEI Stylesheets profile)?


I am currently working on a profile for the TEI XSLT Stylesheets (https://tei-c.org/release/doc/tei-xsl/) to customize a transformation from MS Word docx format to TEI-conformant XML (and further on to valid HTML). One specific customization I need concerns a set of texts that refer to a specific archive of video sources. In the text these references look like [box: 001 roll: 01 start: 00:01:00.00]. I want to use regex to find these references and generate a TEI-conformant tei:media element within a tei:figure element. This works well when the reference is in its own paragraph. But various authors put references inside their text paragraphs (element tei:p). This is where the challenge starts, because these paragraphs may contain other elements such as tei:note or tei:hi that should be kept intact and processed adequately. Unfortunately, the XSLT instruction xsl:analyze-string produces substrings, and because these are plain text strings you cannot use xsl:apply-templates on them, only xsl:copy-of. This works for xsl:matching-substring, but the xsl:non-matching-substring contains, as mentioned above, other elements (with attributes) that should be processed.

The TEI Stylesheets transformations are fairly complex and run in several passes. At the stage where I want to intervene with my profile, I already have a TEI p element for each paragraph, e.g.:

<p>This is my paragraph with a note <note place="foot">This is my note</note> and it is <hi rend="italic">important</hi> that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive [box: 001 roll: 01 start: 00:01:10.12] that should be transformed into a valid tei:media element.</p>

My transformation so far (simplified):

 <xsl:template match="tei:p" mode="pass2">
  <p>
  <xsl:choose>
   <!-- only paragraphs that contain a [box: ...] reference are analyzed -->
   <xsl:when test="matches(., '\[[Bb]ox:.+?\]')">
    <!-- select="." atomizes the whole p element to its string value -->
    <xsl:analyze-string select="." regex="\[box: (\d+) roll: (\d+) start: ((\d\d):(\d\d):(\d\d)\.(\d\d))\]">
     <xsl:matching-substring>
      <xsl:element name="ref">
       <xsl:attribute name="target">
        <xsl:value-of select="concat('https://path-to-video-page/',regex-group(1),'-',regex-group(2),'/',regex-group(4),'-',regex-group(5),'-',regex-group(6),'-',regex-group(7))"/>
       </xsl:attribute>
       <xsl:value-of select="concat('(box: ',regex-group(1),' roll: ',regex-group(2),' @ ',regex-group(4),'h ',regex-group(5),'m ',regex-group(6),'s)')"/>
      </xsl:element>
      
      <figure place="margin">
       <xsl:element name="head">
        <xsl:value-of select="concat('Sequence from box: ',regex-group(1),' roll: ',regex-group(2))"/>
       </xsl:element>
       <xsl:element name="media">
        <xsl:attribute name="mimeType">video/mp4</xsl:attribute>
        <xsl:attribute name="url">
         <xsl:value-of select="concat('https://path-to-video/',regex-group(1),'-',regex-group(2),'.mp4')"/>
        </xsl:attribute>
        <xsl:attribute name="start">
         <xsl:value-of select="regex-group(3)"/>
        </xsl:attribute>
       </xsl:element>
      </figure>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
      <!-- "." is a plain string here, so note, hi etc. are already lost -->
      <xsl:copy-of select="."/>
     </xsl:non-matching-substring>
    </xsl:analyze-string>
   </xsl:when>
   <xsl:otherwise>
    <xsl:apply-templates mode="pass2"/>
   </xsl:otherwise>
  </xsl:choose>
  </p>
 </xsl:template>

Results in:

<p>This is my paragraph with a note This is my note and it is important that this inline elements and their attributes are kept and further processed. This is my special reference to a video in the archive <ref target="https://path-to-video-page/001-01/00-01-10-12">(box: 001 roll: 01 @ 00h 01m 10s)</ref>
<figure rend="margin">
   <head rend="none">Sequence from box: 001 roll: 01</head>
   <media mimeType="video/mp4" url="path-to-video/001-01.mp4" start="00:01:10.12"/>
</figure> that should be transformed into a valid tei:media element.</p>

Now I am stuck. Is it possible to manipulate the matching content of the text in the p element with regex while maintaining the "node character" of the non-matching part for further processing? Or am I at a dead end and should I stop meddling with XML for this purpose? The alternative I am thinking of is to leave the references as text in the XML and to post-process the resulting XML/HTML files with a Python script. But if possible, it would be more elegant to do everything in XSLT.

Thanks for any advice. Olaf


Solution

  • The solution is quite simple: change the template match to

    xsl:template match="tei:p//text()"
    

    When applied to tei:p, xsl:analyze-string reduces the whole element to its string value, which is then parsed with the regex, so all child elements are lost. Matching only the text nodes with tei:p//text() preserves the rest of the element structure of tei:p and its parent/ancestor/sibling elements. xsl:analyze-string then operates only on the text and leaves everything else to be processed by other templates or by the default identity transformation.

    Many tutorials or examples for xsl:analyze-string apply it to the whole element because they only want to extract some information for further processing, leaving the original element behind. If you want to use xsl:analyze-string to change the text of an element that you further use as an element, then it is essential to apply it only to the text node.

    Thanks to @Martin Honnen for this advice in a comment to my question.
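
    For illustration, the template from the question could be rewritten along the following lines. This is only a minimal sketch under some assumptions: it is written as a standalone XSLT 2.0 stylesheet, the identity template in mode pass2 is a stand-in for whatever the TEI Stylesheets already do in that pass, and the default TEI namespace on the result elements is assumed. The URLs are the placeholders from the question.

    ```xml
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0"
        xmlns="http://www.tei-c.org/ns/1.0"
        exclude-result-prefixes="tei">

      <!-- Assumption: stand-in identity template for pass2, so that note, hi
           and their attributes are copied and processed further -->
      <xsl:template match="@*|node()" mode="pass2">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()" mode="pass2"/>
        </xsl:copy>
      </xsl:template>

      <!-- Only text nodes inside p are analyzed; the elements stay nodes -->
      <xsl:template match="tei:p//text()" mode="pass2">
        <xsl:analyze-string select="."
            regex="\[box: (\d+) roll: (\d+) start: ((\d\d):(\d\d):(\d\d)\.(\d\d))\]">
          <xsl:matching-substring>
            <ref target="{concat('https://path-to-video-page/', regex-group(1), '-', regex-group(2), '/', regex-group(4), '-', regex-group(5), '-', regex-group(6), '-', regex-group(7))}">
              <xsl:value-of select="concat('(box: ', regex-group(1), ' roll: ', regex-group(2), ' @ ', regex-group(4), 'h ', regex-group(5), 'm ', regex-group(6), 's)')"/>
            </ref>
            <figure place="margin">
              <head>
                <xsl:value-of select="concat('Sequence from box: ', regex-group(1), ' roll: ', regex-group(2))"/>
              </head>
              <media mimeType="video/mp4"
                     url="{concat('https://path-to-video/', regex-group(1), '-', regex-group(2), '.mp4')}"
                     start="{regex-group(3)}"/>
            </figure>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <!-- Plain text that did not match is written out unchanged -->
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:template>

    </xsl:stylesheet>
    ```

    Note that tei:p//text() also matches text nodes nested deeper inside the paragraph, for example inside a tei:note, so a reference written inside a note would be converted as well. If that is not wanted, tei:p/text() restricts the match to the direct text children of the paragraph.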