regex xslt tokenize

XSLT tokenize with regular expression to only tokenize if the semi-colon is not followed by a space and a number


I am trying to tokenize this string to create separate entries for each bibliographic citation. The catch is that sometimes the semi-colon separates bibliographic entries and sometimes it separates page numbers. I want to write the tokenizer to only tokenize if the semi-colon is not followed by a space and a number. What I have below sort of works, but it cuts off the first letter of each citation. (I'm using XSLT 2.0.)

Input:

  <zotero>(Leppin 2019; Francisco 2011, 119; van Ginkel 2005, 43–44; 1995, 114–115; 126; 147; 166–67)</zotero>

XSLT:

<xsl:for-each select="tokenize(zotero,';\s[^\d]')">
 <bibl><xsl:value-of select="."/></bibl>
</xsl:for-each>

Current Output:

<bibl>(Leppin 2019</bibl>
<bibl>rancisco 2011, 119</bibl>
<bibl>an Ginkel 2005, 43–44; 1995, 114–115; 126; 147; 166–67)</bibl>

Solution

  • I want to write the tokenizer to only tokenize if the semi-colon is not followed by a space and a number

    With a negative lookahead that would be expressed as

      <xsl:template match="zotero">
        <xsl:for-each select="tokenize(., ';(?! [0-9])', ';j')">
          <bibl><xsl:value-of select="."/></bibl>
        </xsl:for-each>
      </xsl:template>
    

    The ;j flag works with Saxon Java, SaxonC, Saxon .NET, SaxonCS and SaxonJS. It switches from the standard XPath regular expression syntax, which has no lookahead, to the regular expression dialect supported by the platform.
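
If a processor-specific flag is not an option, the same split can probably be done with standard XPath 2.0 regular expressions alone. The original attempt loses the first letter because [^\d] is part of the separator match, so tokenize() discards that character along with the semi-colon and the space. The sketch below is one possible workaround, not the approach from the answer above: it first uses replace() with a capture group to swap "semi-colon, whitespace" for a marker while putting the captured non-digit back, and then tokenizes on the marker. The marker U+E000 (a private-use character) is an arbitrary choice and assumes that character never occurs in the data; the sketch also assumes each separator is a semi-colon followed by exactly one whitespace character.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="zotero">
    <!-- Replace ";" + whitespace + non-digit with marker + that same
         non-digit ($1), so the first letter of the citation is kept.
         U+E000 is an assumed marker that must not appear in the input. -->
    <xsl:variable name="marked"
                  select="replace(., ';\s(\D)', '&#xE000;$1')"/>

    <!-- Split on the marker only; semi-colons followed by a space
         and a digit were left untouched and stay inside the token. -->
    <xsl:for-each select="tokenize($marked, '&#xE000;')">
      <bibl><xsl:value-of select="."/></bibl>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For the input in the question this should give three bibl elements, starting with "(Leppin 2019", "Francisco 2011, 119" and "van Ginkel 2005, 43–44; 1995, ...". One difference from the lookahead version: the lookahead pattern matches only the semi-colon, so each token after the first keeps its leading space (normalize-space() can remove it), whereas the marker approach drops the space together with the semi-colon.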