I am trying to tokenize this string to create a separate entry for each bibliographic citation. The catch is that the semi-colon sometimes separates bibliographic entries and sometimes separates page numbers. I want the tokenizer to split only when the semi-colon is not followed by a space and a number. What I have below sort of works, but it cuts off the first letter of each citation. (I'm using XSLT 2.0.)
Input:
<zotero>(Leppin 2019; Francisco 2011, 119; van Ginkel 2005, 43–44; 1995, 114–115; 126; 147; 166–67)</zotero>
XSLT:
<xsl:for-each select="tokenize(zotero,';\s[^\d]')">
<bibl><xsl:value-of select="."/></bibl>
</xsl:for-each>
Current Output:
<bibl>(Leppin 2019</bibl>
<bibl>rancisco 2011, 119</bibl>
<bibl>an Ginkel 2005, 43–44; 1995, 114–115; 126; 147; 166–67)</bibl>
I want to write the tokenizer to only tokenize if the semi-colon is not followed by a space and a number
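Your pattern cuts off the first letter because [^\d] matches that character as part of the separator, so tokenize removes it along with the semi-colon and the space. A negative lookahead checks what follows the semi-colon without consuming it. With a negative lookahead that would be expressed as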
<xsl:template match="zotero">
<xsl:for-each select="tokenize(., ';(?! [0-9])', ';j')">
<bibl><xsl:value-of select="."/></bibl>
</xsl:for-each>
</xsl:template>
The ;j flag works with Saxon Java, SaxonC, Saxon .NET, SaxonCS and SaxonJS. It switches from the standard XPath regular expression syntax, which has no lookahead, to the regular expression syntax supported by the platform.
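Since the flag hands the pattern to the platform's regex engine, the split can be checked outside XSLT. The following is a minimal sketch, assuming Saxon on Java, where that engine is java.util.regex; the class name CitationSplit and the method splitCitations are made up for illustration and are not part of Saxon.

```java
import java.util.Arrays;
import java.util.List;

class CitationSplit {
    // Split on ';' only when it is NOT followed by a space and a digit.
    // (?! [0-9]) is a negative lookahead: it inspects the next two
    // characters but does not consume them, so no letter is lost.
    static List<String> splitCitations(String s) {
        return Arrays.asList(s.split(";(?! [0-9])"));
    }

    public static void main(String[] args) {
        // \u2013 is the en dash used in the page ranges of the question's input.
        String input = "(Leppin 2019; Francisco 2011, 119; "
                + "van Ginkel 2005, 43\u201344; 1995, 114\u2013115; 126; 147; 166\u201367)";
        for (String token : splitCitations(input)) {
            // Brackets make leading spaces visible.
            System.out.println("[" + token + "]");
        }
    }
}
```

Only the semi-colon itself is removed, so every token after the first keeps its leading space. In the stylesheet, normalize-space(.) in the xsl:value-of would trim it.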