I am writing a Porter stemmer in XQuery and, as the first step, I need to match consonant and vowel patterns. The consonant-matching sequence from the Perl example I'm using as a basis is (?:[^aiueoy]|(?:(?<=[aiueo])y)|\by), and the vowel sequence is (?:[aiueo]|(?:(?<![aiueo])y)). I need to expand that to also include the letter aesc (æ), and so this is what I have for my XQuery regex:
let $v := element {"vowels"} {matches($f,"(?:([^aiueoy])|(?:(?:[aiueo]\1)y))")}
let $c := element {"consonants"} {matches($f,"(?:([aiueo])|(?:(?<![aiueo]\1)y))")}
A sample of the type of XML I am working with is as follows:
<entry ref="173">
<headword>abǒve</headword>
<headword>abǒven</headword>
<variant>abufe</variant>
<variant>abufen</variant>
<variant>abuue</variant>
<variant>abuuen</variant>
<variant>abowve</variant>
<variant>obove</variant>
<variant>oboven</variant>
<variant>obufe</variant>
<variant>obufen</variant>
<variant>abof</variant>
<variant>obof</variant>
<variant>aboyf</variant>
<variant>aboun</variant>
<variant>aboune</variant>
<variant>abown</variant>
<variant>abowne</variant>
<variant>aboon</variant>
<variant>oboun</variant>
<variant>oboune</variant>
<variant>abow</variant>
<variant>aboʒe</variant>
<part_of_speech> adv. </part_of_speech>
</entry>
Running this in Saxon, however, I get the following error: Query failed with dynamic error: Syntax error at char 17 in regular expression: No expression before quantifier
I'm pretty sure my issue is that I'm not building the positive lookbehind properly, having changed it from <= to \1, but I'm not sure how I would build that aspect in a way that works with XQuery. Any suggestions would be much appreciated.
Regular expression support in XQuery 3.1 is described at https://www.w3.org/TR/xpath-functions-31/#regex-syntax, which notes that XPath and XQuery add several features to the regular expression syntax defined in the XML Schema Datatypes specification at https://www.w3.org/TR/xmlschema-2/#regexs. Unfortunately, lookbehind is not among them, so it is not part of the specification.
However, since you note that you're using Saxon: Saxon has an extension that lets you use native Java regex if you supply the j flag, as documented at https://www.saxonica.com/html/documentation/functions/fn/matches.html. This should give you access to Java's support for positive lookbehind expressions.
(This j flag is becoming a sort of extension convention among other XQuery implementations. BaseX follows Saxon, as noted at http://docs.basex.org/wiki/XQuery_Extensions#Regular_Expressions. eXist will likely adopt this convention too: https://github.com/eXist-db/exist/issues/846.)
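Since the j flag hands the pattern to Java's regex engine, it may help to check the original Perl patterns directly against java.util.regex before wiring them into a query. The following is only a minimal sketch: the class name PorterPatterns and the helpers matchesAt and shape are hypothetical names invented for illustration, not part of Saxon or of any stemmer library. It classifies each character of a word as consonant (c) or vowel (v). The consonant pattern is tried first, which is my own choice, because both patterns match a word-initial y.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PorterPatterns {
    // Patterns copied unchanged from the Perl example quoted in the question.
    static final Pattern CONSONANT =
        Pattern.compile("(?:[^aiueoy]|(?:(?<=[aiueo])y)|\\by)");
    static final Pattern VOWEL =
        Pattern.compile("(?:[aiueo]|(?:(?<![aiueo])y))");

    // Hypothetical helper: true if the pattern matches starting exactly at index i.
    // find(i) searches the whole input, so lookbehind can see earlier characters.
    static boolean matchesAt(Pattern p, String word, int i) {
        Matcher m = p.matcher(word);
        return m.find(i) && m.start() == i;
    }

    // Hypothetical helper: map each character to 'c' (consonant) or 'v' (vowel).
    // Assumption: the consonant pattern is checked first, so a word-initial y is 'c'.
    static String shape(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (matchesAt(CONSONANT, word, i)) {
                sb.append('c');
            } else if (matchesAt(VOWEL, word, i)) {
                sb.append('v');
            } else {
                sb.append('?'); // neither pattern matched at this position
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shape("aboyf"));  // vcvcc (y after a vowel is a consonant)
        System.out.println(shape("abufen")); // vcvcvc
    }
}
```

If the patterns behave as expected here, the same pattern strings should be usable in matches() once Java regex is enabled through the j flag, though the exact way the flag is passed is best confirmed against the Saxon documentation linked above.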