We're considering moving from PCRE to PCRE2 as our internal regex engine. Only the regex syntax itself is exposed to our users, so differences in the library API are not an issue for us. However, we will have to document any change in behaviour.
Plenty of websites discuss the API differences, but I've not found any that list practical differences in the regex syntax. While I do know that [\w-_] means the same as [\w\-_] in PCRE but is invalid in PCRE2, I suspect other differences exist.
In what ways do the regexes of PCRE2 differ from those of PCRE?
I have compiled a list of changes that could cause issues when converting from PCRE to PCRE2. I have excluded the various overflows, underflows, segmentation violations, and assorted errors that a pattern could trigger in PCRE.
PCRE2 has a version-checking condition. You can check the version from within a pattern with /(?(VERSION>=10)yes|no)/ matched against "yesno".
Patterns such as /()a/ failed to set the "first character must be 'a'" information. For example /(?:(?=.)|(?<!x))a/.
Patterns such as /a\K.(?0)*/ matched against "abac" found "bac", whereas Perl and the JIT found "c". The effect of \K was not being propagated correctly. Not all uses of \K produced incorrect results.
Use of (*ACCEPT) did not unset other group captures, leaving the ovector containing incorrect information. For example /(x)|((*ACCEPT))/ matched against "abcd".
In UTF mode, a caseless pattern with a mixed-case range, similar to /(?i)[A-`]/, could leave ranges out of the class; in this case a-j was left out.
An assertion that is optimized to (*FAIL), such as (?!), was mishandled when used as a condition. For example (?(?!)a|b).
\8 and \9 now match Perl: each is either a back reference or the literal character "8" or "9".
An empty sub-pattern name such as (?'') is now reported as an error.
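The two items above (\8 and \9, and empty sub-pattern names) can be located in existing patterns before migrating. What follows is a minimal Python sketch, assuming the patterns are available as plain strings. The name flag_for_review is hypothetical. The code uses Python's re module only to scan the pattern text and never calls PCRE or PCRE2. It also does not understand \Q...\E literal runs, so treat its output as a list of places to check by hand.

```python
import re

# Scans pattern *source text*; this is not a PCRE/PCRE2 binding.
_TOKEN = re.compile(
    r"\\(?P<digit>[89])"             # \8 or \9
    r"|\\."                          # any other escape, consumed so "\\8" is not flagged
    r"|(?P<empty>\(\?(?:P?<>|''))",  # empty group name: (?<>, (?P<> or (?''
    re.DOTALL,
)

def flag_for_review(pattern):
    """Return (offset, text, note) for each construct worth re-checking."""
    findings = []
    for m in _TOKEN.finditer(pattern):
        if m.group("digit"):
            findings.append((m.start(), m.group(0), "back reference or literal digit"))
        elif m.group("empty"):
            findings.append((m.start(), m.group(0), "empty group name"))
    return findings

if __name__ == "__main__":
    for finding in flag_for_review(r"(?'')a\8"):
        print(finding)
```

Consuming every other escape in the same alternation is what keeps a literal backslash followed by a digit from being reported.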
A repeating non-capturing group with conditional groups that matched empty strings failed to be identified as matching the empty string. For example /^(?:(?(1)x|)+)+$()/.
Various breaking changes for EBCDIC environments.
PCRE2 with Unicode support enabled did not report an error when using \p and \P in a class.
Possessively repeated conditional groups that may match empty strings were incorrectly compiled. For example /(?(R))*+/.
Sequences such as [[:punct:]b] disregarded the POSIX class if a single character followed.
In UCP mode, [:punct:] matched characters in the range 128-255 that should not have matched.
Negated classes such as [^[:^ascii:]\d] and non-negated classes with [:^ascii:] or [:^xdigit:] incorrectly included all code points greater than 255.
Setting any of the (?imsxJU) options at the start of a pattern no longer transfers them to the options returned by PCRE2_INFO_ALLOPTIONS.
An empty \Q\E in the middle of a quantifier, as in A+\Q\E+, is now ignored.
An empty \Q\E sequence may appear after a callout preceding an assertion condition; it is ignored.
You may now use {0} after a group in a lookbehind assertion.
PCRE2 now matches Perl in treating (?(DEFINE)...) as a "define" group, even when a group named "DEFINE" exists.
Recursion condition tests such as (?(R2)...) must now refer to existing sub-patterns.
A conditional recursion test such as (?(R)...) misbehaved if a group name began with "R".
A hyphen immediately after a POSIX character class is handled differently from Perl: Perl allows it as a literal, but PCRE2 now generates an error.
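Patterns that rely on such a hyphen can be found by scanning the pattern text. The Python sketch below covers the POSIX-class case and, as an assumption, the [\w-_] style from the question (a hyphen after a character-type escape). The name find_risky_hyphens is hypothetical. The scanner only walks the pattern source and does not call PCRE2, and it does not handle \Q...\E or extended-mode comments. It skips a hyphen that is the last character of the class, on the assumption that this form stays valid.

```python
import re

_POSIX = re.compile(r"\[:\^?[a-z]+:\]")        # e.g. [:ascii:] or [:^digit:]
_TYPE_ESCAPE = re.compile(r"\\[dDsSwWhHvV]")   # e.g. \w or \D

def find_risky_hyphens(pattern):
    """Return offsets of hyphens that directly follow a POSIX class or a
    character-type escape inside a character class and are not the last
    character of that class."""
    hits = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":                    # escape outside a class: skip both characters
            i += 2
            continue
        if ch != "[":
            i += 1
            continue
        i += 1                            # now inside a character class
        if i < n and pattern[i] == "^":
            i += 1
        if i < n and pattern[i] == "]":   # a leading ] is a literal
            i += 1
        prev_is_class = False
        while i < n and pattern[i] != "]":
            m = _POSIX.match(pattern, i) or _TYPE_ESCAPE.match(pattern, i)
            if m:
                i = m.end()
                prev_is_class = True
                continue
            if (pattern[i] == "-" and prev_is_class
                    and i + 1 < n and pattern[i + 1] != "]"):
                hits.append(i)
            i += 2 if pattern[i] == "\\" else 1
            prev_is_class = False
        i += 1                            # step over the closing ]
    return hits

if __name__ == "__main__":
    for p in (r"[\w-_]", r"[\w\-_]", r"[[:ascii:]-z]"):
        print(p, find_risky_hyphens(p))
```

Escaping the hyphen, as in [\w\-_], or moving it to the end of the class are the usual ways to make the intent explicit.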
Patterns like (?=.*X)X$ were incorrectly optimized as if they required an initial 'X' and a following 'X'.
Assertions starting with .* were incorrectly optimized as if they had to match at the start of the subject or after a newline. That does not hold in all cases, for example (?=.*[A-Z])(?=.{8,16})(?!.*[\s]).
If the only branch in a conditional sub-pattern was anchored, the whole sub-pattern was incorrectly treated as anchored. For example /(?(1)^())b/ or /(?(?=^))b/.
A pattern starting with a subroutine call and a quantifier minimum of zero incorrectly set "match must start with this character". For example, /(?&xxx)*ABC(?<xxx>XYZ)/ would expect 'A' to be the first character.
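Since any change in behaviour has to be documented, the examples in this list can be turned into a small differential check that runs each pattern through both engines and reports where the results differ. The following is a minimal Python sketch under stated assumptions: an "engine" is any callable taking (pattern, subject) and returning the matched text or None, and the adapters that wrap PCRE and PCRE2 are assumed to exist elsewhere and are not shown. The names diff_engines, safe_run and CASES are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# An engine is any callable (pattern, subject) -> matched text or None.
# Real PCRE/PCRE2 adapters are assumed, not provided by this sketch.
Engine = Callable[[str, str], Optional[str]]

def safe_run(engine: Engine, pattern: str, subject: str) -> Tuple[str, Optional[str]]:
    """Run one engine, turning a compile or match error into a marker tuple."""
    try:
        return ("ok", engine(pattern, subject))
    except Exception as exc:
        return ("error", type(exc).__name__)

def diff_engines(cases: Iterable[Tuple[str, str]], old: Engine, new: Engine) -> List[tuple]:
    """Return (pattern, subject, old_result, new_result) for every differing case."""
    report = []
    for pattern, subject in cases:
        before = safe_run(old, pattern, subject)
        after = safe_run(new, pattern, subject)
        if before != after:
            report.append((pattern, subject, before, after))
    return report

# Pattern/subject pairs taken from the list above and from the question.
CASES = [
    (r"a\K.(?0)*", "abac"),
    (r"(x)|((*ACCEPT))", "abcd"),
    (r"[\w-_]", "a-b"),
]
```

Recording errors as results, instead of letting them propagate, means a pattern that compiles in one engine and is rejected by the other shows up as an ordinary difference.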