I have this regex:
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i
When I use it on some, but not all, texts, e.g. this one:
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
like so:
text.match(regex)
Ruby runs in what seems like an infinite loop. Why? And is there any way to guard against this, e.g. by having Ruby raise an exception instead, without using Timeout? Timeout has known issues when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything).
Ruby version: 2.7.2
Built-in character classes are more table-driven.
Given that, negative built-in classes like \W, \S, etc. are difficult for engines to merge into a positive character class.
In this case there appears to be a bug because, as you've said, it doesn't time out on all target strings.
In fact, [a-xzA-XZ\W] works on the sample string. It times out as soon as Y is included anywhere in the class, but only for that particular string.
Let's see if we can determine if this is a bug or not.
First, some tests:
Test - Fail [a-zA-Z\W]
https://rextester.com/FHUQG84843
# Test - Fail [a-zA-Z\W]
puts "Hello World!"
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
# This call is the one that does not return for this text.
res = text.match(regex)
puts "Done"
Test - Pass [a-xzA-XZ\W]
https://rextester.com/RPV28606
Test - Pass [a-zA-Z\P{Word}]
https://rextester.com/DAMW9069
Conclusion: Report this as a bug.
IMO this is a bug in their built-in class \W, which is engine-defined, since \P{Word} is defined by a Unicode property function, not a range, and we see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside character classes as a temporary workaround.
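A minimal sketch of that workaround, assuming the rest of the pattern stays as in the question. The constant and method names here are hypothetical, made up for illustration; only the pattern and the sample text come from the tests above.

```ruby
# Hypothetical names; the pattern is the one from the question with \W
# inside the bracket class replaced by \P{Word}.
SEAT_MEMORY_REGEX = /(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)/i

# Returns the captured text, or nil when the pattern does not match.
def seat_memory_match(text)
  match = text.match(SEAT_MEMORY_REGEX)
  match && match[1]
end

# The sample text from the question.
sample = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

puts seat_memory_match(sample).inspect
puts seat_memory_match("Sièges avant avec fonction mémoires").inspect
```

One thing to check before relying on it: as far as I know, \W in Ruby is ASCII-based while \P{Word} is Unicode-aware, so accented letters such as è are matched by \W but not by \P{Word}. The two classes are therefore not strictly equivalent, and text with accented letters between "sièges" and "avec" may need those letters added to the class explicitly.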
In reality, when modern-day engines were first designed, the logic of a negative class [^] was that each item is combined with AND NOT. When that is merged with a positive class, where each item is ORed, the result is errors in scope.
Perl still had class errors until a short time ago.