[SOLVED] How good is Oniguruma compared to other cross-platform regexp libraries?

We are trying to get rid of boost::regex and its awful performance. According to this benchmark, Oniguruma is the best overall.

We have many regexps (which change all the time) that we apply to strings ranging from medium (100 chars) to huge (1k chars), so it's a very heterogeneous environment.

Have any of you used it with success? Do you recommend going for the more "standard" ones like PCRE or RE2?

Solution

I've done a benchmark with the following libraries:

Boost
re2
Oniguruma

The benchmark consisted of running a series of tests that make heavy use of a very heterogeneous set of regexps (with groups, without groups, long ones (484 characters), short ones, pipes, ?, *, ., etc.), applied to texts ranging from a few characters to around 8k characters.

Each time a regexp match was computed, I stored the regexp and added the time spent on that match to a milliseconds counter, so the counter accumulates the time over all the calls for that regexp.
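For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of such a per-regexp time accumulator. It is not the harness I actually used: RegexTimer and its methods are hypothetical names, and the match itself is passed in as a callable so the same wrapper can sit around a Boost, re2 or Oniguruma call.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper (not from the original benchmark): accumulates the
// wall-clock time spent matching, keyed by the regexp's source text.
class RegexTimer {
public:
    // Runs `match` (any callable that performs one regexp match and returns
    // a value), adds its elapsed time to the counter for `pattern`, and
    // returns whatever the callable returned.
    template <typename Match>
    auto timed(const std::string& pattern, Match&& match) {
        const auto start = std::chrono::steady_clock::now();
        auto result = std::forward<Match>(match)();
        const auto stop = std::chrono::steady_clock::now();
        Entry& entry = stats_[pattern];
        entry.elapsed += stop - start;
        ++entry.calls;
        return result;
    }

    // Number of timed calls recorded for `pattern` (0 if never seen).
    std::size_t calls_for(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        return it == stats_.end() ? 0 : it->second.calls;
    }

    // Accumulated time for one regexp, in milliseconds (0 if never seen).
    double ms_for(const std::string& pattern) const {
        const auto it = stats_.find(pattern);
        return it == stats_.end() ? 0.0 : to_ms(it->second.elapsed);
    }

    // Accumulated time over all regexps, in milliseconds.
    double total_ms() const {
        double total = 0.0;
        for (const auto& kv : stats_) total += to_ms(kv.second.elapsed);
        return total;
    }

private:
    struct Entry {
        std::chrono::steady_clock::duration elapsed{};
        std::size_t calls = 0;
    };

    static double to_ms(std::chrono::steady_clock::duration d) {
        return std::chrono::duration<double, std::milli>(d).count();
    }

    std::map<std::string, Entry> stats_;
};
```

Usage would look something like timer.timed(pattern, [&] { return run_match(text); }), where run_match stands for whichever library call is being measured. steady_clock is used because it is monotonic, so the accumulated durations cannot go negative if the system clock is adjusted mid-run.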

Here is the total time spent on all regexps for each library:

Boost: 98 840 ms
re2: 51 197 ms
Oniguruma: 16 095 ms
re2 (NO CAPTURE*, see below): 16 162 ms

*We (almost) always want to capture the content of groups in a regexp, and re2 performs horribly when it captures a group (see here). You don't see that much in the above result because when the group cannot be captured, it performs well. For example, on this regexp (executed many times):

^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$

Here are the results for each library:

Boost: 140 ms
re2: 5663 ms
Oniguruma: 53 ms
re2 (NO CAPTURE): 37 ms

See the drop for re2 from 5663 ms to 37 ms.
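To make the capture / no-capture distinction concrete, here is a small sketch. The post does not show how capturing was switched off for re2, so this assumes the simplest reading: the one capturing group is turned into a non-capturing (?:...) group. It uses std::regex purely as a self-contained stand-in (it was not one of the benchmarked libraries and says nothing about their speed), and extract_host / is_url_like are made-up names.

```cpp
#include <regex>
#include <string>

// The regexp from the post, with its single capturing group (group 1).
const std::string kCapturing =
    R"(^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$)";

// Assumed "no capture" variant: same regexp, outer group made non-capturing.
const std::string kNonCapturing =
    R"(^(?:(?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$)";

// Capturing use: returns the text of group 1 (optional scheme plus host),
// or an empty string when the input does not match.
std::string extract_host(const std::string& url) {
    static const std::regex re(kCapturing);
    std::smatch m;
    return std::regex_match(url, m, re) ? m[1].str() : std::string();
}

// Non-capturing use: only answers whether the input matches at all.
bool is_url_like(const std::string& url) {
    static const std::regex re(kNonCapturing);
    return std::regex_match(url, re);
}
```

Both functions accept exactly the same inputs; the only difference is whether the engine has to report where group 1 starts and ends, which is the part that was expensive for re2 in the numbers above.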

tl;dr

So my conclusion is that for our use, Oniguruma is clearly superior.

But if you don't need to capture groups, re2 is a better choice since I found that its API is easier to use.