regexcomplexity-theory

Regular expressions with O(N) and backreferences support


As you may know there are two different kinds of regular expressions implementations: one uses backtracking (pcre) and the other one uses finite automata (re2).

Both of those algorithms have their limitations: in specific cases pcre can take exponential time to find a match and finite automata does not support backreferences.

Pcre implementation supports backreferences, very inefficient in matching expressions like /a?a?a?a?aaaa/ against aaaa, the more a's expression and input have - the longer it will take and with 30+ of them it will take a lot if time.

Version with finite automata handles all those implementations nicely and have O(N) complexity from input, but does not supports backreferences:

pcre time against complex expressions - https://i.sstatic.net/D4gkC.png NFA handles those, but does not supports backreferences - https://i.sstatic.net/t2EwI.png

Some information on backreferences support:

RE2 - http://code.google.com/p/re2/

The one significant exception is that RE2 drops support for backreferences and generalized zero–width assertions, because they cannot be implemented efficiently.

Thompson NFA - http://swtch.com/~rsc/regexp/regexp1.html

As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP–complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)

So I created my own version which both supports backreferences and has O(N) complexity. It written in haskell and about 600 (~200 of them are blank and ~200 - type declarations, which can be skipped) lines long. It chews through /a?a?aa/ against aa (with 100 of a) in about 10 seconds and as far as I know it is the only version which can match

/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/

against

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

in sane (about 10 seconds) time. It of course supports all other features listed in basic regex specification which I found somewhere on the Internets.

The question is: is it really a "major news to computer scientists" and what should I do if it is?

PS: I will show sources in about a week - I still want to run some tests with profiler and replace several internal data structures.


Solution

  • I believe you are confused. All regular expressions can be represented by a discrete finite automata (DFA) and (because of such) be solved in O(n) time. Perl Regular Expressions (PREG) (and the regex libraries provided by many languages) match a language that is larger than regular expressions, ie: regular expressions exist in PREG.

    If you want to look more of this up search for regular languages. Every regular language can be represented by a regular expression (hence the similar names), and every regular expression represents a regular language. PREG can represent things that are not a regular language.

    Further, no one likes someone who says "I can do this and it is amazing, but I won't explain how". That alone is reason enough not to believe you (ignoring that you misunderstand what a regular expression is).