regexperltheorycontext-free-grammarlanguage-theory

The recognizing power of "modern" regexes


What class of languages do real modern regexes actually recognise?

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?

Links to relevant papers would be much appreciated.


Solution

  • Pattern Recursion

    With recursive patterns, you have a form of recursive descent matching.

    This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

    Restrictions on Recursion

    The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

    BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

    Recursion Demos

    The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

    \((?:[^()]*+|(?0))*\)
    

    I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

    \( (?: [^()] *+ | (?0) )* \)
    

    Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

    ‘ (?: [^‘’] *+ | (?0) )* ’
    

    Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

    ^((.)(?1)\2|.?)$
    

    which you can test on most systems using something like this:

    $ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words
    

    Note that PCRE’s implementation of recursion requires the more elaborate

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
    

    This is because of restrictions on how PCRE recursion works.

    Proper Parsing

    To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

    $rfc5322 = qr{
    
       (?(DEFINE)
    
         (?<address>         (?&mailbox) | (?&group))
         (?<mailbox>         (?&name_addr) | (?&addr_spec))
         (?<name_addr>       (?&display_name)? (?&angle_addr))
         (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
         (?<display_name>    (?&phrase))
         (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)
    
         (?<addr_spec>       (?&local_part) \@ (?&domain))
         (?<local_part>      (?&dot_atom) | (?&quoted_string))
         (?<domain>          (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                       \] (?&CFWS)?)
         (?<dcontent>        (?&dtext) | (?&quoted_pair))
         (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
    
         (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
         (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)
    
         (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>     \\ (?&text))
    
         (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>        (?&qtext) | (?&quoted_pair))
         (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                              (?&FWS)? (?&DQUOTE) (?&CFWS)?)
    
         (?<word>            (?&atom) | (?&quoted_string))
         (?<phrase>          (?&word)+)
    
         # Folding white space
         (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>            (?: (?&FWS)? (?&comment))*
                             (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
    
         # No whitespace control
         (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
    
         (?<ALPHA>           [A-Za-z])
         (?<DIGIT>           [0-9])
         (?<CRLF>            \x0d \x0a)
         (?<DQUOTE>          ")
         (?<WSP>             [\x20\x09])
       )
    
       (?&address)
    
    }x;
    

    As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper "Dumper";
    
    my $rfc5322 = do {
        use Regexp::Grammars;    # ...the magic is lexically scoped
        qr{
    
        # Keep the big stick handy, just in case...
        # <debug:on>
    
        # Match this...
        <address>
    
        # As defined by these...
        <token: address>         <mailbox> | <group>
        <token: mailbox>         <name_addr> | <addr_spec>
        <token: name_addr>       <display_name>? <angle_addr>
        <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
        <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
        <token: display_name>    <phrase>
        <token: mailbox_list>    <[mailbox]> ** (,)
    
        <token: addr_spec>       <local_part> \@ <domain>
        <token: local_part>      <dot_atom> | <quoted_string>
        <token: domain>          <dot_atom> | <domain_literal>
        <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
    
        <token: dcontent>        <dtext> | <quoted_pair>
        <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]
    
        <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+-/=?^_`{|}~]
        <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
        <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
        <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*
    
        <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
        <token: quoted_pair>     \\ <.text>
    
        <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
        <token: qcontent>        <.qtext> | <.quoted_pair>
        <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                                 <.FWS>? <.DQUOTE> <.CFWS>?
    
        <token: word>            <.atom> | <.quoted_string>
        <token: phrase>          <.word>+
    
        # Folding white space
        <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
        <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
        <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
        <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
        <token: CFWS>            (?: <.FWS>? <.comment>)*
                                 (?: (?:<.FWS>? <.comment>) | <.FWS>)
    
        # No whitespace control
        <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
        <token: ALPHA>           [A-Za-z]
        <token: DIGIT>           [0-9]
        <token: CRLF>            \x0d \x0a
        <token: DQUOTE>          "
        <token: WSP>             [\x20\x09]
        }x;
    };
    
    while (my $input = <>) {
        if ($input =~ $rfc5322) {
            say Dumper \%/;       # ...the parse tree of any successful match
                                  # appears in this punctuation variable
        }
    }
    

    As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.