regex re2

Is this kind of regex possible without negative lookahead?


Basically, the regex I'm looking to create should match every google domain except google.com and google.com.au.

So google.org, google.uk or google.com.pk would be a match. I'm working within the limitations of re2, and the best I've been able to come up with is:

google\.([^c][^o][^m]\.?[^a]?[^u]?)

This doesn't work for extended domains like google.com.pk, and it doesn't work if the root is two characters long, e.g. .cn instead of .org.

It works if there's no extended domain and the root isn't two characters long: google.org matches and google.com doesn't.

Here's the link with test cases: regexr.com/7rbkn

I'm looking for a workaround for negative lookahead, or to find out whether it's possible to accommodate this within a single regex string.


Solution

  • Sure you can. The pattern will look a bit ugly, but what you are asking for is totally possible.

    Let's assume, for ease of explanation, that the input already satisfies the regex google(?:\.[a-z]+)+ (i.e. google followed by at least one domain name). If you want more precision, see this answer.

    Match a name that is not a given name

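    The inverse of com is any name that is one or two characters long, four or more characters long, or exactly three characters long with a first character other than c, a second other than o, or a third other than m.

    Translate that to regex and we have: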

    \A                    # This means "at the very start"
    (?:
      [a-z]{1,2} |
      [a-z]{4,} |
    
      [^c.][a-z]{2} |     # Also exclude the dot,
      [a-z][^o.][a-z] |   # otherwise 'google.c.m'
      [a-z]{2}[^m.]       # would not match
    )
    \z                    # This means "at the very end"
    

    The same applies to au:

    \A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\z
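    The two name-level patterns above can be checked on their own. Here is a minimal sketch in Go, whose regexp package uses RE2 syntax; the variable names notCom and notAu are made up for illustration, and the inputs are assumed to be single lowercase names.

```go
package main

import (
	"fmt"
	"regexp"
)

// notCom and notAu are hypothetical names for the two patterns above,
// written on one line. They assume the input is a single name made of
// lowercase letters only.
var (
	notCom = regexp.MustCompile(`\A(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])\z`)
	notAu  = regexp.MustCompile(`\A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\z`)
)

func main() {
	// Each name is tested in isolation, not as part of a hostname.
	for _, name := range []string{"com", "org", "cn", "info", "cam"} {
		fmt.Printf("%-5s notCom=%v\n", name, notCom.MatchString(name))
	}
	for _, name := range []string{"au", "pk", "a", "aus"} {
		fmt.Printf("%-5s notAu=%v\n", name, notAu.MatchString(name))
	}
}
```

    Only com should be rejected by the first pattern and only au by the second; every other name in the lists should pass.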
    

    Match a hostname that is not a given hostname

    There are two cases you want to avoid: google.com and google.com.au. The inverse of that is the union of the following cases:

      • google followed by a name that is not com, then any number of further names
      • google.com followed by a name that is not au, then any number of further names
      • google followed by three or more names

    So we only need three branches. Let <not-com> stand for the inverse of com (and <not-au> for the inverse of au); here's what the pattern looks like in pseudo-regex:

    \A
    (?:
      google\.<not-com>      (?:\.[a-z]+)*    |
      google\.com\.<not-au>  (?:\.[a-z]+)*    |
      google                 (?:\.[a-z]+){3,}
    )
    \z
    

    See the common parts? We can factor them out:

    \A
    google
    (?:
      \.<not-com>        |
      \.com\.<not-au>    |
      (?:\.[a-z]+){3}
    )
    (?:\.[a-z]+)*
    \z
    

    Insert what we had from section 1, and voilà.

    The final pattern

    \A
    google
    (?:
      # google.(any name but com)
      \.
      (?:
        [a-z]{1,2} | [a-z]{4,} |
        [^c.][a-z]{2} |
        [a-z][^o.][a-z] |
        [a-z]{2}[^m.]
      )
    |
      # google.com.(any name but au)
      \.com\.
      (?:
        [a-z] | [a-z]{3,} |
        [^a.][a-z] | [a-z][^u.]
      )
    |
      # google.*.*.*
      (?:\.[a-z]+){3}
    )
    (?:\.[a-z]+)*
    \z
    

    Try it on regex101.com: PCRE2 with comments, Go, multiline mode.
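    As far as I know, RE2 and Go's regexp package have no free-spacing flag, so the commented layout above has to be collapsed before use. Here is a minimal sketch in Go under that assumption; the names otherGoogleHost and isOtherGoogleHost are made up for illustration, and the test hosts are the ones from the question plus a few extras.

```go
package main

import (
	"fmt"
	"regexp"
)

// otherGoogleHost is the final pattern with whitespace and comments removed.
// The pieces are concatenated only to keep the three branches readable.
var otherGoogleHost = regexp.MustCompile(
	`\Agoogle(?:` +
		// google.(any name but com)
		`\.(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])` +
		// google.com.(any name but au)
		`|\.com\.(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])` +
		// google followed by three names
		`|(?:\.[a-z]+){3}` +
		`)(?:\.[a-z]+)*\z`)

// isOtherGoogleHost reports whether host is a google hostname other than
// google.com and google.com.au. It assumes lowercase input.
func isOtherGoogleHost(host string) bool {
	return otherGoogleHost.MatchString(host)
}

func main() {
	hosts := []string{
		"google.com", "google.com.au",
		"google.org", "google.uk", "google.cn",
		"google.com.pk", "google.co.uk",
	}
	for _, h := range hosts {
		fmt.Printf("%-15s %v\n", h, isOtherGoogleHost(h))
	}
}
```

    Note that the third branch also accepts hosts with further names after google.com.au, which follows from the three cases listed above.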