Basically, the regex I'm looking to create is something that would match every Google domain except google.com and google.com.au. So google.org, google.uk or google.com.pk would be a match. I'm working within the limitations of RE2, and the best I've been able to come up with is:
google\.([^c][^o][^m]\.?[^a]?[^u]?)
This doesn't work for extended domains like google.com.pk, and it doesn't work if the root is two letters long, e.g. .cn instead of .org. It works if there's no extended domain and the root isn't two letters long: google.org matches and google.com doesn't match.
Here's the link with test cases: regexr.com/7rbkn
I'm looking for a workaround for negative lookahead, or to find out whether it's possible to accommodate this within a single regex string.
Sure you can. The pattern will look a bit ugly, but what you are asking for is totally possible.
Let's assume that the input already satisfies the regex google(?:\.[a-z]+)+ (i.e. google followed by at least one domain name) for ease of explanation. If you want more precision, see this answer.
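The inverse of com would be any name that is:
not exactly three characters long, or
three characters long but the first is not c, or
the second is not o, or
the third is not m.
Translate that to regex and we have: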
\A # This means "at the very start"
(?:
[a-z]{1,2} |
[a-z]{4,} |
[^c.][a-z]{2} | # Also exclude the dot,
[a-z][^o.][a-z] | # otherwise 'google.c.m'
[a-z]{2}[^m.] # would not match
)
\z # This means "at the very end"
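As a quick sanity check, here is a minimal sketch of the five branches above run against a handful of names. It uses Python's re module purely as a test harness, with re.fullmatch standing in for \A ... \z; the helper name is_not_com is hypothetical. The pattern uses no lookarounds, so it should carry over to RE2, though that is not verified here.

```python
import re

# The five branches from the pattern above, written on one line.
# re.fullmatch plays the role of the \A ... \z anchors.
NOT_COM = re.compile(
    r"[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.]"
)


def is_not_com(name: str) -> bool:
    """Return True when `name` is a single name other than 'com'."""
    return NOT_COM.fullmatch(name) is not None


for name in ["com", "org", "cn", "info", "cam", "c.m"]:
    print(name, is_not_com(name))
# com and c.m are rejected; org, cn, info and cam are accepted.
```

Note that c.m is rejected only because the dot is excluded from the negated character classes.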
The same applies to au:
\A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\z
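Because the name space is small, the au pattern can be checked exhaustively. The sketch below (Python used only as a harness, re.fullmatch in place of \A ... \z) enumerates every lowercase name of one to three letters and collects the ones the pattern rejects; the variable names are illustrative.

```python
import re
from itertools import product
from string import ascii_lowercase

NOT_AU = re.compile(r"[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.]")

# Every lowercase name of length 1 to 3 that the pattern does NOT accept.
rejected = [
    "".join(chars)
    for length in range(1, 4)
    for chars in product(ascii_lowercase, repeat=length)
    if NOT_AU.fullmatch("".join(chars)) is None
]

print(rejected)  # ['au']
```

Longer names are always accepted by the [a-z]{3,} branch, so checking up to three letters covers the interesting cases.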
There are two cases you want to avoid: google.com and google.com.au. The inverse of that would be the union of the following cases:
google.* where * is any name but com
google.*.* where the first * is any name but com, or google.com.* where * is any name but au
google.*.*.* ...
Or, a bit more logical:
If the first name is not com, it doesn't matter how many names are left.
If the first name is com and the second name is not au, the rest of the names are also irrelevant.
If the first and second names are com and au correspondingly, then there must be at least one other name, which means there are at least three extra names.
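The three rules can also be written out procedurally, which gives a reference to compare the regex against. This is only a sketch: the function name should_match is hypothetical, and it assumes the input already has the shape google followed by dot-separated names, without validating the names themselves.

```python
def should_match(domain: str) -> bool:
    """Apply the three rules to a domain such as 'google.com.pk'."""
    first, *names = domain.split(".")
    if first != "google" or not names:
        return False
    if names[0] != "com":
        return True  # rule 1: first name is not com
    if len(names) >= 2 and names[1] != "au":
        return True  # rule 2: com followed by something other than au
    return len(names) >= 3  # rule 3: com.au needs at least one more name


for domain in ["google.org", "google.com", "google.com.au", "google.com.pk"]:
    print(domain, should_match(domain))
```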
That said, we only need three branches. Let <not-com> and <not-au> stand for the inverses of com and au worked out above; here's what the pattern looks like in pseudo-regex:
\A (?: google\.<not-com>(?:\.[a-z]+)* | google\.com\.<not-au>(?:\.[a-z]+)* | google(?:\.[a-z]+){3,} ) \z
See the common parts? We can extract them out:
\A google (?: \.<not-com> | \.com\.<not-au> | (?:\.[a-z]+){3} ) (?:\.[a-z]+)* \z
Insert what we had earlier, and voilà:
\A
google
(?:
  # google.<not-com>
  \.
  (?:
    [a-z]{1,2} |
    [a-z]{4,} |
    [^c.][a-z]{2} |
    [a-z][^o.][a-z] |
    [a-z]{2}[^m.]
  )
|
  # google.com.<not-au>
  \.com\.
  (?:
    [a-z] |
    [a-z]{3,} |
    [^a.][a-z] |
    [a-z][^u.]
  )
|
  # google.*.*.*
  (?:\.[a-z]+){3}
)
(?:\.[a-z]+)*
\z
Try it on regex101.com: PCRE2 with comments, Go, multiline mode.
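For a local check, here is a minimal sketch that assembles the same pattern from its two inverted parts and runs it against the examples from the question. Python's re module is only a convenient harness here, and re.fullmatch replaces the \A ... \z anchors; the constant names are illustrative. Since the pattern avoids lookarounds and backreferences it should be usable in RE2 as well, but that is an assumption rather than something tested below.

```python
import re

NOT_COM = r"(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])"
NOT_AU = r"(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])"

# Three branches after "google", then any number of further names.
PATTERN = re.compile(
    rf"google(?:\.{NOT_COM}|\.com\.{NOT_AU}|(?:\.[a-z]+){{3}})(?:\.[a-z]+)*"
)

cases = [
    "google.org", "google.uk", "google.cn", "google.com.pk",
    "google.com.au.x", "google.com", "google.com.au",
]
for domain in cases:
    print(domain, PATTERN.fullmatch(domain) is not None)
# Only google.com and google.com.au are rejected.
```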