
RegEx to exclude two strings without negative lookahead


I need to exclude any file named BUILD.bazel or WORKSPACE.bazel (case-sensitive). I cannot use a negative lookahead because Go's regexp package does not support lookarounds.

Here is my attempt without a negative lookahead:

(^([^B].*$)|(B([^U].*$|$))|(BU(^I].*$|$))|(BUI([^L].*$|$))|(BUIL([^D].*$|$))|(BUILD([^.].*$|$)))|(^([^W].*$)|(W([^O].*$|$))|(WO(^R].*$|$))|(WOR([^K].*$|$))|(WORK([^S].*$|$))|(WORKS([^P].*$|$))|(WORKSP([^A].*$|$))|(WORKSPA([^C].*$|$))|(WORKSPAC([^E].*$|$))|(WORKSPACE([^.].*$|$))).*(?:\.bazel)

The above excludes BUILD.bazel successfully but does not exclude WORKSPACE.bazel.

When I split the expression into its two halves, each works fine on its own:

(^([^B].*$)|(B([^U].*$|$))|(BU(^I].*$|$))|(BUI([^L].*$|$))|(BUIL([^D].*$|$))|(BUILD([^.].*$|$)));

(WOR([^K].*$|$))|(WORK([^S].*$|$))|(WORKS([^P].*$|$))|(WORKSP([^A].*$|$))|(WORKSPA([^C].*$|$))|(WORKSPAC([^E].*$|$))|(WORKSPACE([^.].*$|$)))

What am I doing wrong?


Solution

  • The sane approach would be to match broadly with the regex, and then reject these specific strings separately, later on in your code.
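
    A minimal Go sketch of that first approach, assuming the broad pattern is simply "any name ending in .bazel". The names bazelFile, excluded, and keep are made up for illustration and do not come from the question.

    ```go
    package main

    import (
    	"fmt"
    	"regexp"
    )

    // bazelFile is an assumed broad pattern: any name ending in ".bazel".
    var bazelFile = regexp.MustCompile(`^.*\.bazel$`)

    // excluded holds the exact, case-sensitive names to reject after matching.
    var excluded = map[string]bool{
    	"BUILD.bazel":     true,
    	"WORKSPACE.bazel": true,
    }

    // keep reports whether name matches the broad pattern and is not excluded.
    func keep(name string) bool {
    	return bazelFile.MatchString(name) && !excluded[name]
    }

    func main() {
    	names := []string{"BUILD.bazel", "WORKSPACE.bazel", "build.bazel", "defs.bazel", "main.go"}
    	for _, name := range names {
    		fmt.Printf("%-16s keep=%v\n", name, keep(name))
    	}
    }
    ```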

    If you are really hellbent on doing this with a regex, you need to basically build a tree. You already seem to understand how to do this in principle, so I will only sketch out the first few parts.

    ^([^BW].*|B(|[^U].*|U(|[^I].*|I(...)))|W(|[^O].*|O(|[^R].*|R(...))))\.bazel$

    In other words, if the string does not start with B or W, we are fine. If it starts with B but the next character is (nothing, as in we reached .bazel already, or) not U, we are fine. If it starts with BU but the next character is (nothing or) not I, we are fine. Etc. Similarly, if it starts with W but the next character is not O ...

    The above expression requires there to be at least one character before .bazel; it should be fairly obvious how to tweak that if you need to permit nothing before the extension. Also, based on your attempt, I require the file name to end with .bazel unconditionally.

    Your requirements are vague on whether WORKSPACE0.bazel should be permitted, so the final branches of the tree might require some additional thought. Do you want to permit WORKSPACE..bazel? What about WORKSPACE.bazel.bazel?

    The problem with your attempt was that the branch which excludes BUILD would happily accept WORKSPACE. The beginning of the tree needs to prevent the regex engine from escaping into the other branch to reach a match when the current branch is blocking it.