A single POSIX-Extended regex to match all shortest substrings that start with “a”, end with “b” and do not contain any element of the set of strings

Given a string, is it possible to write a single POSIX-Extended regular expression to match all shortest substrings that start with a and end with b, but do not contain cb, cd or fg? I want to use such a regex with gensub, match or split functions (in gawk). For example,

string "0a3cbsbtacc12bbb": matching substring is "acc12b";
string "a4cdddbbb5": no matching substring;
string "1taa/b///fafgfgcb2abb": matching substrings are "a/b" and "ab".

The “mark the negative matches” approach (using characters that do not occur in the whole text) is not suitable in my situation: I am only interested in a single regular expression. So, if I use, for example, split("1taa/b///fafgfgcb2abb", arr, regex, seps), I expect seps[1] == "a/b" and seps[2] == "ab".

If it is not possible to write a suitable regex (or if any suitable regex is too inefficient performance-wise), I would be interested to know why.

Solution

more general solution:

Here is an awk function that takes a string s and three regex (rs,re,rx).

Output is a list of the indices, lengths and non-overlapping sub-strings of s such that:

rs matches at start (any other matches must overlap with re)
re matches at end (any other matches must overlap with rs)
rx does not match

awk '
function findshort (s,rs,re,rx,          pos,emax,elen,aftere,i,start, \
                                              q,trimmed,best,empty,ok) {
    # find, then loop over, re matches
    for (pos = 1; pos<=length(s) && match(substr(s,pos),re); pos += RSTART) {
        ++emax
        elen[emax] = RLENGTH
        aftere[emax] = (pos-1) + RSTART+RLENGTH + (!RLENGTH)
    }
    for (start = i = 1; i<=emax; ++i) {

        # trim right
        q = substr(s,start,(aftere[i]-start)) 

        # trim left
        for (ok = 0; pos = match(q,rs); q = substr(q,pos+1)) {

            # end correction for zero-length match
            empty = !RLENGTH && !elen[i]

            trimmed = substr(q,pos,(length(q)-(pos-1)-empty))

            if (length(trimmed)<elen[i]) break
            if (!match(trimmed,rx)) {
                ok = 1
                best = trimmed
            }
            if (!elen[i]) break
        }

        if (ok) {
            print aftere[i]-length(best)-empty, length(best), best

            # prune head
            start = aftere[i]  
        }
        # else extend right
    }
}
{
    # test scaffold
    printf "---- s:%s rs:%s re:%s rx:%s ----\n", $1,$2,$3,$4
    findshort($1,$2,$3,$4)
}
'

Running on input:

1taa/b///fafgfgcb2abb   a       b       cb|cd|fg
0a3cbsbtacc12bbb        a       b       cb|cd|fg
a4cdddbbb5              a       b       cb|cd|fg
aaabcbbaaaaaaefgbbbb    aa      bb      bc
abc                     abc     b       cb|cd|fg
abcb                    abc     b       cb|cd|fg
abcab                   abc     b       cb|cd|fg
00ab2bc11               ab      bc      $^
abcdef                  x*      [bc]    z
abcdef                  abcde   d*      z
abc                     .?      .?      .
abc                     .{0}    .{0}    z
abc                     .*      .*      z
abbbabaababab           ab+     b+a     z
abcd                    a?      b?      z
1aaab/ab0               a..     .ab     aa|aba
abaa                    a       ba      c

gives:

---- s:1taa/b///fafgfgcb2abb rs:a re:b rx:cb|cd|fg ----
4 3 a/b
19 2 ab
---- s:0a3cbsbtacc12bbb rs:a re:b rx:cb|cd|fg ----
9 6 acc12b
---- s:a4cdddbbb5 rs:a re:b rx:cb|cd|fg ----
---- s:aaabcbbaaaaaaefgbbbb rs:aa re:bb rx:bc ----
12 7 aaefgbb
---- s:abc rs:abc re:b rx:cb|cd|fg ----
---- s:abcb rs:abc re:b rx:cb|cd|fg ----
---- s:abcab rs:abc re:b rx:cb|cd|fg ----
1 5 abcab
---- s:00ab2bc11 rs:ab re:bc rx:$^ ----
3 5 ab2bc
---- s:abcdef rs:x* re:[bc] rx:z ----
2 1 b
3 1 c
---- s:abcdef rs:abcde re:d* rx:z ----
1 5 abcde
---- s:abc rs:.? re:.? rx:. ----
---- s:abc rs:.{0} re:.{0} rx:z ----
1 0 
2 0 
3 0 
---- s:abc rs:.* re:.* rx:z ----
1 3 abc
---- s:abbbabaababab rs:ab+ re:b+a rx:z ----
1 5 abbba
8 3 aba
---- s:abcd rs:a? re:b? rx:z ----
1 1 a
2 1 b
3 0 
4 0 
---- s:1aaab/ab0 rs:a.. re:.ab rx:aa|aba ----
4 5 ab/ab
---- s:abaa rs:a re:ba rx:c ----
1 3 aba

Basically, the function walks along the input string and loops doing:

extract leftmost shortest head of s that ends with an re
trim LHS to rightmost rs that starts at or before start of re
if rx does not match:
- then:
  - print the trimmed string
  - delete head of s
- else:
  - extend head right to end of next match for re

The results are produced by processing the input from left to right using a specific sequence of operations. Note that if processed in some other way by a different algorithm, different results are possible. For example:

input = abbbabaababab
rs = ab+
re = b+a
rx = z

could be legitimately split as any of:

<abbba> ba <aba> bab
abbb <aba> <aba> bab
abbb <aba> ab <aba> b

original accepted answer:

A demo of generic approach, as requested:

awk  '
    {
        printf "original: %s\n", $0

        # escape input to ensure availability of substrings for marking
        gsub(/=/,"=e")
        # now `/=[^e].*/` is guaranteed to not match 

        # mark negative matches
        gsub(/cb|cd|fg/,"=x&")
        printf "marked: %s\n", $0

        # split
        n = split($0, arr, /a([^ab=]|(=[^x]))*b/, seps)

        # unmark/unescape results
        gsub(/=x/,"", seps[0])
        gsub(/=e/,"", seps[0])
        for (i=1;i<=n;i++) {
            gsub(/=x/,"", arr[i])
            gsub(/=x/,"", seps[i])
            gsub(/=e/,"=", arr[i])
            gsub(/=e/,"=", seps[i])
        }

        printf "sep[0]: %s\n", seps[0]
        for (i=1;i<=n;i++) {
            printf "pat[%i]: %s\n", i,arr[i]
            printf "sep[%i]: %s\n", i,seps[i]
        }

    }
' <<'EOD'
1taa/b///fafgfgcb2abb
EOD

giving:

original: 1taa/b///fafgfgcb2abb
marked: 1taa/b///fa=xfg=xfg=xcb2abb
sep[0]: 
pat[1]: 1ta
sep[1]: a/b
pat[2]: ///fafgfgcb2
sep[2]: ab
pat[3]: b
sep[3]:

~~(Buggy - finds leftmost shortest, not absolute shortest)~~

If a or b are strings rather than individual characters (e.g. aaaa,bbbb), they need to be marked too. Then the split would look something like:

split($0, arr, /=xaaaa([^=]|(=[^x]))*=xbbbb/, seps)