awk

A single POSIX-Extended regex to match all shortest substrings that start with “a”, end with “b” and do not contain any element of the set of strings


Given a string, is it possible to write a single POSIX-Extended regular expression to match all shortest substrings that start with a and end with b, but do not contain cb, cd or fg? I want to use such a regex with gensub, match or split functions (in gawk). For example,

  1. string "0a3cbsbtacc12bbb": matching substring is "acc12b";
  2. string "a4cdddbbb5": no matching substring;
  3. string "1taa/b///fafgfgcb2abb": matching substrings are "a/b" and "ab".

The “mark the negative matches” approach (using characters that do not occur in the whole text) is not suitable in my situation: I am only interested in a single regular expression. So, if I use, for example, split("1taa/b///fafgfgcb2abb", arr, regex, seps), I expect seps[1] == "a/b" and seps[2] == "ab".

If it is not possible to write a suitable regex (or if any suitable regex is too inefficient performance-wise), I would be interested to know why.


Solution

  • more general solution:

    Here is an awk function that takes a string s and three regex (rs,re,rx).

    Output is a list of the indices, lengths and non-overlapping sub-strings of s such that:

    awk '
    function findshort (s,rs,re,rx,          pos,emax,elen,aftere,i,start, \
                                                  q,trimmed,best,empty,ok) {
        # find, then loop over, re matches
        for (pos = 1; pos<=length(s) && match(substr(s,pos),re); pos += RSTART) {
            ++emax
            elen[emax] = RLENGTH
            aftere[emax] = (pos-1) + RSTART+RLENGTH + (!RLENGTH)
        }
        for (start = i = 1; i<=emax; ++i) {
    
            # trim right
            q = substr(s,start,(aftere[i]-start)) 
    
            # trim left
            for (ok = 0; pos = match(q,rs); q = substr(q,pos+1)) {
    
                # end correction for zero-length match
                empty = !RLENGTH && !elen[i]
    
                trimmed = substr(q,pos,(length(q)-(pos-1)-empty))
    
                if (length(trimmed)<elen[i]) break
                if (!match(trimmed,rx)) {
                    ok = 1
                    best = trimmed
                }
                if (!elen[i]) break
            }
    
            if (ok) {
                print aftere[i]-length(best)-empty, length(best), best
    
                # prune head
                start = aftere[i]  
            }
            # else extend right
        }
    }
    {
        # test scaffold
        printf "---- s:%s rs:%s re:%s rx:%s ----\n", $1,$2,$3,$4
        findshort($1,$2,$3,$4)
    }
    '
    

    Running on input:

    1taa/b///fafgfgcb2abb   a       b       cb|cd|fg
    0a3cbsbtacc12bbb        a       b       cb|cd|fg
    a4cdddbbb5              a       b       cb|cd|fg
    aaabcbbaaaaaaefgbbbb    aa      bb      bc
    abc                     abc     b       cb|cd|fg
    abcb                    abc     b       cb|cd|fg
    abcab                   abc     b       cb|cd|fg
    00ab2bc11               ab      bc      $^
    abcdef                  x*      [bc]    z
    abcdef                  abcde   d*      z
    abc                     .?      .?      .
    abc                     .{0}    .{0}    z
    abc                     .*      .*      z
    abbbabaababab           ab+     b+a     z
    abcd                    a?      b?      z
    1aaab/ab0               a..     .ab     aa|aba
    abaa                    a       ba      c
    

    gives:

    ---- s:1taa/b///fafgfgcb2abb rs:a re:b rx:cb|cd|fg ----
    4 3 a/b
    19 2 ab
    ---- s:0a3cbsbtacc12bbb rs:a re:b rx:cb|cd|fg ----
    9 6 acc12b
    ---- s:a4cdddbbb5 rs:a re:b rx:cb|cd|fg ----
    ---- s:aaabcbbaaaaaaefgbbbb rs:aa re:bb rx:bc ----
    12 7 aaefgbb
    ---- s:abc rs:abc re:b rx:cb|cd|fg ----
    ---- s:abcb rs:abc re:b rx:cb|cd|fg ----
    ---- s:abcab rs:abc re:b rx:cb|cd|fg ----
    1 5 abcab
    ---- s:00ab2bc11 rs:ab re:bc rx:$^ ----
    3 5 ab2bc
    ---- s:abcdef rs:x* re:[bc] rx:z ----
    2 1 b
    3 1 c
    ---- s:abcdef rs:abcde re:d* rx:z ----
    1 5 abcde
    ---- s:abc rs:.? re:.? rx:. ----
    ---- s:abc rs:.{0} re:.{0} rx:z ----
    1 0 
    2 0 
    3 0 
    ---- s:abc rs:.* re:.* rx:z ----
    1 3 abc
    ---- s:abbbabaababab rs:ab+ re:b+a rx:z ----
    1 5 abbba
    8 3 aba
    ---- s:abcd rs:a? re:b? rx:z ----
    1 1 a
    2 1 b
    3 0 
    4 0 
    ---- s:1aaab/ab0 rs:a.. re:.ab rx:aa|aba ----
    4 5 ab/ab
    ---- s:abaa rs:a re:ba rx:c ----
    1 3 aba
    

    Basically, the function walks along the input string and loops doing:

    The results are produced by processing the input from left to right using a specific sequence of operations. Note that if processed in some other way by a different algorithm, different results are possible. For example:

    input = abbbabaababab
    rs = ab+
    re = b+a
    rx = z
    

    could be legitimately split as any of:

    <abbba> ba <aba> bab
    abbb <aba> <aba> bab
    abbb <aba> ab <aba> b
    

    original accepted answer:

    A demo of generic approach, as requested:

    awk  '
        {
            printf "original: %s\n", $0
    
            # escape input to ensure availability of substrings for marking
            gsub(/=/,"=e")
            # now `/=[^e].*/` is guaranteed to not match 
    
            # mark negative matches
            gsub(/cb|cd|fg/,"=x&")
            printf "marked: %s\n", $0
    
            # split
            n = split($0, arr, /a([^ab=]|(=[^x]))*b/, seps)
    
            # unmark/unescape results
            gsub(/=x/,"", seps[0])
            gsub(/=e/,"", seps[0])
            for (i=1;i<=n;i++) {
                gsub(/=x/,"", arr[i])
                gsub(/=x/,"", seps[i])
                gsub(/=e/,"=", arr[i])
                gsub(/=e/,"=", seps[i])
            }
    
            printf "sep[0]: %s\n", seps[0]
            for (i=1;i<=n;i++) {
                printf "pat[%i]: %s\n", i,arr[i]
                printf "sep[%i]: %s\n", i,seps[i]
            }
    
        }
    ' <<'EOD'
    1taa/b///fafgfgcb2abb
    EOD
    

    giving:

    original: 1taa/b///fafgfgcb2abb
    marked: 1taa/b///fa=xfg=xfg=xcb2abb
    sep[0]: 
    pat[1]: 1ta
    sep[1]: a/b
    pat[2]: ///fafgfgcb2
    sep[2]: ab
    pat[3]: b
    sep[3]: 
    

    (Buggy - finds leftmost shortest, not absolute shortest)

    If a or b are strings rather than individual characters (e.g. aaaa,bbbb), they need to be marked too. Then the split would look something like:

    split($0, arr, /=xaaaa([^=]|(=[^x]))*=xbbbb/, seps)