Given a string, is it possible to write a single POSIX-Extended regular expression to match all shortest substrings that start with a
and end with b
, but do not contain cb
, cd
or fg
? I want to use such a regex with gensub
, match
or split
functions (in gawk
). For example,
"0a3cbsbtacc12bbb"
: matching substring is "acc12b"
;"a4cdddbbb5"
: no matching substring;"1taa/b///fafgfgcb2abb"
: matching substrings are "a/b"
and "ab"
.The “mark the negative matches” approach (using characters that do not occur in the whole text) is not suitable in my situation: I am only interested in a single regular expression. So, if I use, for example, split("1taa/b///fafgfgcb2abb", arr, regex, seps)
, I expect seps[1] == "a/b"
and seps[2] == "ab"
.
If it is not possible to write a suitable regex (or if any suitable regex is too inefficient performance-wise), I would be interested to know why.
Here is an awk function that takes a string s
and three regex (rs
,re
,rx
).
Output is a list of the indices, lengths and non-overlapping sub-strings of s
such that:
rs
matches at start (any other matches must overlap with re
)re
matches at end (any other matches must overlap with rs
)rx
does not matchawk '
function findshort (s,rs,re,rx, pos,emax,elen,aftere,i,start, \
q,trimmed,best,empty,ok) {
# find, then loop over, re matches
for (pos = 1; pos<=length(s) && match(substr(s,pos),re); pos += RSTART) {
++emax
elen[emax] = RLENGTH
aftere[emax] = (pos-1) + RSTART+RLENGTH + (!RLENGTH)
}
for (start = i = 1; i<=emax; ++i) {
# trim right
q = substr(s,start,(aftere[i]-start))
# trim left
for (ok = 0; pos = match(q,rs); q = substr(q,pos+1)) {
# end correction for zero-length match
empty = !RLENGTH && !elen[i]
trimmed = substr(q,pos,(length(q)-(pos-1)-empty))
if (length(trimmed)<elen[i]) break
if (!match(trimmed,rx)) {
ok = 1
best = trimmed
}
if (!elen[i]) break
}
if (ok) {
print aftere[i]-length(best)-empty, length(best), best
# prune head
start = aftere[i]
}
# else extend right
}
}
{
# test scaffold
printf "---- s:%s rs:%s re:%s rx:%s ----\n", $1,$2,$3,$4
findshort($1,$2,$3,$4)
}
'
Running on input:
1taa/b///fafgfgcb2abb a b cb|cd|fg
0a3cbsbtacc12bbb a b cb|cd|fg
a4cdddbbb5 a b cb|cd|fg
aaabcbbaaaaaaefgbbbb aa bb bc
abc abc b cb|cd|fg
abcb abc b cb|cd|fg
abcab abc b cb|cd|fg
00ab2bc11 ab bc $^
abcdef x* [bc] z
abcdef abcde d* z
abc .? .? .
abc .{0} .{0} z
abc .* .* z
abbbabaababab ab+ b+a z
abcd a? b? z
1aaab/ab0 a.. .ab aa|aba
abaa a ba c
gives:
---- s:1taa/b///fafgfgcb2abb rs:a re:b rx:cb|cd|fg ----
4 3 a/b
19 2 ab
---- s:0a3cbsbtacc12bbb rs:a re:b rx:cb|cd|fg ----
9 6 acc12b
---- s:a4cdddbbb5 rs:a re:b rx:cb|cd|fg ----
---- s:aaabcbbaaaaaaefgbbbb rs:aa re:bb rx:bc ----
12 7 aaefgbb
---- s:abc rs:abc re:b rx:cb|cd|fg ----
---- s:abcb rs:abc re:b rx:cb|cd|fg ----
---- s:abcab rs:abc re:b rx:cb|cd|fg ----
1 5 abcab
---- s:00ab2bc11 rs:ab re:bc rx:$^ ----
3 5 ab2bc
---- s:abcdef rs:x* re:[bc] rx:z ----
2 1 b
3 1 c
---- s:abcdef rs:abcde re:d* rx:z ----
1 5 abcde
---- s:abc rs:.? re:.? rx:. ----
---- s:abc rs:.{0} re:.{0} rx:z ----
1 0
2 0
3 0
---- s:abc rs:.* re:.* rx:z ----
1 3 abc
---- s:abbbabaababab rs:ab+ re:b+a rx:z ----
1 5 abbba
8 3 aba
---- s:abcd rs:a? re:b? rx:z ----
1 1 a
2 1 b
3 0
4 0
---- s:1aaab/ab0 rs:a.. re:.ab rx:aa|aba ----
4 5 ab/ab
---- s:abaa rs:a re:ba rx:c ----
1 3 aba
Basically, the function walks along the input string and loops doing:
s
that ends with an re
rs
that starts at or before start of re
rx
does not match:
s
re
The results are produced by processing the input from left to right using a specific sequence of operations. Note that if processed in some other way by a different algorithm, different results are possible. For example:
input = abbbabaababab
rs = ab+
re = b+a
rx = z
could be legitimately split as any of:
<abbba> ba <aba> bab
abbb <aba> <aba> bab
abbb <aba> ab <aba> b
A demo of generic approach, as requested:
awk '
{
printf "original: %s\n", $0
# escape input to ensure availability of substrings for marking
gsub(/=/,"=e")
# now `/=[^e].*/` is guaranteed to not match
# mark negative matches
gsub(/cb|cd|fg/,"=x&")
printf "marked: %s\n", $0
# split
n = split($0, arr, /a([^ab=]|(=[^x]))*b/, seps)
# unmark/unescape results
gsub(/=x/,"", seps[0])
gsub(/=e/,"", seps[0])
for (i=1;i<=n;i++) {
gsub(/=x/,"", arr[i])
gsub(/=x/,"", seps[i])
gsub(/=e/,"=", arr[i])
gsub(/=e/,"=", seps[i])
}
printf "sep[0]: %s\n", seps[0]
for (i=1;i<=n;i++) {
printf "pat[%i]: %s\n", i,arr[i]
printf "sep[%i]: %s\n", i,seps[i]
}
}
' <<'EOD'
1taa/b///fafgfgcb2abb
EOD
giving:
original: 1taa/b///fafgfgcb2abb
marked: 1taa/b///fa=xfg=xfg=xcb2abb
sep[0]:
pat[1]: 1ta
sep[1]: a/b
pat[2]: ///fafgfgcb2
sep[2]: ab
pat[3]: b
sep[3]:
(Buggy - finds leftmost shortest, not absolute shortest)
If a
or b
are strings rather than individual characters (e.g. aaaa
,bbbb
),
they need to be marked too. Then the split would look something like:
split($0, arr, /=xaaaa([^=]|(=[^x]))*=xbbbb/, seps)