awk's match() function finds only the first match of a given search pattern. I find this limiting, so I'm looking for something that gives me all of the matches in a record, ideally in one call. To that end I'm using the code further down, which works reasonably well for my needs. As written, it works only with dynamic regexes (patterns passed as strings); not being able to use regex constants could become a source of trouble down the road.
Having said that, I want to know:
How can this regex limitation be made more tolerable?
Do you see any other corner cases where it could fail, and how can they be resolved?
What improvements can be made to it, or how could it be extended further?
Thank you!
## gmatch.awk - global match
## gmatch function
function gmatch(current_rec, pat, matched_parts_arr, index_matched_parts_arr,    x, long, index_match, i, matched_part)
{
    i = 1; long = x = index_match = 0; matched_part = ""
    delete matched_parts_arr; delete index_matched_parts_arr
    if ( pat == "" ) next
    while (i) {
        x = match(current_rec, pat)
        if (x == 0) break
        if (i == 1) index_match = index_match + long + x
        if (i > 1) index_match = index_match + long + x - 1
        matched_part = substr(current_rec, RSTART, RLENGTH)
        long = length(matched_part)
        matched_parts_arr[i] = matched_part
        index_matched_parts_arr[i] = index_match
        current_rec = substr(current_rec, (x + long))
        i++
    }
    return (i-1)
}
A simple use case:
(The gmatch.awk function file is placed in awk's /usr/share/awk functions directory and loaded with the -i switch; GNU Awk 5.3.0.)
echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | awk -i gmatch.awk -v pat="c+" '{ n=gmatch($0,pat,a1,a2); for(i=1;i<=n;i++) print a1[i], a2[i], n}'
ccc 3 6
c 15 6
c 28 6
c 33 6
c 39 6
c 44 6
Quick explanation
## The code repeatedly loops over the record, finding the first match, saving it in one array and its index in another array.
## After each match the current record is trimmed, and the subsequent match is done over the leftover record.
function gmatch(current_rec, pat, matched_parts_arr, index_matched_parts_arr,    x, long, index_match, i, matched_part)
{
    i = 1; long = x = index_match = 0; matched_part = ""        ## reset variables
    delete matched_parts_arr; delete index_matched_parts_arr    ## empty arrays, ready for the next record
    if ( pat == "" ) next                                       ## handle empty pattern
    while (i) {
        x = match(current_rec, pat)
        if (x == 0) break                                       ## no matches, just break out
        ## index the first match (for i==1 "index_match" and "long" are 0, they're there just for code consistency)
        if (i == 1) index_match = index_match + long + x
        ## for any further matches, recalculate the new index from the previous index and length
        if (i > 1) index_match = index_match + long + x - 1
        matched_part = substr(current_rec, RSTART, RLENGTH)     ## extract the current match
        long = length(matched_part)                             ## note its length
        matched_parts_arr[i] = matched_part                     ## save match in array matched_parts_arr
        index_matched_parts_arr[i] = index_match                ## save its index in array index_matched_parts_arr
        ## trim the record so that match() runs on the remainder on the next iteration
        current_rec = substr(current_rec, (x + long))
        i++                                                     ## increase the counter
    }
    return (i-1)                                                ## return the final total number of matches
}
[*] This is just a small project in my free time.
If you don't mind being constrained to GNU awk, you could give your function a synopsis similar to the one the existing GNU match() has when it's called with an array to populate with the matching string and any capture groups defined in the regexp, e.g.:
$ cat tst.awk
# Assumes re never matches the empty string.
function gmatch(str, re, arr,    numMatches, j, prevEndPos) {
    delete arr
    while ( match( substr(str,prevEndPos+1), re, arr[++numMatches]) ) {
        for ( j in arr[numMatches] ) {
            if ( j ~ /^[0-9]+$/ ) {
                # convert the start position from substring-relative to str-relative
                arr[numMatches][j,"start"] += prevEndPos
                arr[numMatches]["groups"]++
            }
        }
        # position in str of the last character of this match, so the next
        # search starts right after it and adjacent matches aren't skipped
        prevEndPos += (RSTART + RLENGTH - 1)
    }
    return numMatches-1
}

{
    n = gmatch($0, re, arr)
    for ( i=1; i<=n; i++ ) {
        print "\n" i, "groups =", arr[i]["groups"]
        for ( j=0; j<arr[i]["groups"]; j++ ) {
            print i, j, "start =\t" arr[i][j,"start"]
            print i, j, "length = \t" arr[i][j,"length"]
            print i, j, "value = \t" arr[i][j]
        }
        print "====="
    }
}
That populates the array arr[] to contain not just the info about all of the strings in the current record that match the whole regexp, but also the values, lengths, and start positions of every substring that matches a capture group from the regexp, e.g.:
$ echo 'abfoodbarn cdefooobarrr' | awk -v re='@/(fo+)\S+(bar*)/' -f tst.awk
1 groups = 3
1 0 start = 3
1 0 length = 7
1 0 value = foodbar
1 1 start = 3
1 1 length = 3
1 1 value = foo
1 2 start = 7
1 2 length = 3
1 2 value = bar
=====
2 groups = 3
2 0 start = 15
2 0 length = 9
2 0 value = fooobarrr
2 1 start = 15
2 1 length = 3
2 1 value = foo
2 2 start = 19
2 2 length = 5
2 2 value = barrr
=====
The index "groups" that I added, in addition to the values that GNU match() automatically populates, provides an easy way to tell how many matched strings are in each part of the array (the whole match plus one per capture group). You probably won't need it, as the person writing the regexp generally knows how many capture groups are present in that regexp.
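On the question's concern about being limited to dynamic regexes: the @/.../ form in the command above is GNU awk's strongly typed regexp constant (gawk 4.2 or later, as far as I know). Unlike a plain /.../ constant, which evaluates to the result of matching it against $0 when passed to a user-defined function, a strongly typed one can be stored in a variable or passed as an argument and is still treated as a regexp. A minimal sketch of that, where typed_regex_demo and count_matches are hypothetical names made up for this illustration:

```sh
# Requires GNU awk; typed_regex_demo and count_matches are illustrative names only.
typed_regex_demo() {
    gawk '
    # Count non-overlapping matches of re in s.
    # Assumes re never matches the empty string.
    function count_matches(s, re,    n) {
        while (match(s, re)) {
            n++
            s = substr(s, RSTART + RLENGTH)   # continue after this match
        }
        return n + 0
    }
    BEGIN {
        re = @/c+/                            # strongly typed regexp, not a string
        print typeof(re), count_matches("vvcccbc", re)
    }'
}

# Only run the demo when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    typed_regex_demo    # prints: regexp 2
fi
```

The same value can also be supplied from the command line with -v re='@/c+/', which is what the tst.awk invocation does.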
The OP's regexp doesn't contain any capture groups, so it's not as interesting, but here is its output:
$ echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | awk -v re="c+" -f tst.awk
1 groups = 1
1 0 start = 3
1 0 length = 3
1 0 value = ccc
=====
2 groups = 1
2 0 start = 15
2 0 length = 1
2 0 value = c
=====
3 groups = 1
3 0 start = 28
3 0 length = 1
3 0 value = c
=====
4 groups = 1
4 0 start = 33
4 0 length = 1
4 0 value = c
=====
5 groups = 1
5 0 start = 39
5 0 length = 1
5 0 value = c
=====
6 groups = 1
6 0 start = 44
6 0 length = 1
6 0 value = c
=====
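As for the corner cases the question asks about: a loop that keeps calling match() on the remainder of the string can spin forever when the regexp is able to match the empty string (e.g. c*), because the remainder then never shrinks. Below is a minimal sketch of one way to guard against that, not a definitive implementation. It keeps the question's two-array interface, uses only the two-argument match() so it shouldn't need GNU awk, and returns 0 for an empty pattern instead of calling next (which would abandon the whole record). The names gmatch_demo and gmatch_safe are hypothetical. It does not address anchors: since each search runs on a substring, ^ can match at the start of every remainder.

```sh
# gmatch_demo PATTERN: print "[value] start" for every match in each input line.
# gmatch_demo and gmatch_safe are hypothetical names used only for this sketch.
gmatch_demo() {
    awk -v pat="$1" '
    function gmatch_safe(str, pat, vals, starts,    n, off, s, len) {
        delete vals; delete starts
        if (pat == "") return 0            # empty pattern: report no matches
        off = 0                            # characters of str consumed so far
        while (off <= length(str) && match(substr(str, off + 1), pat)) {
            s = RSTART; len = RLENGTH      # position and length within the remainder
            vals[++n] = substr(str, off + s, len)
            starts[n] = off + s            # position within the original string
            # Resume after the match; after an empty match, step one more
            # character so that the loop always advances.
            off += s + len - 1 + (len == 0)
        }
        return n + 0
    }
    {
        n = gmatch_safe($0, pat, vals, starts)
        for (i = 1; i <= n; i++) print "[" vals[i] "]", starts[i]
    }'
}

printf '%s\n' 'acc' | gmatch_demo 'c*'
```

With c* on acc this terminates and reports, among its results, an empty match at position 1 and cc at position 2; whether an empty match is also reported at the very end of the string may depend on the awk in use.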