awk

AWK global match function -- how can I improve it?


When I use awk's match() function I only get the first match of a given search pattern. I find this limiting, so I'm looking for something that gives me all of the matches in a record, ideally in one call. To that end I'm using the code further down, which works reasonably well for my needs. As written, it only works with dynamic regexes (patterns passed as strings); not being able to use regex constants could be a source of trouble down the road.

Having said that, I want to know how it can be improved.

Thank you!

## gmatch.awk - global match
## gmatch function

function gmatch(current_rec,pat,matched_parts_arr,index_matched_parts_arr,    x,long,index_match,i,matched_part)
{
       { i=1; long=x=index_match=0; matched_part="";                           
         delete matched_parts_arr; delete index_matched_parts_arr;          

         if ( pat == "" ) next                  

         while(i){
                    x = match(current_rec,pat)
                    if(x==0) break 
                    if(i==1) index_match = index_match + long + x  
                    if(i>1)  index_match = index_match + long + x - 1 
                    matched_part = substr(current_rec,RSTART,RLENGTH)
                    long = length(matched_part)
                    matched_parts_arr[i] = matched_part
                    index_matched_parts_arr[i] = index_match
                    current_rec = substr(current_rec,(x+long))
                    i++
                 }
       return (i-1)
      }
}

A simple use case:

(The gmatch.awk function file is placed in awk's /usr/share/awk functions directory and loaded with the -i switch; GNU Awk 5.3.0.)

echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | awk -i gmatch.awk -v pat="c+" '{ n=gmatch($0,pat,a1,a2); for(i=1;i<=n;i++) print a1[i], a2[i], n}'
ccc 3 6
c 15 6
c 28 6
c 33 6
c 39 6
c 44 6

Quick explanation

## The code loops over the record, each time taking the first match and saving it in one array and its index in another.
## After each match the record is trimmed, and the next match is run on what is left of it.

function gmatch(current_rec, pat, matched_parts_arr, index_matched_parts_arr,      x, long, index_match, i, matched_part)
{
       { i=1; long=x=index_match=0; matched_part="";                            ## reset variables
         delete matched_parts_arr; delete index_matched_parts_arr;              ## empty arrays, ready for the next record

         if ( pat == "" ) next                  ## handle empty pattern

         while(i){
                    x = match(current_rec,pat)
                    
                    if(x==0) break                                      ## no matches, just break out
                    
                    if(i==1) index_match = index_match + long + x       ## index the first match, (for i==1 "index_match" and "long" are 0, it's there just for code consistency).
                    
                    if(i>1)  index_match = index_match + long + x - 1   ## for any further matches, recalculate its new index considering the old previous index and length. 
                    
                    matched_part = substr(current_rec,RSTART,RLENGTH)   ## extract the current match
                    
                    long = length(matched_part)                         ## note its length
                    
                    matched_parts_arr[i] = matched_part                 ## save match in array matched_parts_arr
                    
                    index_matched_parts_arr[i] = index_match            ## save its index in array index_matched_parts_arr
                    
                    current_rec = substr(current_rec,(x+long))          ## recalculate the record to have match() run again on the updated record on the next loop's iteration
                    
                    i++                                                 ## increase the counter
                 }
       return (i-1)                                                     ## return the final total number of matches
      }
}
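
The index bookkeeping above can also be written with one running offset instead of separate i==1 and i>1 branches. The following is only a minimal sketch under stated assumptions, not a drop-in replacement: the shell wrapper name gmatch_offset and the array names vals and starts are made up for illustration, and it uses POSIX awk features only. It also steps over empty matches (for example with pat="x*"), a case in which the loop above never terminates because current_rec stops shrinking, and it returns 0 for an empty pattern rather than calling next.

```shell
# gmatch_offset is a hypothetical wrapper name; usage:
#   echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | gmatch_offset 'c+'
gmatch_offset() {
    awk -v pat="$1" '
    function gmatch(rec, pat, vals, starts,    n, off) {
        split("", vals); split("", starts)   # portable way to empty both arrays
        n = off = 0                          # off = characters already trimmed from rec
        while (rec != "" && match(rec, pat)) {
            if (RLENGTH == 0) {              # empty match: skip one character and retry
                off += RSTART
                rec = substr(rec, RSTART + 1)
                continue
            }
            vals[++n]  = substr(rec, RSTART, RLENGTH)
            starts[n]  = off + RSTART        # position in the original record
            off       += RSTART + RLENGTH - 1
            rec        = substr(rec, RSTART + RLENGTH)
        }
        return n
    }
    { n = gmatch($0, pat, vals, starts)
      for (i = 1; i <= n; i++) print vals[i], starts[i] }
    '
}
```

For the sample record and pat="c+" this prints the same match/index pairs as the run above (ccc 3, c 15, c 28, c 33, c 39, c 44). Because the record is trimmed before each call, an anchored pattern such as ^c is re-tested at every new start, in this sketch as well as in the original function.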

[*] This is just a small project in my free time.


Solution

  • If you don't mind being constrained to GNU awk, you could give the function a synopsis similar to that of the existing GNU match() when it is called with an array, which it populates with the matching string and any capture groups defined in the regexp, e.g.:

    $ cat tst.awk
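    function gmatch(str, re, arr,   numMatches, j, prevEndPos) {
        delete arr
        # Match against the not-yet-scanned tail of str; GNU match() fills
        # arr[numMatches] with the match, its capture groups, and their
        # "start"/"length" values relative to that tail.
        while ( match( substr(str,prevEndPos+1), re, arr[++numMatches]) ) {
            for ( j in arr[numMatches] ) {
                if ( j ~ /^[0-9]+$/ ) {
                    # convert the tail-relative start to a position in str
                    arr[numMatches][j,"start"] += prevEndPos
                    arr[numMatches]["groups"]++
                }
            }
            # Advance to the last matched character (or past one character
            # after an empty match) so that an adjacent match is not skipped.
            prevEndPos += RSTART + (RLENGTH > 0 ? RLENGTH - 1 : 0)
        }
        return numMatches-1
    }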
    
    {
        n = gmatch($0, re, arr)
    
        for ( i=1; i<=n; i++ ) {
            print "\n" i, "groups =", arr[i]["groups"]
            for ( j=0; j<arr[i]["groups"]; j++ ) {
                print i, j, "start =\t"   arr[i][j,"start"]
                print i, j, "length = \t" arr[i][j,"length"]
                print i, j, "value = \t"  arr[i][j]
            }
            print "====="
        }
    }
    

    That populates the array arr[] to contain not just the info about all of the strings in the current record that match the whole regexp, but also the values, lengths, and start positions of every substring that matches a capture group from the regexp, e.g.:

    $ echo 'abfoodbarn cdefooobarrr' | awk -v re='@/(fo+)\S+(bar*)/' -f tst.awk
    
    1 groups = 3
    1 0 start =     3
    1 0 length =    7
    1 0 value =     foodbar
    1 1 start =     3
    1 1 length =    3
    1 1 value =     foo
    1 2 start =     7
    1 2 length =    3
    1 2 value =     bar
    =====
    
    2 groups = 3
    2 0 start =     15
    2 0 length =    9
    2 0 value =     fooobarrr
    2 1 start =     15
    2 1 length =    3
    2 1 value =     foo
    2 2 start =     19
    2 2 length =    5
    2 2 value =     barrr
    =====
    

    The index "groups" that I added in addition to the values that GNU match() automatically populates is to provide an easy way to tell how many matched strings are in every part of the array, but you probably won't need that as the person writing the regexp generally knows how many capture groups are present in that regexp.
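
    To make the layout concrete, here is a small sketch of what a single GNU match() call stores in its array and how the purely numeric keys can be counted, which is the same test the "groups" counter relies on. The wrapper name show_match_layout and the array name m are made up for illustration, the regexp is hard-coded, and gawk is assumed to be installed under that name.

    ```shell
    # show_match_layout is a hypothetical wrapper name; requires GNU awk.
    # usage: echo 'abfoodbarn' | show_match_layout
    show_match_layout() {
        gawk '{
            if (match($0, /(fo+)[^ ]+(bar*)/, m)) {
                groups = 0
                for (k in m)
                    if (k ~ /^[0-9]+$/) groups++   # numeric keys: 0 = whole match, 1..N = capture groups
                # keys such as m[0,"start"] contain SUBSEP, so they are not counted above
                print groups, m[0], m[0,"start"], m[0,"length"], m[2], m[2,"start"]
            }
        }'
    }
    ```

    For the input abfoodbarn this prints "3 foodbar 3 7 bar 7": three numeric keys, the whole match with its start and length, and the second capture group with its start.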

    The OP's regexp doesn't contain any capture groups, so it's less interesting, but here is its output:

    $ echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | awk -v re="c+" -f tst.awk
    
    1 groups = 1
    1 0 start =     3
    1 0 length =    3
    1 0 value =     ccc
    =====
    
    2 groups = 1
    2 0 start =     15
    2 0 length =    1
    2 0 value =     c
    =====
    
    3 groups = 1
    3 0 start =     28
    3 0 length =    1
    3 0 value =     c
    =====
    
    4 groups = 1
    4 0 start =     33
    4 0 length =    1
    4 0 value =     c
    =====
    
    5 groups = 1
    5 0 start =     39
    5 0 length =    1
    5 0 value =     c
    =====
    
    6 groups = 1
    6 0 start =     44
    6 0 length =    1
    6 0 value =     c
    =====
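
    When only the whole matches and their positions are needed, as in the output just shown, GNU awk's built-in patsplit() may be enough on its own: it collects every match of a pattern in one call, plus the text between the matches. The following is a sketch under that assumption rather than a replacement for the function above, since patsplit() does not report capture groups. The wrapper name gmatch_patsplit and the array names vals and seps are made up for illustration.

    ```shell
    # gmatch_patsplit is a hypothetical wrapper name; requires GNU awk.
    # usage: echo "vvcccbittehhencbjkiljnlvbjdcvjducijlvkcuvbhc" | gmatch_patsplit 'c+'
    gmatch_patsplit() {
        gawk -v re="$1" '{
            # vals[1..n] hold the matches; seps[0] is the text before vals[1],
            # seps[i] the text after vals[i]
            n = patsplit($0, vals, re, seps)
            pos = 1
            for (i = 1; i <= n; i++) {
                pos += length(seps[i-1])   # skip the unmatched text before this match
                print vals[i], pos
                pos += length(vals[i])     # move past the match itself
            }
        }'
    }
    ```

    The start positions are rebuilt by summing the lengths of the separators and matches seen so far, so the record itself is never trimmed.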