regex groovy

Groovy catch inner group in nested regex expression


For some context: I have a big file. The file contains some data that I want to capture, and I know that data follows a specific format. In my current case there are three formats, so I wrote three different regexes to catch them:

def pattern1 = ~/arg1:\s([\w\s\.\-\:]+)/
def pattern2 = ~/arg2\s\-\s([\w\s\.\-\:]+)/
def pattern3 = ~/arg3="([\w\s\.\-\:]+)"/

Now I want to catch those values and process the data. For my previous files I could always manage with a single regex, because the format was the same and only the argument name differed. The code was something like this:

def pattern = ~/\s*([\w\s\_]+)\s*="\s*([\w\s\.\-\:\_\/=]+)\s*"/
(data.value =~ pattern).each { match ->
    def argName= match[1]
    def value = match[2]
    switch (argName) {
      case "arg1":
        newValue=transformArg1(value)
        break  // without break, execution falls through to the next case
      case "arg2":
        newValue=transformArg2(value)
        break
      case "arg3":
        newValue=transformArg3(value)
        break
    }
}

Simple enough (note that data.value contains my file data). But now that I have multiple regexes, I can't manage to get the value I want. Note that for this particular case I don't care what the argument name is: the same processing is applied to every value, and the argument name is only useful for finding the value in the file. So my current code looks like this:

def pattern1 = ~/arg1:\s([\w\s\.\-\:]+)/
def pattern2 = ~/arg2\s\-\s([\w\s\.\-\:]+)/
def pattern3 = ~/arg3="([\w\s\.\-\:]+)"/
def globalPattern = ~/(${pattern1}|${pattern2}|${pattern3})/

(data.value =~ globalPattern).each { match ->
    def value = match[1]
    newValue = transformArg(value)
}

BUT, since globalPattern wraps everything in a group, match[1] contains the whole match, not the inner group of whichever pattern matched. For example, if my file contains the line example line containing interesting data arg1: data_to_capture, then the first pattern matches and match[1] contains the whole match, arg1: data_to_capture, instead of only the group defined in pattern1, which is data_to_capture.

I've tried match[2], but I got an out-of-range error (which makes sense to me: globalPattern contains only one group, the other being an "inner" group). I've tried match[1][1]. ChatGPT suggested match[1][0][1]. I've spent hours searching the Groovy docs on the matter and trying different syntaxes, but none worked.

I realize I could write a single pattern like this: def pattern = ~/(?:arg1|arg2|arg3)[:=\s"]+([\w\s\.\-\:]+)/ But I'm afraid it may catch too much. For instance, my files could contain a line such as example problematic line arg1="data_to_be_left_alone", and the pattern would catch it when I don't want it to. The chances are slim, but it's still a possibility, so I'd rather stick to my three precise regexes, if anyone has a solution.


Solution

  • def r1 = /a=(\d+)/
    def r2 = /b:(\d+)/
    def r3 = /c@(\d+)/
    
    def rx = ~"($r1|$r2|$r3)"
    
    data = " sdfg a=11 fge124 b:22 c@33 a=444 ---"
    
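    (data =~ rx).each {
      // each match is a list: [whole match, outer group, r1 group, r2 group, r3 group]
      // drop the first 2 items (whole match and outer group),
      // then take the first non-null value among the inner groups
      println it.drop(2).find()
    }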
    

    result:

    11
    22
    33
    444
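
To see what `drop(2).find()` is working on, it may help to look at the raw list Groovy produces for each match. The sketch below reuses the sample patterns and input from above; `rows` is a name introduced here only for illustration. Groups that belong to an alternative that did not match come back as `null`.

```groovy
// Same sample patterns and input as in the snippet above
def r1 = /a=(\d+)/
def r2 = /b:(\d+)/
def r3 = /c@(\d+)/
def rx = ~"($r1|$r2|$r3)"
def data = " sdfg a=11 fge124 b:22 c@33 a=444 ---"

// When the pattern has groups, iterating a Matcher yields one list per match:
// [whole match, outer group, r1 group, r2 group, r3 group]
def rows = (data =~ rx).collect { it }

// Print each raw list so the null placeholders are visible
rows.each { println it }

// The first match came from r1, so only index 2 is filled among the inner groups
assert rows[0] == ['a=11', 'a=11', '11', null, null]
```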
    

    Why do we need to drop 2 elements?

    The capturing groups of the global pattern and their corresponding indexes in each match:

       ..(.(.).(.).(.).)..
    0  |-----------------|    whole match - rx
    1    (-------------)      first group declared in rx
    2      (-)                second group (r1)
    3          (-)                          r2
    4              (-)                      r3
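
As a sketch of how the same idea could be applied to the three patterns from the question: here the outer group is written as a non-capturing group `(?:...)`, which is a small variation on the answer above, so only the whole match (index 0) has to be dropped. The `sample` string and the `values` name are made up for illustration, and `;` is used as a separator because it is not part of the character class `[\w\s\.\-\:]`.

```groovy
// The three patterns from the question, unchanged
def pattern1 = ~/arg1:\s([\w\s\.\-\:]+)/
def pattern2 = ~/arg2\s\-\s([\w\s\.\-\:]+)/
def pattern3 = ~/arg3="([\w\s\.\-\:]+)"/

// Non-capturing outer group, so the inner groups sit at indexes 1..3
def globalPattern = ~/(?:${pattern1}|${pattern2}|${pattern3})/

// Hypothetical input, standing in for data.value
def sample = 'x arg1: data_to_capture; arg2 - 2024-01-01; arg3="a.b:c"; end'

def values = (sample =~ globalPattern).collect { match ->
    // match[0] is the whole match; exactly one of match[1..3] is non-null
    match.drop(1).find { it != null }
}

values.each { println it }
```

Each extracted value could then be passed to `transformArg(value)` as in the question. Note that the character class contains `\s`, so a value can run across spaces and newlines until it reaches a character outside the class.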