Tags: regex, bash, shell, sed, capture-group

Capture a repeating regex pattern as one group with sed in a bash script


I wrote a working expression that extracts two pieces of data from valid lines of text. The first capture group is the numerical section including periods. The second is the remaining characters of the line as long as the line is valid. A line is invalid if the numerical section ends with a period or the line ends with a number.

1.1 the quick 1-1 (no match due to ending hyphen and number)
11.2 brown fox jumped (should return '11.2' and 'brown fox jumped')
1.41.1 over the lazy (should return '1.41.1' and 'over the lazy')
2.1. dog (no match due to numerical section trailing period)

The expression ^((?:[0-9]+\.)+[0-9]+) (.*)[^0-9]$ works when tested on various regex testing sites.

My issue is that I have failed to adapt this expression to work with sed in a bash script that loops over lines of text ($L).

IFS=$'\t' read -r NUM STR < <(sed 's#^\(\(?:[0-9]\+\.\)\+[0-9]\+\) \(.*)[^0-9]$#\1\t\2#p;d' <<< $L )

What does work is below where I replaced the capturing of repeating groups with repeating digits and periods. I would prefer not to do this because it could match lines starting with periods and multiple periods in a row. Also it loses the last char of the captured string but I expect I can figure that part out.

IFS=$'\t' read -r NUM STR < <(sed 's#^\([0-9\.]\+[0-9]\+\) \(.*[^0-9]\)$#\1\t\2#p;d' <<< $L )

Please help me understand what I'm doing wrong. Thank you.


Solution

  • An ERE for that would be:

    ^([0-9]+(\.[0-9]+)*) (.*[^0-9])$
    

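    with \1 and \3 being the capture groups of interest. POSIX regular expressions (both BRE and ERE) have no non-capturing (?:...) syntax, so the repeated part has to be an ordinary group, which becomes \2. The sed attempt in the question also leaves the closing parenthesis of \(.*) unescaped, so that group is never closed.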
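
    As a rough sketch of how that ERE could be plugged back into the sed + read approach from the question, assuming a sed that accepts -E (GNU and BSD sed do): the helper name parse_line is made up for illustration, and the tab is spliced in as a literal character because \t in the replacement is not portable across sed implementations.

    ```shell
    #!/bin/bash

    # Hypothetical helper: prints "NUM<TAB>STR" for a valid line, nothing otherwise.
    parse_line() {
        local tab=$'\t'
        # -n with the p flag prints only lines where the substitution succeeded.
        # \1 is the numeric section, \3 the rest; \2 is the inner repeated group.
        sed -nE 's/^([0-9]+(\.[0-9]+)*) (.*[^0-9])$/\1'"$tab"'\3/p' <<< "$1"
    }

    L='11.2 brown fox jumped'
    IFS=$'\t' read -r NUM STR < <(parse_line "$L")
    declare -p NUM STR
    ```

    For an invalid line sed prints nothing, so read returns a non-zero status and both variables end up empty.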

    But I'm not sure that using sed + read is the best approach for capturing the data in variables; you could just use bash builtins instead:

    #!/bin/bash
    
    while IFS=' ' read -r num str
    do
        [[ $num =~ ^([0-9]+(\.[0-9]+)*)$ && $str =~ [^0-9]$ ]] || continue
        declare -p num str
    done < input.txt
    

    There's a side effect with this solution though: read strips the leading and trailing spaces of the line and consumes the whole run of spaces between the first and second fields.
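
    A small sketch of that side effect, using a made-up padded line (the variable name padded is only for illustration) and comparing it with reading the line unsplit:

    ```shell
    #!/bin/bash

    # Illustrative input: two leading spaces, three spaces after the number,
    # two spaces inside the text, two trailing spaces.
    padded='  11.2   brown  fox  '

    # Splitting on spaces: the outer padding and the separator run are dropped,
    # but spaces inside the remaining text are kept.
    IFS=' ' read -r num str <<< "$padded"
    declare -p num str      # num is 11.2, str is "brown  fox"

    # An empty IFS disables splitting, so the line is kept exactly as read.
    IFS='' read -r line <<< "$padded"
    declare -p line
    ```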

    If you need those spaces then you can match the whole line instead:

    #!/bin/bash
    
    regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'
    
    while IFS='' read -r line
    do
        [[ $line =~ $regex ]] || continue
        num=${BASH_REMATCH[1]}
        str=${BASH_REMATCH[3]}
        declare -p num str
    done < input.txt
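
    To check the whole-line regex against the four sample lines from the question without needing input.txt, something like the following could be used. The samples and matched arrays and the | separator are assumptions made for this sketch only.

    ```shell
    #!/bin/bash

    regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'

    # Sample lines copied from the question (annotations removed).
    samples=(
        '1.1 the quick 1-1'
        '11.2 brown fox jumped'
        '1.41.1 over the lazy'
        '2.1. dog'
    )

    matched=()
    for line in "${samples[@]}"
    do
        [[ $line =~ $regex ]] || continue
        # Group 1 is the numeric section, group 3 the remaining text.
        matched+=("${BASH_REMATCH[1]}|${BASH_REMATCH[3]}")
    done

    printf '%s\n' "${matched[@]}"
    # Prints:
    # 11.2|brown fox jumped
    # 1.41.1|over the lazy
    ```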