regex grep textwrangler

Substituting multiple occurrences of a character inside a grep match


I am trying to use TextWrangler to take a bunch of text files, match everything within some angle-bracket tags (so far so good), and for every match, substitute all occurrences of a specific character with another.

For instance, I'd like to take something like

xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz

match everything within <f> and </f> and then substitute all +'s with, say, *'s (but ONLY inside the "f" tag).

xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz

I think I can easily match "f" tags containing +'s with an expression like

<f>[^<]*\+[^<]*</f>

but I have no idea how to substitute only one particular character within each match. I don't know a priori how many +'s there are in each tag. I think I need to run a second regular expression over each match of the first one, but I am not really sure how to do that.

(In other words, I would like to match all +'s but only inside specific angle-bracket tags).

Does anyone have a hint?

Thanks a lot, Daniele


Solution

  • In case you're OK with an awk solution:

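    $ awk '{
        # Repeat while some <f>...</f> on this line still contains a "+"
        while ( match($0,/<f>[^<]*\+[^<]*<\/f>/) ) {
            tgt = substr($0,RSTART,RLENGTH)   # the matched tag, markers included
            gsub(/\+/,"*",tgt)                # replace every "+" inside it
            # Rebuild the line: text before the match, fixed tag, text after
            $0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
        }
        print
    }' file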
    xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
    

    The above will work using any awk in any shell on any UNIX box. It relies on there being no < within each <f>...</f>, as your sample input indicates. If there can be, then include that in your example; the script can be tweaked to handle it by temporarily replacing each </f> with a newline, which cannot otherwise occur within a record:

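    $ awk '{
        gsub("</f>",RS)    # mark each closing tag with a newline (RS)
        # Repeat while some <f>...newline span still contains a "+"
        while ( match($0,/<f>[^\n]*\+[^\n]*\n/) ) {
            tgt = substr($0,RSTART,RLENGTH)
            gsub(/\+/,"*",tgt)
            $0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
        }
        gsub(RS,"</f>")    # restore the closing tags
        print
    }' file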
    xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
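
For comparison, the same "loop until nothing matches" idea can be sketched with sed instead of awk. This is only an illustration and not part of the answer above: the function name star_in_f is made up, sed is a swapped-in tool, and the sketch carries the same assumption as the first awk script, namely that no < occurs between <f> and </f>.

```shell
# Hypothetical helper (not from the original answer): replace every "+" with
# "*" inside <f>...</f>, reading stdin and writing stdout.
# Assumption: no "<" appears between <f> and </f>.
star_in_f() {
    # :again    defines a label
    # s|...|    rewrites ONE "+" inside one <f>...</f> per pass
    # t again   jumps back to the label if the substitution changed anything
    sed -e ':again' \
        -e 's|\(<f>[^<]*\)[+]\([^<]*</f>\)|\1*\2|' \
        -e 't again'
}

printf '%s\n' 'xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz' | star_in_f
# prints: xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
```

Each pass fixes a single +, so a tag with n plus signs takes n passes. The loop ends once no <f>...</f> contains a + any more, and +'s outside the tags are never touched because [^<]* cannot cross a tag boundary.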