awk unix text-processing

gsub: remove up to the first occurrence instead of the last occurrence of a given character in a line


I have an HTML file from which I am trying to remove the first occurrence of <...> on each line with the sub/gsub functions.

I used the awk regex .* to match anything between < and >. However, the first occurrence of > seems to be skipped (?), and the match runs up to the last > instead. I don't know if there is a workaround.

Sample input file.txt (x is appended so that the output lines are not empty):

<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x

code:

awk '{gsub(/^<.*>/,""); print}' file.txt

current output:

x
x
x

desired output:

fruit</div></td>x
banana</span>x
apple</td>x

Solution

  • With your shown samples, please try the following awk code. Your regex fails because .* is greedy: it matches as much as possible, so it runs to the last > on the line. Here, sub (which replaces only the first match) is used with ^<[^>]*>, which matches a leading <, then any characters other than > ([^>]*), then the first >. That match is replaced with an empty string, and the trailing 1 prints every line, edited or not.

    awk '{sub(/^<[^>]*>/,"")} 1' Input_file
    


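    2nd solution: Use awk's match function to match from the leading < up to the first occurrence of >, then print the rest of the line. match sets RSTART and RLENGTH to the start and length of the match, so substr($0,RSTART+RLENGTH) is everything after it.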

    awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file
    

    OR, in case you have lines that do not start with < and you want to print them as well, use the following:

    awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
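
To compare the behaviours side by side, here is a minimal sketch that runs the greedy regex, the negated-class regex, and the match-only variant on the same data. It assumes a POSIX shell and any standard awk. The variable names and the fourth input line (one that does not start with a tag) are made up for illustration and are not part of the original sample.

```shell
# Sample lines from the question, plus one hypothetical line without a leading tag.
input='<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x
plain text</td>x'

# Greedy: .* runs to the LAST > on the line, so only the trailing x survives.
greedy=$(printf '%s\n' "$input" | awk '{sub(/^<.*>/,"")} 1')

# Negated class: [^>]* cannot cross a >, so the match ends at the FIRST >.
first_tag=$(printf '%s\n' "$input" | awk '{sub(/^<[^>]*>/,"")} 1')

# match() with no fallback rule: lines that do not start with < are not printed.
match_only=$(printf '%s\n' "$input" | awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}')

printf '%s\n' "$greedy" "---" "$first_tag" "---" "$match_only"
```

The first group prints x three times and leaves the plain line untouched, the second strips only the leading tag from each line, and the third drops the plain line entirely, which is why the variant with next and a trailing 1 is needed when such lines must be kept.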