I have an HTML file from which I am trying to remove the first occurrence of <...> on each line with awk's sub/gsub functions. I used the regex operators ., * and + to match anything between < and >. However, the first occurrence of > seems to be escaped (?), and I don't know if there is a workaround.
sample input, file.txt (the trailing x is added so that the output lines are not empty):
<div>fruit</div></td>x
<span>banana</span>x
<br/>apple</td>x
code:
awk '{gsub(/^<.*>/,""); print}' file.txt
current output:
x
x
x
desired output:
fruit</div></td>x
banana</span>x
apple</td>x
With your shown samples, please try the following awk code. A simple explanation: it uses awk's sub function to substitute everything from the starting < up to and including the first > with an empty string in the current line ([^>]* matches any run of characters other than >, so the match stops at the first >). Finally, the 1 prints the line, whether it was edited or not.
awk '{sub(/^<[^>]*>/,"")} 1' Input_file
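To see why the original attempt removed everything up to the last >, a side-by-side comparison may help. This is only a sketch, assuming a POSIX shell; the helper names strip_greedy and strip_first are made up for illustration and are not part of awk.

```shell
# Hypothetical helpers, named here only for illustration.
# Greedy: .* takes the longest match, so it runs to the LAST > on the line.
strip_greedy() { awk '{sub(/^<.*>/,"")} 1'; }
# Negated class: [^>]* cannot cross a >, so the match ends at the FIRST >.
strip_first() { awk '{sub(/^<[^>]*>/,"")} 1'; }

printf '%s\n' '<div>fruit</div></td>x' | strip_greedy   # prints: x
printf '%s\n' '<div>fruit</div></td>x' | strip_first    # prints: fruit</div></td>x
```

So the > is not being escaped; awk regexes simply pick the longest possible match, and the negated bracket expression is what limits it to the first tag.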
2nd solution: Using awk's match function, match from the 1st occurrence of < up to the 1st occurrence of > and print the rest of the line.
awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}' Input_file
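For reference, match sets RSTART to the 1-based position where the match begins and RLENGTH to its length, so RSTART+RLENGTH is the index of the first character after the tag. A quick sketch that prints both values for one of the sample lines (the printed fields are added purely for illustration):

```shell
# Show RSTART, RLENGTH and the remainder of the line after the first tag.
printf '%s\n' '<span>banana</span>x' |
awk 'match($0,/^<[^>]*>/){print RSTART, RLENGTH, substr($0,RSTART+RLENGTH)}'
# prints: 1 6 banana</span>x
```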
OR, in case you have lines that do not start with < and you want to print them as well, use the following:
awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1' Input_file
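The difference between the two match variants only shows up on lines without a leading tag. The sketch below assumes a small made-up input (the "plain text line" is not from the question) and runs both commands on it:

```shell
# Made-up input: two lines from the question plus one line with no leading tag.
input='<div>fruit</div></td>x
plain text line
<br/>apple</td>x'

# Without next/1: lines that do not start with a tag are dropped.
printf '%s\n' "$input" | awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH)}'
# fruit</div></td>x
# apple</td>x

# With next and a trailing 1: non-matching lines are printed unchanged.
printf '%s\n' "$input" | awk 'match($0,/^<[^>]*>/){print substr($0,RSTART+RLENGTH);next} 1'
# fruit</div></td>x
# plain text line
# apple</td>x
```

The next statement skips the trailing 1 for lines that were already printed, which avoids printing matching lines twice.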