linux bash shell awk

Using AWK to remove HTML tags from text


I want to remove every HTML tag with AWK, using this regex: /[<.*.>]/, wherever it matches in any field. I've been trying to make it work with sub or substr, but I can't work out the correct logic.

Input text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation<br/><div style="margin-top:6px"><b>veniam:</b></div><br/><div style="margin-top:6px"><b>Confort:< /b></div>Comenzi volan; Cruise-control; Servodirectie;<br/>

Expected Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation veniam: Confort: Comenzi volan; Cruise-control; Servodirectie;


Solution

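  • If you're not really parsing HTML and just want to remove everything between each <...> pair in a text file, you can do it with GNU awk, which supports a multi-character (regexp) RS. Each tag is treated as a record separator, and the empty ORS joins the remaining text back together: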

    $ awk -v RS='<[^>]+>' -v ORS= '1' file
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;
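The command above relies on GNU awk and removes each tag without leaving anything in its place, whereas the expected output in the question has a space where tags used to be. As a rough sketch only, assuming every tag starts and ends on the same line, a gsub-based variant that should also work in other POSIX awks could look like this. The function name strip_tags is hypothetical, chosen just for illustration:

```shell
# strip_tags is a hypothetical helper name, not part of awk or any library.
# It reads from stdin or from the files given as arguments.
strip_tags() {
    awk '{
        gsub(/<[^>]+>/, " ")           # replace every <...> span with one space
        gsub(/[ \t]+/, " ")            # collapse runs of blanks into one space
        sub(/^ /, ""); sub(/ $/, "")   # trim a leading and a trailing space
        print
    }' "$@"
}

# Shortened sample in the style of the input from the question
printf '%s\n' 'exercitation<br/><div style="margin-top:6px"><b>veniam:</b></div>Comenzi volan;<br/>' | strip_tags
# prints: exercitation veniam: Comenzi volan;
```

Because this works line by line, a tag that spans a line break would not be matched. The RS-based GNU awk command does not have that limitation, since it treats the tag itself as the record separator.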