regex, sed, html-parsing

Delete HTML comment tags using regexp


This is what my text (HTML) file looks like:

    <!--
     |                                |
     |  This is a dummy comment       |
     |      please delete me          |
     |         asap                   |
     |                                |
      ________________________________
     | -->

    this is another line 
    in this long dummy html file...
    please do not delete me

I'm trying to delete the comment using sed:

    cat file.html | sed 's/.*<!--\(.*\)-->.*//g'

It doesn't work :( What am I doing wrong?

Thank you very much for your help!


Solution

  • patrickmdnet has the correct answer. Here it is on one line using extended regex:

    cat file.html | sed -e :a -re 's/-->/\x00/g;s/<!--[^\x00]*\x00//g;/<!--/N;//ba'
    

    Here is a good resource for learning more about sed. This sed command is an adaptation of one-liner #92:

    http://www.catonmat.net/blog/sed-one-liners-explained-part-three/

    2024 edit: sed does not support the non-greedy *? quantifier, so the command now first replaces every end-of-comment tag with a null character and then uses a negated character set to get non-greedy matching. This will break if your file contains any '-->' that you'd like to keep. Unless this is a quick-and-dirty one-liner, you shouldn't be using sed for this. Use an HTML parser! That's what they're made for.
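
For anyone who wants to follow the "use an HTML parser" advice, here is a minimal sketch using Python's standard-library html.parser module. It is not part of the original answer. The names CommentStripper and strip_comments are made up for this illustration. The idea is to re-emit everything the parser sees except comments. It is not a byte-for-byte round trip: end tags come back in lowercase, entities written without a trailing semicolon gain one, and marked sections such as CDATA are not handled.

```python
from html.parser import HTMLParser


class CommentStripper(HTMLParser):
    """Hypothetical helper: re-emit the parsed HTML, skipping comments."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        # instead of decoding them into plain characters.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>; emit the raw text once.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_decl(self, decl):
        # For example <!DOCTYPE html>
        self.parts.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.parts.append(f"<?{data}>")

    def handle_comment(self, data):
        # Comments are dropped, including multi-line ones.
        pass


def strip_comments(html):
    """Return html with every <!-- ... --> comment removed."""
    parser = CommentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

To use it on the file from the question, read file.html into a string, pass it to strip_comments, and write the result back out. Because the parser tracks where a comment starts and ends, a literal '-->' elsewhere in the text is left alone, unlike with the sed one-liner.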