linux ubuntu sed grep computer-forensics

How to extract content between tags in HTML using the grep command


I want to write a grep command that extracts the content between h1 tags, irrespective of class and other attributes.

I tried

 grep -o '>.*</h1>' Email.txt

But it returned only three elements.


Solution

  • With GNU grep, you may use

    grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' Email.txt
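
As a rough illustration, the command can be tried on a small made-up input. The file name sample.html and its contents are assumptions for this sketch, not the asker's Email.txt, and the -P option requires a GNU grep built with PCRE support.

```shell
# Build a small, hypothetical test file (sample.html is an assumed name).
printf '%s\n' \
  '<h1>First heading</h1>' \
  '<h1 class="title" id="top">Second heading</h1>' \
  '<p>Not a heading</p>' \
  '<h1>Third</h1><h1 class="x">Fourth</h1>' > sample.html

# Lazy PCRE pattern: prints only the text between each opening h1 tag
# (with or without attributes) and the nearest closing </h1>.
grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' sample.html

# For contrast, the greedy pattern from the question: .* runs to the last
# </h1> on the line, so two headings on one line come out as a single match.
grep -o '>.*</h1>' sample.html
```

On this sample, the first command should list the four heading texts one per line, while the second yields three matches that still include the leading > and the closing tags.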
    

    The -P option enables the PCRE regex engine, and the pattern will match