html bash sed grep

Extract Text between HTML tags with sed or grep


I have a problem. I want to extract two parts of this HTML into variables using the sed or grep command. How can I extract both of them?

test.html:

<html>
 <body>
  <div id="foo" class="foo">
   Some Text.
    <p id="author" class="author">
     <br>
     <a href="example.com">bar</a>
    </p>
  </div>
 </body>
</html>

script.sh

#!/bin/bash

author=$(sed 's/.*<p id="author" class="author"><br><a href="*">\(.*\)<\/a><\/p>.*/\1/p' test.html)
quote=$(sed 's/.*<div id="foo" class="foo">\(.*\)<\/div>.*/\1/p' test.html)

In the end I want only the text in the variables, without the HTML tags. But my script doesn't work.


Solution

  • The code:

    text="$(sed 's:^ *::g' < test.html | tr -d \\n)"
    author=$(sed 's:.*<p id="author" class="author"><br><a href="[^"]*">\([^<]*\)<.*:\1:' <<<"$text")
    quote=$(sed 's:.*<div id="foo" class="foo">\([^<]*\)<.*:\1:' <<<"$text")
    echo "'$author' '$quote'"
    

    How it works:

    1. $text is assigned an unindented, single-line copy of test.html. Note that : is used as the sed delimiter instead of /: any character other than a backslash or a newline can serve as the delimiter, and since the text we are parsing contains slashes, choosing : means we don't have to escape them with backslashes when constructing the regex.
    2. $author is assumed to be between <p id="author" class="author"><br><a href="[^"]*"> (where [^"]* means «zero or more characters other than "») and any tag that comes next.
    3. $quote is assumed to be between <div id="foo" class="foo"> and any tag that comes next.
    4. The rather obscure construct <<<"$text" is a so-called here-string, which is almost equivalent to putting echo "$text" | in front of the command.
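
The steps above can be sketched as one self-contained script. This is only a sketch under a few assumptions: it recreates the question's test.html in a temporary directory (assuming mktemp is available), the helper name extract_after is hypothetical and not part of the original answer, and it pipes printf into sed instead of using the here-string, relying on the near-equivalence described in step 4.

```shell
#!/bin/bash
# Sketch: same extraction as the answer, with the here-string replaced by a pipe.

# Recreate the question's test.html in a temporary directory (assumption).
workdir=$(mktemp -d)
cat > "$workdir/test.html" <<'EOF'
<html>
 <body>
  <div id="foo" class="foo">
   Some Text.
    <p id="author" class="author">
     <br>
     <a href="example.com">bar</a>
    </p>
  </div>
 </body>
</html>
EOF

# Step 1: strip leading spaces from every line, then join all lines into one.
text=$(sed 's:^ *::' "$workdir/test.html" | tr -d '\n')

# extract_after (hypothetical helper): print the text between the opening
# pattern $1 (a sed basic regex without ':') and the next tag found in $2.
# With -n and the p flag, nothing is printed if the pattern does not match.
extract_after() {
    printf '%s\n' "$2" | sed -n "s:.*$1\([^<]*\)<.*:\1:p"
}

# Steps 2 and 3: the same two patterns as in the answer.
author=$(extract_after '<p id="author" class="author"><br><a href="[^"]*">' "$text")
quote=$(extract_after '<div id="foo" class="foo">' "$text")

echo "'$author' '$quote'"

# Clean up the temporary directory.
rm -r "$workdir"
```

Using sed -n together with the p flag is a deliberate difference from the answer's commands: if the markup changes and a pattern no longer matches, the variable ends up empty instead of holding the whole page.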