Tags: html, bash, curl, w3m

How do I extract the content between certain HTML tags from a webpage in bash?


So far I am using curl along with w3m and sed to extract portions of a webpage, such as <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). The problem is that the way I am doing it right now is really slow.

# fetch the page and keep only the lines between the two markers
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
# render the trimmed HTML fragment as plain text
w3m -dump file.html > file2.txt

These two lines above are really slow because curl has to first save the whole webpage to a file so sed can parse it, and then w3m parses that file and saves the result into another file. I just want to simplify this code; a single-pipeline version without the temporary files is sketched below the example. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. So, for example, if I wanted to extract something from a webpage (www.badexample.com <--- not actually the link) with this content:

<title>blah......blah...</title>
            <body>
                 Some text I need to extract
            </body>
 more stuffs

Is there a program that lets me specify, as a parameter, the tags whose content to extract? So I would run something like someprogram <body></body> www.badexample.com and it would extract only the content between those tags?
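For reference, the two commands from the question can also be collapsed into a single pipeline so that nothing is written to intermediate files. This is only a sketch, reusing the placeholder URL and sed markers from above and assuming a w3m build that accepts HTML on stdin via -T text/html:

    # Same logic as the two commands above, but as one pipe with no temporary HTML file.
    curl -sL "http://www.somewebpage.com" \
      | sed -n -e '\:<article class=:,\:<div id="below">: p' \
      | w3m -dump -T text/html > file2.txt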


Solution

  • You can use a Perl one-liner for this:

    perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' http://www.example.com/ title
    

    Instead of the HTML tag, you can pass the whole regex as well (the /s modifier lets . match newlines, so multi-line elements such as <body> still match; single quotes around the -e program keep bash from expanding $ARGV itself):

    perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
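
    If this is something you run repeatedly, the one-liner can be wrapped in a small bash function; the name extract_tag is only a placeholder for this sketch, not part of the original answer:

    # Hypothetical wrapper: extract_tag TAG URL prints whatever sits between <TAG> and </TAG>.
    extract_tag() {
        perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' "$2" "$1"
    }

    # e.g.  extract_tag title http://www.example.com/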