So far I am using curl along with w3m and sed to extract portions of a webpage, like <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). The problem is that the way I am doing it right now is really slow.
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
w3m -dump file.html > file2.txt
These two lines above are really slow because curl first has to save the whole webpage into a file for sed to parse, and then w3m parses it and saves it into another file. I just want to simplify this code. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. So, for example, if I wanted to extract something from a webpage (www.badexample.com <--- not actually the link) with this content:
<title>blah......blah...</title>
<body>
Some text I need to extract
</body>
more stuffs
Is there a program where I can specify the tags whose content should be extracted? So I would run someprogram <body></body> www.badexample.com and it would extract only the content inside those tags?
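First, a side note: you can at least drop the intermediate file by chaining your existing commands into one pipeline (a minimal sketch, assuming your w3m build accepts HTML on stdin via -T text/html):

curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' | w3m -dump -T text/html > file2.txt

That said, a single tool can do both the fetch and the extraction in one go.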
You can use a Perl one-liner for this:
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' http://www.example.com/ title

(Single quotes keep the shell from expanding $ARGV itself, and the /s modifier lets (.*?) match across newlines, which you need for multi-line tags like <body>.)
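To get exactly the someprogram TAG URL interface you described, you can wrap the one-liner in a small script (a minimal sketch; gettag is a made-up name, and it assumes perl with LWP::Simple is installed):

#!/bin/sh
# gettag: print the content between <TAG> and </TAG> of a fetched page.
# Usage: gettag TAG URL    e.g.: gettag body http://www.example.com/
tag=$1
url=$2
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' "$url" "$tag"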
Instead of the HTML tag, you can pass the whole regex as well:
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' 'http://www.example.com/' '<body>(.*?)</body>'
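And if you still want the rendered plain text that w3m -dump was giving you, pipe the extracted HTML straight into it instead of going through a temporary file (again assuming your w3m accepts -T text/html on stdin):

perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' 'http://www.example.com/' '<body>(.*?)</body>' | w3m -dump -T text/html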