I need a way to extract the main text from any webpage that displays an article, similar to the way Readability finds the main text on any site it is run on.
I'm using Ruby on Rails, so I think Hpricot is my best bet. Is what I'm looking for possible in Hpricot? Is there an example somewhere? Thanks for reading.
You certainly can use Hpricot to scrape content from any given HTML page.
Here is a step-by-step tutorial: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/
Hpricot is ideal for parsing a file with a known HTML structure using XPath expressions.
However, you will struggle to write anything generic that can read any web page and identify the main article text. I think you'd need at least some sort of rudimentary AI for that, which is well outside the scope of what Hpricot can do.
What you could do is write scraping code for each of the common HTML formats you want to handle (perhaps WordPress, Tumblr, Blogger, etc.), if there is such a set.
I am also sure you could come up with some heuristics for attempting it (which, judging by how Readability behaves, is what I guess they do; it seems to work far from perfectly).
First stab at a heuristic:
1) Identify a fixed set of tags which could be considered to be part of "the main block of text" (e.g. <p>, <br>, <img>, etc.).
2) Scrape the page and find the largest block of text on the page that contains only tags from (1).
3) Return text from (2) with tags from (1) removed.
Looking at the results of Readability, I reckon this heuristic would work about as well.
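The three steps can be sketched in plain Ruby. This is only a rough illustration under some loud assumptions: it uses regular expressions from the standard library rather than Hpricot or any real HTML parser, the names ALLOWED_TAGS, SPACING_TAGS and main_text are made up for this example, and the tag list is a guess that you would want to tune. Any tag outside the allowed set is treated as a block boundary, and the longest run of text between boundaries is returned.

```ruby
# Step 1: tags treated as part of "the main block of text".
# This list is an assumption for illustration, not a definitive set.
ALLOWED_TAGS = %w[p br img a b i em strong].freeze

# Allowed tags that should leave a space behind when removed,
# so that adjacent paragraphs do not run together.
SPACING_TAGS = %w[p br].freeze

# Matches an opening or closing tag and captures its name.
TAG = /<\/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>/

def main_text(html)
  # Script and style bodies are never article text: turn them into boundaries.
  cleaned = html.gsub(/<(script|style)\b.*?<\/\1\s*>/mi, "\0")
  # Drop comments and declarations such as <!DOCTYPE ...>.
  cleaned = cleaned.gsub(/<!--.*?-->|<![^>]*>/m, "")

  # Steps 2 and 3: allowed tags are removed, every other tag becomes
  # a NUL boundary marker between candidate blocks.
  marked = cleaned.gsub(TAG) do
    name = Regexp.last_match(1).downcase
    next "\0" unless ALLOWED_TAGS.include?(name)
    SPACING_TAGS.include?(name) ? " " : ""
  end

  # Normalise whitespace in each candidate block and keep the longest one.
  marked.split("\0")
        .map { |block| block.gsub(/\s+/, " ").strip }
        .max_by(&:length)
        .to_s
end

sample = '<div><a href="/">Home</a></div>' \
         '<div><p>Main story text.</p><p>More of it,<br>continued.</p></div>'
puts main_text(sample)
# prints "Main story text. More of it, continued."
```

With Hpricot you would presumably walk the parsed document instead of using regexes, which would cope better with malformed markup. Either way the heuristic is easy to fool: a long comment thread or a large footer can beat the article on length.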