I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm