
What algorithms could I use to identify content on a web page?


I have a web page loaded in the browser (i.e. both its DOM and its element positioning are accessible to me), and I want to find the block element (or a sorted list of such elements) that most likely contains the main content, meaning a continuous block of text. The goal is to exclude things like menus, headers, and footers.


Solution

  • My personal favorite is VIPS: A Vision-based Page Segmentation Algorithm. It splits the rendered page into visual blocks using the DOM structure together with layout cues.
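
For comparison, here is a much simpler baseline than VIPS. It is not VIPS and uses no positioning data: it is only a text-density heuristic that ranks block elements by how much non-link text they directly contain, on the assumption that menus and footers are mostly links. It is written in Python against the HTML source with the standard library's `html.parser`, not against a live browser DOM. The names `BlockScorer`, `rank_blocks`, and `BLOCK_TAGS`, and the scoring rule itself, are illustrative assumptions.

```python
from html.parser import HTMLParser

# Tags treated as candidate blocks (an assumption; adjust for your pages).
BLOCK_TAGS = {"div", "article", "section", "main", "td",
              "nav", "header", "footer", "aside", "ul", "ol"}


class BlockScorer(HTMLParser):
    """Counts text characters and link-text characters per block element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open candidate blocks
        self.blocks = []     # finished blocks with their counts
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "id": dict(attrs).get("id"),
                               "text": 0, "link": 0})

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.skip_depth or not self.stack:
            return
        # Text is credited only to the innermost open candidate block.
        block = self.stack[-1]
        block["text"] += n
        if self.link_depth:
            block["link"] += n


def rank_blocks(html):
    """Return (score, tag, id) tuples, highest non-link text count first."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    scored = [(b["text"] - b["link"], b["tag"], b["id"])
              for b in parser.blocks if b["text"] > 0]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Calling `rank_blocks(page_source)` gives the sorted list of candidates the question asks for. In a browser the same counting could be done by walking the DOM directly, and element sizes and positions could be added to the score, which is the kind of visual information VIPS relies on.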