web-scraping html-agility-pack nav navigationbar

How can I scrape a website for the nav menu only?


I'm building a program that scrapes a website. It looks at the entire website, keeps only the header and footer navigation menus, and then inserts new HTML elements (div, p, table, etc.) between the header and footer menus.

I'm looking for ideas on how to extract only the header and footer nav menus, and how to add code between the two.

I'm using HTML Agility Pack and have worked on a few methods.

Method 1:

In most cases, header and footer navigation menus consist mainly of links and contain very little text. I used a threshold on the ratio of text to links. If a node's text:links ratio was below the threshold, the node was considered a menu node and was kept. Any node whose text:links ratio was above the threshold was removed.
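
To make Method 1 concrete, here is a minimal sketch of the text:links heuristic. It is written in Python with only the standard library rather than HTML Agility Pack, so treat it as an illustration of the idea and not as the code I actually ran. The names (`TreeBuilder`, `text_and_links`, `looks_like_menu`), the definition of the ratio (words outside links divided by the number of `<a>` elements) and the default threshold of 2.0 are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not become the current parent.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; tolerant of stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def text_and_links(node, inside_link=False):
    """Return (words outside links, number of <a> elements) for a subtree."""
    inside_link = inside_link or node.tag == "a"
    words = 0 if inside_link else len(node.text.split())
    links = 1 if node.tag == "a" else 0
    for child in node.children:
        child_words, child_links = text_and_links(child, inside_link)
        words += child_words
        links += child_links
    return words, links


def looks_like_menu(node, threshold=2.0, min_links=1):
    """Assumed rule: a menu has fewer than `threshold` non-link words per link."""
    words, links = text_and_links(node)
    if links == 0 or links < min_links:
        return False  # a subtree without links cannot be a link menu
    return words / links < threshold
```

A caller would run `looks_like_menu` over `walk(parse(html))` and keep the nodes that pass. The weak point is the same one I ran into: the right threshold differs from site to site, and a large container that wraps both a menu and some prose can land on either side of it.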

Method 1 worked for some sites, but not for others, so I ditched it.

Method 2:

I searched each node for an id or class attribute that included "nav" or "menu". The match was case-insensitive, and "nav" or "menu" could be surrounded by any combination of characters. That way, it would include ids and classes such as "bottomNav", "navRight1", "LeftMenu2", etc. If the id or class contained either "nav" or "menu", the node was kept. If neither the node's attributes nor the attributes of any of its descendants contained either term, the node was deleted.
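
The matching rule in Method 2 can be sketched like this. Again this is Python with the standard library only, not HTML Agility Pack, and the names `MENU_HINT`, `MenuHintFinder` and `find_menu_candidates` are made up for the example. It only collects the elements whose own id or class matches; pruning everything else from the tree is left out.

```python
import re
from html.parser import HTMLParser

# Case-insensitive, unanchored: matches "bottomNav", "navRight1", "LeftMenu2", ...
MENU_HINT = re.compile(r"nav|menu", re.IGNORECASE)


class MenuHintFinder(HTMLParser):
    """Collects (tag, attribute name, attribute value) for every matching element."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name in ("id", "class"):
            value = attrs.get(name) or ""  # valueless attributes arrive as None
            if MENU_HINT.search(value):
                self.matches.append((tag, name, value))
                break  # record each element once, even if id and class both match


def find_menu_candidates(html):
    finder = MenuHintFinder()
    finder.feed(html)
    finder.close()
    return finder.matches
```

Because the pattern is unanchored, it also matches values such as "unavailable", which contains "nav". That kind of false positive, together with sites that simply use other names for their menus, is one plausible reason the approach works on some sites and not on others.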

Again, method 2 worked for some sites, but not for others.

For the sites where either of these methods worked, I still wasn't able to put new HTML between the two menus, because I had no way of telling where the header menu ended and where the footer menu began.

I'm just looking for other ideas on how to scrape only the header and footer navigation menus from a website, and insert new HTML between the two.


Solution

  • Other than looking for specific elements or element classes (header, nav, ...), you can try to look at the problem in a different way: fetch several pages from the same website and compare them to find the structure they have in common.

    This common structure should consist mostly of headers, footers, navbars and other elements that stay more or less constant across the pages of each website.

    A final step might be to look in this common structure for small gaps caused by headers/footers that vary depending on context, as opposed to large gaps caused by different (main) content, and scrape their possible values from the largest set of pages you can fetch from each website.
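
As a rough illustration of the common-structure idea, the sketch below compares two pages of the same site with Python's `difflib.SequenceMatcher` and splits the shared tokens at the largest unmatched gap, which is assumed to be the main content. Everything here is an assumption made for the example: the tokenization (tag name plus id and class, and whitespace-normalized text), the use of only two pages, and the names `tokenize` and `split_header_footer`. A real version would compare many pages and would need to map tokens back to nodes.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flattens a page into a list of comparable tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only id and class are kept, so differing hrefs do not break a match.
        self.tokens.append(f"<{tag} id={attrs.get('id')} class={attrs.get('class')}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.tokens.append(text)


def tokenize(html):
    tokenizer = Tokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    return tokenizer.tokens


def split_header_footer(page_a, page_b):
    """Return (header tokens, footer tokens) shared by both pages."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # The final block returned by get_matching_blocks() is a zero-size sentinel.
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]
    if len(blocks) < 2:
        # Nothing to split on: treat whatever is shared as header.
        return [tok for m in blocks for tok in a[m.a:m.a + m.size]], []

    def gap_after(i):
        # Unmatched tokens, counted in both pages, between block i and block i + 1.
        cur, nxt = blocks[i], blocks[i + 1]
        return (nxt.a - (cur.a + cur.size)) + (nxt.b - (cur.b + cur.size))

    split = max(range(len(blocks) - 1), key=gap_after)
    header = [tok for m in blocks[:split + 1] for tok in a[m.a:m.a + m.size]]
    footer = [tok for m in blocks[split + 1:] for tok in a[m.a:m.a + m.size]]
    return header, footer
```

The largest gap doubles as the insertion point for new HTML, since it marks where the shared header part ends and the shared footer part begins. Note that the shared part can also pick up template tags around the content (a wrapping div, for instance), so the split is a starting point to refine, not a clean header/footer boundary.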