dom extract web-crawler jaxer

Is there a server-side DOM engine suitable for crawling?


I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. That is really helpful when crawling web pages to parse HTML and extract data, but the project seems to be dead.

Is there a newer technology that is useful for extracting information?


Solution

  • In the past I have used Selenium RC to control a real web browser (usually Firefox) from code, so that the browser itself loads and parses the websites.

    The cool thing about this is that you're mostly coding in a language you're comfortable with, be it Perl, Ruby, or C#. But to use the full power of Selenium you still need to know and write JavaScript.
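
As a rough illustration of that split between host-language code and in-browser JavaScript, here is a minimal sketch. It makes several assumptions that are not in the answer above: it uses Selenium's Python WebDriver bindings (the successor to Selenium RC) rather than Selenium RC itself, it assumes Firefox and geckodriver are installed, and the names `EXTRACT_LINKS_JS`, `extract_links`, and `crawl_example`, as well as the example URL, are made up for illustration.

```python
# JavaScript that runs inside the browser page (illustrative, not from the answer).
# It collects the text and resolved URL of every link in the live DOM.
EXTRACT_LINKS_JS = """
return Array.from(document.querySelectorAll('a[href]')).map(function (a) {
    return {text: a.textContent.trim(), href: a.href};
});
"""


def extract_links(driver, url):
    """Load `url` in the browser behind `driver` and return its links.

    `driver` only needs `get` and `execute_script` methods, which
    Selenium WebDriver objects provide.
    """
    driver.get(url)
    return driver.execute_script(EXTRACT_LINKS_JS)


def crawl_example():
    """Open Firefox, print the links on one page, then close the browser."""
    # Imported here so extract_links can be reused without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        for link in extract_links(driver, "https://example.com/"):
            print(link["href"], link["text"])
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `crawl_example()` starts a real browser, so the page is parsed (and its scripts are run) exactly as a user would see it; the Python side only decides which pages to visit and what to do with the extracted data.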