I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. However, the project seems to be dead. Something like it would be really helpful for crawling web pages, parsing the HTML, and extracting data.
Is there a newer technology that is useful for extracting information from web pages this way?
What I've done in the past is use Selenium RC to control a real web browser (usually Firefox) from code, so that websites are loaded and parsed by the browser itself.
The cool thing about this is that you're mostly coding in a language you're comfortable with, be it Perl, Ruby, or C#. But to use the full power of Selenium you still need to know and write JavaScript.
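To give a rough idea of what that mix of host language and JavaScript looks like, here is a minimal sketch in Python. Note that it uses the Selenium WebDriver bindings, the successor to Selenium RC, rather than the RC API I used back then. The function name extract_links and the example.com URL are made up for illustration, and it assumes the selenium package, Firefox, and geckodriver are installed.

```python
def extract_links(driver, url):
    """Load `url` in a real browser and return [text, href] pairs for every link.

    `driver` is expected to be a Selenium WebDriver instance.
    The function name is hypothetical, chosen for this example only.
    """
    driver.get(url)
    # This snippet runs inside the page, after the browser has parsed the HTML.
    # Selenium converts the returned JavaScript array into a Python list.
    return driver.execute_script(
        "return Array.from(document.querySelectorAll('a'))"
        ".map(function (a) { return [a.textContent.trim(), a.href]; });"
    )


def main():
    # Imported here so the helper above can be read without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes Firefox and geckodriver are available
    try:
        # Placeholder URL; replace it with the page you want to crawl.
        for text, href in extract_links(driver, "https://example.com/"):
            print(text, href)
    finally:
        driver.quit()  # always close the browser, even if extraction fails
```

Calling main() opens a Firefox window, loads the page, and prints each link. The control flow is plain Python, but the actual extraction is the JavaScript string passed to execute_script, which is the part where you still need to know JavaScript.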