php web-crawler

How can I crawl a page for data as a logged-in user if I have the login credentials?


I need to scrape some data from a page that doesn't belong to my domain. I know how to load the page server side and parse it in various languages (ASP.NET, PHP, etc.). However, I need to scrape the page as it appears to a logged-in user.

For example the page would have an HTML tag with an attribute set to the user ID like so:

<div id="profile" data-userid="1234"></div>

The data-userid attribute only contains an ID when the user is logged in. Is it possible to log in to a site from the server side? (I do have login credentials.)

Thanks,

Thomas


Solution

  • Yes. Your crawler needs an HTTP component that is session-aware: you log on programmatically, and with each subsequent request you supply the cookie you received from the logon action. Test suites often include such components; see, for example, SimpleTest.
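
As a rough illustration of the approach described above, here is a minimal sketch using only the Python standard library. It assumes the site uses a plain form-based login that sets a session cookie; the form field names ("username", "password") and the helper names are hypothetical, and a real site may also require a CSRF token or extra headers. The same pattern applies in PHP with cURL's cookie jar options (CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE).

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class ProfileParser(HTMLParser):
    """Collects the data-userid attribute of <div id="profile">."""

    def __init__(self):
        super().__init__()
        self.user_id = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("id") == "profile":
            # An empty or missing attribute is treated as "not logged in".
            self.user_id = attributes.get("data-userid") or None


def extract_user_id(html):
    """Return the user ID found in the page, or None if there is none."""
    parser = ProfileParser()
    parser.feed(html)
    return parser.user_id


def build_session():
    """Create an opener that stores cookies and resends them on each request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def log_in(opener, login_url, username, password):
    """POST the credentials; the session cookie ends up in the opener's jar.

    The field names below are assumptions; inspect the real login form.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    with opener.open(login_url, data=form, timeout=30) as response:
        return response.status


def fetch_page(opener, page_url):
    """GET a page through the same opener, so the login cookie is sent."""
    with opener.open(page_url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_user_id(login_url, page_url, username, password):
    """Log in, then fetch the page and read data-userid from it."""
    opener, _jar = build_session()
    log_in(opener, login_url, username, password)
    return extract_user_id(fetch_page(opener, page_url))
```

To use it, call scrape_user_id with the site's login URL, the URL of the page to scrape, and your credentials. The key point is that every request goes through the same opener, so the cookie set at login is sent automatically with each later request.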