crawler4j

How can I make crawler4j fetch pages linked by a relative path?


With crawler4j, I can fetch a page that is linked by a complete URL, such as:

<a href='http://www.domain.com/thelink'>

However, I found that if the link is relative, such as:

<a href='/thelink'>

crawler4j skips this link (page), and I never even get a chance to see it in the shouldVisit(Page referringPage, WebURL url) method.

I do not see any configuration option for this on the crawler4j GitHub page. Am I missing something?


Solution

  • As described in the related issue on the project page, this behaviour seems to be caused by the fact that this specific web page renders much of its content using AJAX / JavaScript.

    However, crawler4j cannot render JavaScript-generated content, as it does not include a JavaScript engine for this purpose, so links that only exist after scripts have run are not discovered. In addition, the contents of script tags are not yet scanned for URLs.
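One way to check whether a missing link falls into this category is to look at the raw HTML the server returns, before any script runs. The following is a minimal sketch using only the Java standard library, not crawler4j's own parser: the class name `StaticLinkCheck`, the method `extractLinks`, and the sample markup are hypothetical, and the regular expression is only a rough stand-in for a real HTML parser. It collects the `href` values of anchor tags that are literally present in the markup and resolves relative ones against the page URL. If `/thelink` does not show up here, it is most likely added by JavaScript.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticLinkCheck {

    // Matches href='...' or href="..." inside an <a ...> tag.
    // Assumption: quoted attribute values only; this is a quick check, not a parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the absolute form of every anchor href found in the static HTML.
    // URI.resolve throws IllegalArgumentException for hrefs that are not valid URIs.
    static List<String> extractLinks(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Relative hrefs such as "/thelink" are resolved against the page URL;
            // absolute hrefs are returned unchanged.
            links.add(base.resolve(m.group(2)).toString());
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page: two static anchors and one link that only a script would add.
        String html = "<a href='http://www.domain.com/thelink'>absolute</a>"
                + "<a href='/other'>relative</a>"
                + "<script>addLink('/fromscript');</script>";
        for (String link : extractLinks(html, "http://www.domain.com/page")) {
            System.out.println(link);
        }
    }
}
```

In this sample, the two static anchors are reported as absolute URLs, while `/fromscript` is not reported at all, because it never appears as an anchor tag in the markup. A link in that position would not be found by a crawler that works on the static HTML only.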