With Crawler4j, I can fetch a page linked by a complete URL, such as:
<a href='http://www.domain.com/thelink'>
However, I found that if the link is relative, such as:
<a href='/thelink'>
Crawler4j skips the linked page, and I never even get to see the link in the shouldVisit(Page referringPage, WebURL url) method.
I do not see any configuration for this on the Crawler4j GitHub page. Am I missing something?
As described in the related issue on the project page, this behaviour seems to be caused by the fact that this specific web page renders much of its content using AJAX / JavaScript.
crawler4j, however, cannot render JavaScript-generated content, as it does not include a JavaScript engine for this purpose. In addition, the contents of script tags are not yet scanned for URLs.
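If the missing links appear as plain string literals inside the page's script blocks, one possible workaround is to scan those blocks yourself. The following is only a minimal sketch using the Java standard library; it is not part of crawler4j, the class and method names (ScriptLinkSketch, extractScriptLinks) are made up for illustration, and the regular expressions are a rough heuristic that will miss URLs assembled at runtime.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a crawler4j class.
final class ScriptLinkSketch {

    // Captures the body of each <script>...</script> element.
    private static final Pattern SCRIPT_BODY = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Heuristic: quoted strings that start with http(s):// or with "/".
    private static final Pattern QUOTED_LINK = Pattern.compile(
            "[\"']((?:https?://|/)[^\"'\\s<>]+)[\"']");

    // Returns absolute URLs found in script blocks, resolved against pageUrl.
    static List<String> extractScriptLinks(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher script = SCRIPT_BODY.matcher(html);
        while (script.find()) {
            Matcher link = QUOTED_LINK.matcher(script.group(1));
            while (link.find()) {
                try {
                    // Relative paths such as /thelink become absolute here.
                    links.add(base.resolve(link.group(1)).toString());
                } catch (IllegalArgumentException e) {
                    // Not a syntactically valid URI reference; skip it.
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><script>var next = '/thelink';</script></body></html>";
        System.out.println(extractScriptLinks(html, "http://www.domain.com/index.html"));
    }
}
```

In a crawler4j crawler, the raw HTML for such a scan should be available in visit(Page page) through HtmlParseData.getHtml(). How the discovered URLs are then fed back into the crawl is left open here.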