Tags: web-crawler, crawler4j

Crawler4j "null, while processing (link)" error


I have a small project in which I'm trying to crawl a few million pages (I don't have a precise estimate of the number) using crawler4j 4.1. I'm using the BasicCrawler example with only a few minor changes. A little while after I start crawling, the crawler4j log shows the following error appearing constantly:

[Crawler 1] ERROR edu.uci.ics.crawler4j.crawler.WebCrawler - null, while processing: http://www.somelink.com.

I've tried raising the politeness delay to 1000 milliseconds, and even tried running the crawler with a single thread, but the same thing kept happening.
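For reference, this is roughly how I set the politeness delay and thread count. This is a sketch against the crawler4j 4.1 controller API; the storage folder and seed URL are placeholders, and `BasicCrawler` is the example crawler class:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ControllerSketch {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
        config.setPolitenessDelay(1000);             // wait 1000 ms between requests

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/"); // placeholder seed
        controller.start(BasicCrawler.class, 1);       // single crawler thread
    }
}
```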

On top of that, over long runs crawler4j seems to hang randomly, and I have to stop and restart it every time it freezes.

Any idea what might be causing this? And does crawler4j reschedule unreachable links back into the frontier or not?

Thanks


Solution

  • Although I'm not sure what causes this error, I kept track of all the crawled links and those still in the frontier, and I can confirm two things:

    1. Unreachable links are rescheduled in the frontier, and the crawler will try to visit them again.
    2. The freezing only happens on pages that exceed the maximum download size. As a workaround, I increased the download size limit and added some extensions to the discard list. It's not an optimal solution, but it did the trick for me.
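    For the first part of the workaround, crawler4j's CrawlConfig exposes setMaxDownloadSize(...). For the second part, extensions can be discarded in shouldVisit() with a regex, as in the BasicCrawler example. Below is a minimal, self-contained sketch of such a filter; the extension list is illustrative, not the exact list I used:

    ```java
    import java.util.regex.Pattern;

    public class ExtensionFilter {
        // URLs whose path ends in one of these extensions are discarded
        // instead of being fetched. Illustrative list, not exhaustive.
        private static final Pattern DISCARD = Pattern.compile(
                ".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf|exe|iso)$",
                Pattern.CASE_INSENSITIVE);

        public static boolean shouldVisit(String url) {
            return !DISCARD.matcher(url).matches();
        }

        public static void main(String[] args) {
            System.out.println(shouldVisit("http://example.com/page.html")); // true
            System.out.println(shouldVisit("http://example.com/video.mp4")); // false
        }
    }
    ```

    In the real crawler this check goes inside the WebCrawler subclass's shouldVisit() override, so oversized binary files are never handed to the fetcher in the first place.
    
    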