I have a small project in which I'm trying to crawl a few million pages using crawler4j 4.1 (I don't have a definite estimate of the number). I'm using the BasicCrawler example with only a few minor changes. A little while after I start crawling, the crawler4j log shows the following error appearing constantly:
[Crawler 1] ERROR edu.uci.ics.crawler4j.crawler.WebCrawler - null, while processing: http://www.somelink.com.
I've tried raising the politeness delay to 1000 milliseconds, and I even tried running the crawler with a single thread, but the same thing kept happening.
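For reference, this is roughly how I set those options (a minimal sketch of my setup; the storage folder path and seed URL are placeholders, and `BasicCrawler` is the unmodified example class):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.examples.basic.BasicCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerSetup {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
        config.setPolitenessDelay(1000);             // 1000 ms between requests to the same host

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.somelink.com/"); // placeholder seed
        controller.start(BasicCrawler.class, 1);        // single crawler thread
    }
}
```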
On top of that, over long runs crawler4j seems to hang randomly, and I have to stop and restart it every time it freezes.
Any idea what might be causing this? And does crawler4j reschedule unreachable links back into the frontier, or not?
Thanks
Although I'm not really sure what is causing this error, I tried to keep track of all the crawled links and those still in the frontier. I can confirm two things.